
CS F441
Data Visualization
SUMANTA PATTANAIK
 Sumanta Pattanaik: Visiting Professor
 Email: [email protected] (…pilani.ac.in)

Instructors
 Tathagat Ray: Professor, Hyderabad Campus
 Vinayak Naik: Professor, Goa Campus
 Sundaresan Raman: Professor, Pilani Campus
Course Particulars
 What will be covered?
 Principles and techniques for interactive data visualization that are useful for presenting and analyzing information associated with the data.
 Algorithmic aspects of developing interactive visualization.
 Students will receive practical experience of creating visualizations using
 Python-based tools for data analysis, and Plotly's Python graphing library and Matplotlib for data visualization.
Course Particulars
 Text book: Fundamentals of Data Visualization by Claus O. Wilke
 Online link: https://clauswilke.com/dataviz/

Prerequisite
 Programming Knowledge: Expertise in an object-based language is essential.
 Computer Programming and Data Structures, OO Programming.
 NOTE: This class has a strong programming component. All assignments and the project will be done using Python, Plotly and Matplotlib.
General Policies
 BITS Policies
 Students are expected to familiarize themselves with and follow the BITS Rules of Conduct.
 Talk to your local faculty instructors.
 Plagiarism in projects is discouraged.
 Penalties can include a failing grade in the course, or suspension, or … as per BITS disciplinary rules.
General Policies
 Evaluation Policies
 Regular Visualization assignments [20%]: Data cleaning and Visualization
 Regular (weekly) Quizzes [20%]
 Mid Term [20%]
 Comprehensive [30%]
 Final Project [10%]
General Policies
 Assignment Policies
 Assignments must be turned in by the deadline, mostly set to 11:55 pm of the due date.
 Programming assignments must be carried out using Jupyter notebook.
 Assignment submission is through Google Classroom (submit the assignment_xx.ipynb file).
 Programming in Python; visualization using Plotly and Matplotlib.
 Compute Platform: any computer/OS running Python and Jupyter.
 Students will work on the assignments independently.
General Policies
 Quizzes: 20 points (open book)
 Will be conducted weekly, online, during class hours.
 Questions will be from the material covered the previous week.
 Midterm: 20 points (closed book)
 Check your campus calendar for the date/time.
 Comprehensive: 30 points (closed book)
 Check your campus calendar for the date/time.
 Final Project: 10 points (open book)
 Final Project can be a group project.
 The project can be chosen in discussion with the local instructors and is presented by the last class day.
Topics
 Data visualization overview
 Data Manipulation and Data Processing
 Data visualization and Visual Analysis
 Perceptual and design principles for effective data visualization
 Interaction: Concepts and techniques
 High dimensional data visualization
Rough plan for the next few classes
 Lectures
 Overview
 Getting started with Python on Jupyter
 Getting started with Plotly, Matplotlib/Seaborn
 Assignments will be mostly on getting the class familiarized with
 Python, NumPy and Pandas
 Plotly and Matplotlib for visualization
Get Started
 Python setup: https://www.python.org/about/gettingstarted/
 Working with Jupyter Notebook: https://realpython.com/jupyter-notebook-introduction
Questions?
Overview
What is Data Visualization?
 Visualization is the process of understanding something by creating its mental picture.
 Data Visualization: understanding the data by creating its mental picture.
 Any technique that helps in creating that mental picture can be called data visualization.
Table for Visualization
 Example: CECS Graduate Student Diversity
 If you want to show the exact amount of every value in your data, a table might be your best solution.

Anatomy of a Table
 10 primary components.
Source: “Tables”, Chapter 11, Better Data Visualization.
Table for Visualization
 What about the following data?

What is Data Visualization?
 The human brain is better at understanding visual imagery than numbers or words.
Brain and Visual Information
 Anecdotal evidence: People remember:
 [https://blog.hubspot.com/agency/science-brains-crave-infographics]
 Verified evidence:
 50% of the brain is devoted to visual processing.
 A majority of sensory receptors are in our eyes.
 The brain can identify images seen for as little as 13 milliseconds.
 [https://news.mit.edu/2014/in-the-blink-of-an-eye-0116]
 Visualization through visual imagery has been an effective way to communicate both abstract and concrete ideas since the dawn of humanity. [wiki]
What is data visualization?
 Data visualization is the process of creating a visual representation of the data for accelerating its understanding.
 Hence, the process of creating the visual form has become synonymous with data visualization.

Data Visualization
 “The greatest value of a visual (picture) is when it forces us to notice what we never expected to see.”
 [John W. Tukey, author of Exploratory Data Analysis]
Value of Data Visualization
Anscombe’s Quartet
 Sampled data sets from “Graphs in statistical analysis” by F. J. Anscombe, American Statistician, 27, 17–21 (1973).
 See: Chapter 1, Better Data Visualization.

Value of Data Visualization
Anscombe’s Quartet
 Visualization of the sampled data sets from “Graphs in statistical analysis” by F. J. Anscombe, American Statistician, 27, 17–21 (1973).
 See more from: https://dabblingwithdata.wordpress.com/2017/05/03/the-datasaurus-a-monstrous-anscombe-for-the-21st-century/
RECAP: Data Visualization
 Presenting data in ways that allow us to visually observe patterns, exceptions, and the possible stories behind the raw data.

RECAP: Anscombe’s Quartet
 Sampled data sets from “Graphs in statistical analysis” by F. J. Anscombe, American Statistician, 27, 17–21 (1973).
 See: Chapter 1, Better Data Visualization.

Data Visualization
 Presenting data in ways that allow us to visually observe patterns, exceptions, and the possible stories behind the raw data.
 It is much easier to discover and confirm the presence (or even absence) of patterns, relationships, and physical characteristics (such as outliers) through visualization.
Data Visualization
 Addresses a variety of needs:
 to evaluate data.
 to communicate to peers.
 to convince the board/reviewers.
 to present to clients.
 to report to a regulatory committee.
History of Data Visualization
 Dates back to the 2nd century: Egyptians used maps for earthly and heavenly positions.
 10th century: for time series plots of celestial bodies.
 14th century: for plotting mathematical functions.

History of Data Visualization
 William Playfair (1759-1823) is credited to be the most influential person in data visualization.
 Inventor of statistical plots such as the
 line chart
 bar chart
 pie chart
Value of Data Visualization
 Record information
 Blueprints, photographs, seismographs, …
 Analyze data to support reasoning
 Develop and assess hypotheses
 Discover errors in data
 Expand memory
 Find patterns
 Communicate information to others
 Share and persuade
 Collaborate and revise

Value of Data Visualization: Record Information
 Record information
 Blueprints, photographs, seismographs, …
 Analyze data to support reasoning
 Develop and assess hypotheses
 Discover errors in data
 Expand memory
 Find patterns
 Communicate information to others
 Share and persuade
 Collaborate and revise
 Napoleon’s disastrous Russian campaign of 1812 (the French invasion of Russia).
A cartographic depiction of numerical data on a map of Napoleon's disastrous losses suffered during the Russian campaign of 1812. The illustration depicts Napoleon's army departing the Polish-Russian border.
Minard's interest lay with the painful efforts and sacrifices of the soldiers.
 Napoleon’s disastrous Russian campaign of 1812.
The graphic is notable for its representation in two dimensions of six types of data:
1. the size of Napoleon's troops;
2. distance;
3. temperature;
4. the latitude and longitude;
5. direction of travel; and
6. location relative to specific dates.
See: https://www.britannica.com/event/French-invasion-of-Russia
 Yet another visualization of Napoleon’s Russian campaign of 1812.
Sankey Chart / Sankey Diagram
 A type of flow diagram in which the width of the arrows is proportional to the flow rate of the depicted property.
 Named after Irish Captain Matthew Henry Phineas Riall Sankey, who used this type of diagram in 1898.
 Minard's map is a flow map, overlaying a Sankey diagram onto a geographical map.
 It was created in 1869, predating Sankey's first Sankey diagram of 1898.
Value of Data Visualization: Analyze data to support reasoning
 Record information
 Blueprints, photographs, seismographs, …
 Analyze data to support reasoning
 Develop and assess hypotheses
 Discover errors in data
 Expand memory
 Find patterns
 Communicate information to others
 Share and persuade
 Collaborate and revise
Analyze data to support reasoning
 John Snow’s famous study of the 1854 Cholera outbreak in Broad Street, London.
 Physician John Snow, a supporter of the germ theory and of the hypothesis that cholera spreads through contaminated water, used visualization to show that contaminated water, not air, was the source of cholera.
 This finding came to influence public health and the construction of improved sanitation facilities beginning in the mid-19th century.
Src:
• https://en.wikipedia.org/wiki/1854_Broad_Street_cholera_outbreak
• https://www.arcgis.com/home/item.html?id=d7deb67f810d46dfacb80ff80ac224e9
Analyze data to support reasoning
 John Snow’s famous study of the 1854 Cholera outbreak in Broad Street, London.
https://www.arcgis.com/apps/PublicInformation/index.html?appid=d7deb67f810d46dfacb80ff80ac224e9
Analyze data to support reasoning
 John Snow’s famous study of the 1854 Cholera outbreak in Broad Street, London.
Snow used his map to convince local authorities to remove the handle of the Broad Street pump, which prevented many deaths.
The removal of the Broad Street pump handle has become the stuff of legend. At the Centers for Disease Control (CDC) in Atlanta, when scientists look for simple answers to questions about epidemics, they sometimes ask each other, “Where is the handle to this pump?”
Value of Data Visualization: Communicate information to others
 Record information
 Blueprints, photographs, seismographs, …
 Analyze data to support reasoning
 Develop and assess hypotheses
 Discover errors in data
 Expand memory
 Find patterns
 Communicate information to others
 Share and persuade
 Collaborate and revise
Communicate information to others
 Nightingale’s Coxcomb Graph: to communicate and convince
 Nightingale (the lady of the lamp), a British nurse and social reformer, developed this chart type in 1858 to graphically represent the causes of death of British soldiers during the Crimean War (how people had died during the period from July 1854 through the end of the following year).
 Less well known is that she was a leading statistician and a pioneer in the visual presentation of information and statistical graphics.
The Crimean War was fought between Russia and an ultimately victorious alliance of the Ottoman Empire, France, the United Kingdom and Piedmont-Sardinia.
Communicate information to others
 The circle is divided into twelve equal "slices" representing each month of the year. Months with more deaths are shown with longer wedges, so that the area of each wedge represents the number of deaths in that month from disease (blue), wounds (red), or other causes (black). In the second year of the war (shown in the left image), deaths from disease were greatly reduced, showing the effect of the improved hygiene in the camps and hospitals starting in March 1855.
Communicate information to others
 Once you see Nightingale's graph, the terrible picture is clear. The Russians were a minor enemy. The real enemies were cholera, typhus, and dysentery. The chart indeed resulted in the modernization of the British army hospital system.
Interactive Visualization
 Interactive visualization allows better exploration of the data.
 For a visualization to be considered interactive it must satisfy two criteria:
 Human input: control of some aspect of the visual representation of information, or of the information being represented, and
 Response time: changes made through input must be incorporated into the visualization in a timely manner. In general, interactive visualization is considered a soft real-time task.
Interactive Visualization: An Example
 Disappearing Shorelines
 Source: https://archive.nytimes.com/www.nytimes.com/interactive/2012/11/24/opinion/sunday/what-could-disappear.html
CS F441 Course Goal: Exploratory Data Analysis
 Learn how to examine data and relationships among variables through visualization and statistical tools, with a goal towards
 Building insight into the data and the process that generated the data.
 Finding out what may be interesting.
 Determining which variables have the most predictive power.
 Assessing and validating assumptions on which future inference will be based.
Visual Analysis of Old Faithful Data
 The famous geyser in Yellowstone National Park, Wyoming, USA. See https://www.wonderopolis.org/wonder/why-is-it-called-old-faithful/
Visual Analysis of Old Faithful data
 Old Faithful data has eruption time and waiting time to the next eruption (both in minutes).
 Data is an array (table) of 272 elements (observations) of 2 variables:
 eruptions: 3.6 1.8 3.33 2.28 4.53 ...
 waiting: 79 54 74 62 85 55 88 85 51 85 ...
 The names of the columns in the data table: “eruptions” and “waiting”
Visual Analysis: Histogram
 A histogram is used to graphically summarize and display the distribution of a 1D data set.
 It divides the range into bins and counts the number of events in each bin (also called the frequency).
 The “y” axis in a histogram is usually the count or frequency of measurements in the corresponding bin.
Waiting time
vs Eruption time
Data Visualization: Art and Science
 Effective data visualization is an art and a science.
 Art: Requires design and communication skills that appeal to viewers at an aesthetic level.
 Science: Requires an understanding of cognitive science and visual perception, that is, of how the eye and brain work together to process the information in visual signals.
Designing Visualization
 Human visual functions are extremely fast and efficient.
 Cognitive functions are much slower and less efficient.
 Design visualizations that take advantage of the strengths of the visual functions and help the cognitive functions.
Main Components of Data Visualization
 Data Component
 Geometry/Graphic Component
 ex: scatter plot, barplot, histogram, smooth densities, boxplot.
 Aesthetics Component
 Visual cues to represent the information provided by the dataset. Ex:
 Position: the two most important cues in a plot are the point positions on the x-axis and y-axis.
 Color, Shape, Size, …: other important visual cues.
Data Types
 Understanding the differences in data types is critical:
 they determine which statistical analyses will be valid for that data
 and what types of plots are appropriate.
Data Types
 Three different data types:
 Nominal Data: categories or labels or descriptions
 Ordinal Data: has an associated order/rank.
 Quantitative Data: any numbers that express amount or quantity.
Nominal Data
 Examples include names, gender, hair color, product types, …
 A discrete classification of data:
 data are neither measured nor ordered
 but are merely allocated to distinct categories.
 Cannot compare such data:
 operations: equal, not equal
 For example:
 a record of students' course choices constitutes nominal data.
 Male (M), Female (F)
 Hair color: Brown, Black, Blonde, Gray, Other
Quantitative data
 Data that can be quantified.
 Data that answer questions such as “How many?”, “How often?”, “How much?”.
 In general, 2 categories:
 Continuous: can take any numeric value
 Discrete: ex: counts
Ordinal Data
 Categorical, statistical data type where the variables have natural, ordered categories.
 There is an order in the values (operations: equal, not equal; less/more).
 first place, second place, third place; size S, M, L.
 Note: the distances between the categories are not known (e.g. a scale ranging from happy to indifferent to unhappy).
 The ordinal scale is distinguished
 from the nominal scale by having ordered categories.
 from continuous scales by not having category widths that represent equal increments of the underlying attribute. [Wiki]
 Examples:
 Likert Scale
 Answers to the survey question "Is your general health poor, reasonable, good, or excellent?". The answers may be coded as 1, 2, 3, and 4.
 Individuals' income might be grouped into the income categories $0-$19,999, $20,000-$39,999, $40,000-$59,999, …
 socioeconomic status, military ranks.
 letter grades for coursework.
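A minimal pandas sketch of ordinal data (the survey answers and category labels below are illustrative): an ordered categorical records the natural order of the categories, so comparisons and sorting respect the ranking rather than alphabetical order.

import pandas as pd

# hypothetical answers to the health survey question above
answers = pd.Series(["good", "poor", "excellent", "reasonable", "good"])

# declare the natural order of the categories
health = pd.Categorical(answers,
                        categories=["poor", "reasonable", "good", "excellent"],
                        ordered=True)

print(health.min(), health.max())                 # poor excellent
print(pd.Series(health).sort_values().tolist())   # sorted by rank, not alphabetically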
A Typical DataSet
Source: https://datacatalog.worldbank.org/dataset/world-development-indicators

A Typical Dataset
 A table of records (rows)
 elements in the same row are related to each other in the sense that they are all measures from the same observation, or measures of the same item.
 Each record has a number of observations (columns)
 called Items, Dimensions, or Variables
 elements in the same column are related to each other in the sense that they are all measures of the same metric
 Dependent or Independent
 A dependent variable is affected by the variation of some other variable. Ex: date-dependent temperature.
 We may not know which variables are dependent or independent.
Sample Data Sources
 https://data.gov.in/: Open Govt. Data Platform India
 https://data.worldbank.org/country/india
 https://data.oecd.org/india.htm
 https://www.indiastat.com/
 https://ourworldindata.org/country/india
 https://www.imf.org/external/datamapper/profile/IND
 https://www.kaggle.com/datasets?tags=3023-India
 …
Recap: Nightingale’s Rose Chart
 Both the plots use the same scale.
How much of Data Science & Python is required for CS F441?
 To some extent
 Data Science & Python
 General: W3Schools Python Tutorial
 Python for beginners
Quick Python Review


Language Basics

 Python is an interpreted language.
 Python is a strongly-typed and dynamically-typed language.
 Strongly-typed: the interpreter always “respects” the type of each variable.
 Dynamically-typed: a variable is simply a value bound to a name.
Primitive Types
 4 primitive data types:
 Numbers
 Integers: 5
 Floats: 5.2
 Booleans: True, False
 Strings: 'xyz', "xyz", '''xyz'''
 '''This
is a
multiline string.'''
 Built-in values: True, False, None
Common Operations
>>> x = 10
>>> y = 3
>>> x + y
13
>>> x - y
7
>>> str(x) + str(y)
'103'
>>> x * y
30
>>> x / y
3.3333333333333335
>>> x // y
3
>>> x % y
1
>>> x ** y
1000
string facts
>>> s = 'help'
>>> s * 3
'helphelphelp'

 f-string: Formatted string. Allows expressions in a string literal.
>>> A = f'Ask for {s}.'
>>> print(A)
Ask for help.
f-strings are easier to read than concatenating strings and variables with the + sign, or using string formatting operations.

 r-string: Raw string. Retains escape sequences as text.
>>> s = r'Hi\nProf'
>>> print(s)
Hi\nProf
Multiple assignment
 x, y = 10, 5
 Can use multiple assignment to swap variables!
 y, x = x, y
Language Basics
 Python is a strongly-typed and dynamically-typed language.
 Strongly-typed: 1 + '1' → Error!
 Dynamically-typed:
 a = 1     : a is an integer variable
 a = 'str' : a is a string variable.
Collection types
 4 collection types:
 List
 Tuple
 Dictionary
 Set
Collection: List
 Lists are mutable (changeable) arrays
 Ex: names = ['Zach', 'Jay']   # note the square brackets
 Indexing is used to access individual elements

>>> names = ['Jack', 'Jill']
>>> names[0]
'Jack'
>>> print(names)
['Jack', 'Jill']
>>> len(names)
2
>>> emptyList = []        # or: emptyList = list()
>>> len(emptyList)
0
>>> names.append('Rick')
>>> len(names)
3
>>> print(names)
['Jack', 'Jill', 'Rick']
>>> names.extend(['Kevin', 'Adrian'])
>>> print(names)
['Jack', 'Jill', 'Rick', 'Kevin', 'Adrian']
List Slicing
 A subset of list elements can be accessed in convenient ways.
 Basic format:
 some_list[start_index:end_index]

numbers = [0, 1, 2, 3, 4, 5, 6]
numbers[0:3] == [0, 1, 2]
numbers[:3]  == [0, 1, 2]
numbers[5:]  == [5, 6]
numbers[5:7] == [5, 6]
numbers[:]   == [0, 1, 2, 3, 4, 5, 6]
Collection: Tuple
 Tuples are an ordered, immutable (unchangeable) collection.
 example: coordinates = (2., 5., 1.)   # note the parentheses (round brackets)
 Element access is by indexing, like in a list.   # coordinates[0], coordinates[1], …
 len(coordinates)   # returns 3
 emptyTuple = ()   # or: emptyTuple = tuple()
 oneElemTuple = (1.5, )   # the comma matters!
Collection: Dictionary
 A dictionary may be considered as an unordered list of key-value pairs.
 ex:
course = {
    "number": "ISC4551",
    "name": "Data Graphics and Visualization",
    "classSize": 35
}
 Element access is done by using the key as the index.
 ex: course["number"]
 Dictionary methods:
>>> course.keys()
dict_keys(['number', 'name', 'classSize'])
>>> course.values()
dict_values(['ISC4551', 'Data Graphics and Visualization', 35])
>>> print(course.items())
dict_items([('number', 'ISC4551'), ('name', 'Data Graphics and Visualization'), ('classSize', 35)])
Collection: Set
 An unordered and unindexed collection.
>>> thisset = {"apple", "banana", "cherry"}
>>> print(thisset)
{'apple', 'banana', 'cherry'}
 No duplicate items: sets cannot have two items with the same value.
>>> thisset = {"apple", "banana", "cherry", "apple"}
>>> print(thisset)
{'banana', 'apple', 'cherry'}
 Set membership test
>>> print("banana" in thisset)
True
>>> print("pineapple" in thisset)
False
 Add to set
>>> thisset.add("pineapple")
>>> print("pineapple" in thisset)
True
 Allows union, intersection, difference, …
ref: https://www.w3schools.com/python/python_sets.asp
Additional Collections
 Available via external packages (or libraries):
 numpy
 pandas
 Must import these packages before using the collections
>>> import numpy
>>> import pandas
 Note: these packages are not natively available with Python. They must be installed independently prior to importing.
 pip install numpy
 pip install pandas
numpy
 NumPy is a Python package used for working with arrays.
>>> arrays = numpy.array([1, 2, 3, 4, 5])
 Differences between list and numpy.array:
 numpy.array
 fixed length
 homogeneity of array elements
 occupies contiguous memory
 efficient
numpy.array
>>> A = [2,3,4]
>>> nA = numpy.array([2,3,4])
>>> type(nA)
<class 'numpy.ndarray'>
>>> type(A)
<class 'list'>
>>> A.append(5)
>>> print(A)
[2, 3, 4, 5]
>>> print(nA)
[2 3 4]
>>> nA.append(5)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'numpy.ndarray' object has no attribute 'append'
>>> A + 5
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can only concatenate list (not "int") to list
>>> nA + 5
array([7, 8, 9])

More on numpy: https://numpy.org/devdocs/user/quickstart.html
Random number
 NumPy offers the random module to work with random numbers.
>>> from numpy import random
>>> print(random.randint(100))      # integer random number in the range [0, 100)
50
>>> random.randint(100, size=(5))   # integer random 1D array of size 5
array([14, 10, 85, 85, 11])
>>> print(random.rand())            # float random number in the range [0, 1)
0.8974755964397841
>>> print(random.rand(5))           # float random array of size 5
[0.0204437  0.18452146 0.32198496 0.4379359  0.70774701]
Random number (continued)
>>> random.uniform(low=0, high=2, size=5)   # uniform distribution in [0, 2)
array([0.1142441 , 0.48921499, 0.52053535, 1.39136392, 1.85754155])
>>> random.normal(loc=0, scale=1, size=5)   # normal distribution
array([ 0.15476279,  0.89379376,  0.80774181, -0.74517664, -0.20817956])
note: loc is the mean and scale is the standard deviation
>>> print(random.choice([3, 5, 7, 9], p=[0.1, 0.3, 0.6, 0.0], size=5))
[7 7 5 5 7]
>>> random.permutation([1,2,3,4,5])
array([5, 4, 1, 2, 3])
Pandas and Data Frame
 Pandas is a Python library used for working with data sets.
 The name comes from "panel data", which refers to multidimensional structured data sets that are commonly used in econometrics and statistics.
 It has functions for analyzing, cleaning, exploring, and manipulating data.
 Used for a wide range of data analysis tasks, such as data cleaning, transformation, exploration, and modeling.
 Pandas provides data structures called
 DataFrame (a two-dimensional table-like data structure, like a 2D array, or a table with rows and columns), and
 Series (a one-dimensional labeled array)
 that make it easier to work with and analyze data in Python.
Ref: https://www.w3schools.com/python/pandas/pandas_intro.asp
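A minimal sketch of a Series (the values are illustrative, echoing the Old Faithful waiting times shown earlier):

import pandas as pd

# a one-dimensional labeled array
waiting = pd.Series([79, 54, 74, 62, 85], name="waiting")

print(waiting.mean())   # 70.8
print(waiting[0])       # 79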
DataFrame
>>> import pandas
>>> student1 = {
...     "courses": ["F441", "F311"],
...     "grades": ["A", "B+"]
... }
>>> df = pandas.DataFrame(student1)
>>> print(df)
  courses grades
0    F441      A
1    F311     B+
>>> print(student1)
{'courses': ['F441', 'F311'], 'grades': ['A', 'B+']}
DataFrame
>>> import pandas as pd
>>> student1 = {
...     "courses": ["F441", "F311"],
...     "grades": ["A", "B+"]
... }
>>> df = pd.DataFrame(student1)
>>> print(df)
  courses grades
0    F441      A
1    F311     B+
>>> print(student1)
{'courses': ['F441', 'F311'], 'grades': ['A', 'B+']}
DataFrame (contd…)
Accessing columns and rows of a data frame.
>>> student1 = {
...     "courses": ["F441", "F311"],
...     "grades": ["A", "B+"]
... }
>>> df = pd.DataFrame(student1)
>>> print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   courses  2 non-null      object
 1   grades   2 non-null      object
dtypes: object(2)
memory usage: 160.0+ bytes
None

Pandas uses the loc attribute to return one or more specified row(s):
>>> print(df.loc[0])
courses    F441
grades        A
Name: 0, dtype: object
pandas.read_csv()

>>> df = pandas.read_csv("data.csv")
>>> print(df.head(10))
Duration Pulse Maxpulse Calories
0 60 110 130 409.1
1 60 117 145 479.0
2 60 103 135 340.0
3 45 109 175 282.4
4 45 117 148 406.0
5 60 102 127 300.0
6 60 110 136 374.0
7 45 104 134 253.3
8 30 109 133 195.1
9 60 98 124 269.0
DataFrame (contd…)
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Duration  169 non-null    int64
 1   Pulse     169 non-null    int64
 2   Maxpulse  169 non-null    int64
 3   Calories  164 non-null    float64
dtypes: float64(1), int64(3)
memory usage: 5.4 KB
>>> df.describe()
         Duration       Pulse    Maxpulse     Calories
count  169.000000  169.000000  169.000000   164.000000
mean    63.846154  107.461538  134.047337   375.790244
std     42.299949   14.510259   16.450434   266.379919
min     15.000000   80.000000  100.000000    50.300000
25%     45.000000  100.000000  124.000000   250.925000
50%     60.000000  105.000000  131.000000   318.600000
75%     60.000000  111.000000  141.000000   387.600000
max    300.000000  159.000000  184.000000  1860.400000

Percentiles: how many of the values are less than the given percentile?
Flow Control

 if-then-else
 for loop
 while loop
Indents
 Code blocks are created using indents.
 Indentation is done using spaces: say 2 or 4 spaces, but it should be consistent throughout the file.

def fib(n):
    # Indent level 1: function body
    if n <= 1:
        # Indent level 2: if statement body
        return 1
    else:
        # Indent level 2: else statement body
        return fib(n-1) + fib(n-2)
Flow control: if-then-else
 In Python, there are three forms of the if...else statement.
 if statement
 if...else statement
 if...elif...else statement

python if statement
 Syntax of if statement:
if condition:
    # body of if statement
 example:
>>> number = 10
>>> if number > 0:
...     print('Number is positive.')
Number is positive.
 Syntax of if-else statement:
if condition:
    # block of code if condition is True
else:
    # block of code if condition is False
 example:
>>> number = 10
>>> if number > 0:
...     print('Positive number')
... else:
...     print('Negative number')
Positive number
if...elif...else statement
 Syntax:
if condition1:
    # code block 1
elif condition2:
    # code block 2
else:
    # code block 3
 Example:
>>> number = 0
>>> if number > 0:
...     print("Positive number")
... elif number == 0:
...     print('Zero')
... else:
...     print('Negative number')
Zero
for loop
 A for loop is used to run a block of code a certain number of times.
 It is used to iterate over any sequence such as a range, list, tuple, string, etc.
 Syntax:
for val in sequence:
    # statement(s)
 Example:
>>> for i in range(5):
...     print(i)
0
1
2
3
4
 The range() function returns a sequence of numbers, starting from 0 by default, incrementing by 1 (by default), and stopping before a specified number.
 syntax: range(start, stop, step)
for loop: unpacking tuples
>>> aList = [(1,10), (2,20), (3,30)]
>>> for x, y in aList:
...     print(x, y)
1 10
2 20
3 30
while loop
 A while loop is used to run a block of code until a certain condition is met.
 The syntax of the while loop is
while condition:
    # body of while loop
 Example:
>>> i = 0
>>> while i < 5:
...     i += 1
...     print(i)
1
2
3
4
5
Python Function
 A function is a block of code which only runs when it is called.
 A function can be called with data, known as parameters (or arguments).
 A function can optionally return data as a result.
 syntax:
def function_name(arguments):
    # function body
    return

>>> def square(num):
...     return num * num

>>> for i in [1, 2, 3]:
...     # function call
...     result = square(i)
...     print(f'Square of {i} = {result}')
Square of 1 = 1
Square of 2 = 4
Square of 3 = 9
Python lambda function
 A lambda function is a small anonymous function.
 A lambda function can take any number of arguments but can only have one expression.
 syntax: lambda arguments : expression

>>> x = lambda a : a + 10
>>> print(x(5))
15
>>> x = lambda a, b, c : a + b + c
>>> print(x(5, 6, 2))
13
CS F441
Data Visualization
SUMANTA PATTANAIK
WEEK 2
Preparing and Familiarizing with Data
 Acquisition:
 Download, or
 Manually gather, or
 Extract from a database, from a website or some document, and consolidate
 Examination:
 Metadata
 Completeness
 Quality (errors, unusual data, …)
 Transforming for quality: cleaning the data
 Transforming for analysis
 Dimension reduction
Data Examination: Metadata
 Information regarding a data set of interest.
 Provides information that can help in its interpretation:
 the format of individual fields within the data records,
 the base reference point from which some of the data fields are measured,
 the units used in the measurements,
 the symbol or number used to indicate a missing value,
 and the resolution at which measurements were acquired.
 Important in selecting the appropriate preprocessing operations, and in setting their parameters.
A Sample Metadata
 IMAGE DATA: Image files within this directory contain 2 dimensional views of a male cadaver, as collected for the National Library of Medicine's Visible Human Program.
 Anatomical Area: TOTAL BODY. Three sets (t1, t2, pd) of MRI male images.
 Type image: GE MRI, Signa v5.2. Frame size: 256, 256. Specifies the image size (width, height) in pixels.
 Pixel size: SEE FILE HEADER. Specifies the pixel size (width, height, separation) in millimeters.
 Image format: GE 16 BITS, Compressed Unix compressed, use "uncompress [filename]" to restore.
 Header size: 7900. The header block size in bytes.
 Coordinate offset: NONE, NONE. If image files are cropped to remove empty pixels, these offsets are provided, in pixels, relative to a fixed coordinate plane.
 …
Data Quality
 High-quality data needs to pass a set of quality criteria:
 Validity
 Accuracy
 Completeness
 Consistency
 Uniformity
See: https://en.wikipedia.org/wiki/Data_cleansing#Data_quality
Data Quality
 Validity: The degree to which the data conform to defined domain rules or constraints.
 Data-Type Constraints: values in a particular column must be of a particular datatype, e.g., boolean, numeric, date, etc.
 Range Constraints: typically, numbers or dates should fall within a certain range.
 Mandatory Constraints: certain columns cannot be empty.
 Unique Constraints: a field, or a combination of fields, must be unique across a dataset.
 Set-Membership Constraints: values of a column come from a set of discrete values, e.g. enum values. For example, a person’s gender may be male or female.
 Regular expression patterns: text fields that have to be in a certain pattern. For example, phone numbers may be required to have the pattern (999) 999–9999.
 Cross-field validation: certain conditions that span across multiple fields must hold. For example, a patient’s date of discharge from the hospital cannot be earlier than the date of admission.
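A minimal pandas sketch of a few such validity checks (the DataFrame, column names, allowed range and phone pattern below are illustrative assumptions, not part of the course data):

import pandas as pd

patients = pd.DataFrame({
    "age": [34, -2, 67],
    "gender": ["male", "female", "unknown"],
    "phone": ["(999) 123-4567", "12345", "(888) 555-0100"],
})

# Range constraint: ages must lie in [0, 120]
bad_age = ~patients["age"].between(0, 120)

# Set-membership constraint: gender must come from a fixed set of values
bad_gender = ~patients["gender"].isin(["male", "female"])

# Regular-expression constraint on the phone number format
bad_phone = ~patients["phone"].str.match(r"^\(\d{3}\) \d{3}-\d{4}$")

# rows violating at least one constraint
print(patients[bad_age | bad_gender | bad_phone])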
Data Quality
 Accuracy: The degree of conformity of a measure to a standard or a true value.
 ex: 6 digit PIN code in an address
 Accuracy is very hard to achieve in the general case, because it requires accessing an external source of data that contains the true value: such "gold standard" data is often unavailable.
Data Quality
 Completeness: The degree to which all required measures are known.
 Incompleteness is almost impossible to fix: one cannot infer facts that were not captured when the data in question was initially recorded.
 Ex: User response
Data Quality
 Consistency: The degree to which the data is consistent, within the same data set or across multiple data sets.
 Inconsistency occurs when two values in the data set contradict each other.
 Ex: Age and Marital status
Data Quality
 Uniformity: The degree to which a set of data measures are specified using the same units of measure in all systems.
 ex: Weight or height data related to an international event (say the Olympics)
Data Cleaning
 Data Cleaning (also called data wrangling): the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database
 identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and
 replacing, modifying, or deleting the dirty data
Data Cleaning
 Main steps of Data Cleaning:
 Inspection: Detect unexpected, incorrect, and inconsistent data.
 Cleaning: Fix or remove the anomalies discovered.
 Reporting: A report about the changes made and the quality of the currently stored data is recorded.
 Verify and Repeat
Source: https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4
Data Cleaning
 Inspection: Detect unexpected, incorrect, and inconsistent data.
 Data profiling: generate summary statistics about the data (a pandas sketch follows below)
 Is the data column recorded as a string or a number?
 How many values are missing?
 How many unique values are in a column, and what is their distribution?
 Statistical analysis and visualization of the distribution
 mean, standard deviation, range, or quantiles.
Source: https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4
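A minimal data-profiling sketch with pandas (reusing the hypothetical data.csv file from the read_csv example earlier; the column name is an assumption):

import pandas as pd

df = pd.read_csv("data.csv")        # hypothetical file

print(df.dtypes)                    # is each column stored as a string or a number?
print(df.isna().sum())              # how many values are missing per column?
print(df["Pulse"].nunique())        # how many unique values in a column?
print(df["Pulse"].value_counts())   # and their distribution
print(df.describe())                # mean, std, min/max, quartiles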
Data Cleaning
 Cleaning: Fix or remove the anomalies discovered.
 Missing Values:
 Drop row: missing values in a column rarely happen and occur at random
 Drop column: most of the column’s values are missing, and occur at random
 Assign value (impute): mean/median value, or prediction using linear regression
 Do nothing but flag
Source: https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4
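A minimal sketch of these options in pandas (continuing with the hypothetical data.csv example; the Calories column is an assumption):

import pandas as pd

df = pd.read_csv("data.csv")                 # hypothetical file

rows_dropped = df.dropna()                   # drop rows that contain any missing value
cols_dropped = df.dropna(axis="columns")     # drop columns that contain any missing value

imputed = df.copy()                          # impute missing values with the median
imputed["Calories"] = imputed["Calories"].fillna(imputed["Calories"].median())

flagged = df.assign(calories_missing=df["Calories"].isna())   # do nothing but flag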
Data Cleaning
 Cleaning: Fix or remove the anomalies discovered. (continued)
 Remove irrelevant/duplicate data
 Convert data types: ex: “Make sure numbers are stored as numerical data types”
 Massage string data: fix typos, remove extra white space, capitalize, etc.
 Standardize data: same unit of measurement (ex: m or cm or mm), European or USA convention (date format)
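A minimal sketch of these fixes with pandas (the toy DataFrame below is illustrative):

import pandas as pd

df = pd.DataFrame({
    "city":   [" Pilani", "pilani ", "Goa"],
    "amount": ["10", "10", "25"],
    "date":   ["21/08/2023", "21/08/2023", "22/08/2023"],
})

df["amount"] = df["amount"].astype(int)                      # store numbers as a numerical data type
df["city"] = df["city"].str.strip().str.title()              # trim extra white space, fix capitalization
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y")   # standardize the date format
df = df.drop_duplicates()                                    # remove duplicate rows

print(df)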
Data Cleaning
 Reporting: A report about the changes made and the quality of the currently stored data is recorded.
Source: https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4
Data Cleaning Tools:
 Drag-and-Drop Tools
 Script Based

Drag-and-Drop app for Data Cleaning
 Trifacta Wrangler: "Messy Data Accepted" https://www.trifacta.com/products/wrangler/
 Originally the Stanford/Berkeley Data Wrangler project
Drag-and-Drop App for Data Cleaning
 OpenRefine: http://openrefine.org/
 Originally Google Refine.
 Links to other free/commercial data cleaning tools can be found at https://en.wikipedia.org/wiki/Data_cleansing
Scripting Languages
 Functions in Python Pandas
 Javascript Array functions and additional tools in D3
 Functions in R Tidyverse
Two Categories of Visualization Tools
 Drag-and-Drop Tools
 Based on Scripting Languages
Two Categories of Tools
 Drag-and-Drop Tools:
 Rely on a Graphical User Interface
 Make assumptions about what you may like to do
 Ex: You may drag SALES to the Y-axis and DATE to the X-axis. The tool assumes that you are interested in graphing total sales per month.
 Examples: Tableau, Power BI
 Based on Scripting Languages
 You decide what and how you want to create the visualization.
Tableau
 #1 most-used Business Intelligence software tool
 Pros: Simple and easy for a beginner user
 connect to your data source, whether an Excel file, a database connection, or any of the dozens of other connection options
 drag the variable names you want onto a graph object (a “sheet”) and customize as you see fit.
 combine sheets into a dashboard in whatever configuration you like, and get creative with parameters, filters, or other customization options.
 Allows public and private hosting
 Free license for students
 Cons: “A minute to learn, a lifetime to master!”
 Less flexible
Microsoft Power BI
 Growing to be the most successful analytics and business intelligence platform
 Pros: Like Tableau, simple and easy for a beginner user
 connect to your data source, whether an Excel file, a database connection, or any of the dozens of other connection options
 drag the variable names you want onto a graph object (a “sheet”) and customize as you see fit.
 Easy integration with analysis
 Cons:
 Development time can be long
 Expensive
Two Categories of Tools
 Drag-and-Drop Tools
 Based on Scripting Languages
 Ex: matplotlib and Plotly in Python; D3, Observable Plot, Plotly in Javascript; ggplot, Plotly in R, …
 Better control over the result, but you have to be explicit about what you want.
This class will use Scripting Languages
 Python: scripting language
 Plotly and matplotlib: for data visualization
Matplotlib
 Matplotlib is a popular Python library for creating visualizations.
 matplotlib.pyplot: the primary module used for creating visualizations. It provides a simple interface for creating plots and charts.
 Pros:
 Versatile and customizable
 Wide adoption: being one of the oldest and most popular plotting libraries in Python, Matplotlib is widely adopted and often considered the starting point for many data visualization tasks.
 Cons:
 Steep learning curve.
 Limited interactivity.
 Its default styles might not always produce the most aesthetically pleasing plots compared to some other libraries.
 As in any scripting-based visualization tool, a lot of code to write.
 We will mostly use Pandas.plot and Seaborn: libraries developed on top of Matplotlib that reduce the coding load.
https://matplotlib.org/
Plotly
 Open Source graphics library for creating interactive, publication-quality graphs. It has concise and (hopefully) memorable functions to foster fluency.
 Pros:
 Interface is available in Python, R, Javascript, Matlab, Julia
 Great support for interaction
 Beautiful visualizations
 Cons:
 As in any scripting-based visualization tool, a lot of code to write.
 We will mostly use Plotly Express (plotly.express): a library developed on top of Plotly that reduces the coding load.
https://plotly.com/graphing-libraries/
Course Goal: Exploratory Data Analysis
 Learn how to examine data and relationships among variables through visualization and statistical tools, with a goal towards
 Building insight into the data and the process that generated the data.
 Finding out what may be interesting.
 Determining which variables have the most predictive power.

Plotly and Matplotlib for Pandas
Grammar of Graphics
 Independently specify building blocks and combine them to create just about any kind of graphical display you want.
 Building blocks of a graph:
 data
 geometric shapes
 coordinate system
 aesthetic mapping: mapping of data dimensions to visual dimensions
 scales
 statistical transformations
 position adjustments
 faceting
Example
import pandas as pd
alphabets = pd.read_csv("english-letter.csv")

Example
Columns / Channels / Data Dimensions
import pandas as pd
alphabets = pd.read_csv("english-letter.csv")
Data Array / Data Frame / Data Table
Plotly Example
import pandas as pd
import plotly.express as px
alphabets = pd.read_csv("english-letter.csv")

px.bar(alphabets,
       x="letter",
       y="frequency"
)

Geometric shape: bar. Data Array: alphabets. Data Channels: x="letter", y="frequency".
Pandas Matplotlib Example
import pandas as pd
alphabets = pd.read_csv("english-letter.csv")
alphabets.plot.bar(x="letter", y="frequency",
                   figsize=(10, 5), rot=0, width=0.8)

Geometric shape: bar. Data Array: alphabets. Data Channels: x="letter", y="frequency".
Plotly Example
import plotly.express as px
iris = px.data.iris()

Datasets: https://plotly.com/python-api-reference/generated/plotly.data.html
Plotly Example
import plotly.express as px
iris = px.data.iris()

fig = px.scatter(
    iris,
    x="sepal_width",
    y="sepal_length")
fig.show()

Geometric shape: scatter points. Data Channels: x, y.
Plotly Example
import plotly.express as px
iris = px.data.iris()

fig = px.scatter(iris,
                 x="sepal_width",
                 y="sepal_length",
                 color="species"
)
fig.show()

Geometric shape: scatter points. Data Dimensions: sepal_width, sepal_length, species. Visual Attributes / Visual Dimensions: x, y, color.
Matplotlib (Seaborn) Example
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
iris = px.data.iris()

fig, ax = plt.subplots(
    figsize=(10, 5)
)
sns.scatterplot(data=iris,
                x="sepal_width",
                y="sepal_length",
                hue="species",
                ax=ax)
Plotly Example
import plotly.express as px
iris = px.data.iris()
fig = px.scatter(iris,
                 x="sepal_width",
                 y="sepal_length",
                 color="species",
                 color_discrete_sequence=px.colors.qualitative.Light24
)
fig.show()

color_discrete_sequence: the mapping function or scaling function.
Data Graphics types
Both Plotly and Matplotlib support a large number of data
graphics types.
 Commonly used ones are
 Basic charts
 line chart
 bar chart
 scatter plot, bubble chart
 pie chart
 Statistical charts
 histogram
 box plot
 violin plot
 error plot
 distribution plots
 geo plots
 choropleths
 geobubble plots
 …
Main Components of Data Visualization
 Data Component
 Geometry/Graphic Component
 ex: scatter plot, barplot, histogram, smooth densities, boxplot.
 Aesthetics Component
 Visual cues to represent the information provided by the dataset. Ex:
 Position: the two most important cues in a plot are the point positions on the x-axis and y-axis.
 Color, Shape, Size, …: other important visual cues.
Scales
 Scales map data onto aesthetics
 a scale must be one-to-one
Location and Coordinate System
 Cartesian Coordinates:
 2D Cartesian coordinates are the most widely used in data visualization
 Axes are orthogonal
 Represent both positive and negative real numbers.
Example

MTCar Dataset:
Motor Trend magazine 1974
Effectiveness of Various Visual Properties for Data
 Some graphical properties are more effective than others when it comes to conveying information.

Effectiveness of Various Visual Properties for Quantitative Data
 Order of effectiveness in mapping:
 Position on the axis
 Length/width of line
 Area of shape
 Color saturation

Effectiveness of Various Visual Properties for Ordinal Data
 Order of effectiveness in mapping:
 Position on the axis
 Color saturation
 Color hue / texture
 Shape
Source: Mackinlay et al. 1986

Effectiveness of Various Visual Properties for Nominal Data
 Order of effectiveness in mapping:
 Position on the axis
 Color hue
 Texture
 Shape
Source: Mackinlay et al. 1986
Mapping and Scales
Aesthetics
 Visualization of all types of data needs a graphic element
 Dot, line, box, text, pie, …
 Aesthetics describe various aspects of graphic elements
 See examples
 Some can represent continuous data, some can represent discrete data, and some can represent both.
Scale Functions
 Scale functions map a dimension of abstract data to its visual representation.
 They take an input in a certain interval called the domain: a data dimension
 a number, date or category
 and return an output in another interval: a visual property
 a position coordinate
 a length or a radius
 a color
 Consider 2 types of data:
 Continuous: quantitative
 Discrete: an explicit set of values; ordinal, categorical
Continuous Input, Continuous Output
 Linear Scale
 The most suitable scale for transforming data values into positions and lengths
 Uses a linear function (y = m * x + b) to interpolate across the domain and range
 Domain and Range:
 Domain: [minCoordinate, maxCoordinate]
 Range: [0, width]
Continuous Input, Continuous Output
 Linear Scale: uses a linear function (y = m * x + b)
 SQRT Scale: uses a function linear in sqrt of x (y = m * sqrt(x) + b)
 Log Scale: uses a function linear in log of x (y = m * log(x) + b)
https://clauswilke.com/dataviz/coordinate-systems-axes.html
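A minimal Python sketch of such scale functions (the helper name and the example domain/range are illustrative): each scale maps a data value to a pixel position by interpolating linearly in x, sqrt(x) or log(x).

import math

def make_scale(domain, range_, transform=lambda v: v):
    # map transform(value) linearly from the domain interval to the range interval
    d0, d1 = transform(domain[0]), transform(domain[1])
    r0, r1 = range_
    def scale(value):
        t = (transform(value) - d0) / (d1 - d0)   # normalized position within the domain
        return r0 + t * (r1 - r0)                 # interpolated position within the range
    return scale

linear = make_scale((0, 100), (0, 500))
sqrt_s = make_scale((0, 100), (0, 500), math.sqrt)
log_s  = make_scale((1, 100), (0, 500), math.log)   # log scale needs a positive, non-zero domain

print(linear(50), sqrt_s(50), log_s(50))   # 250.0, ~353.6, ~424.7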
Continuous Input, Continuous Output
 Polar (Curvilinear) Scale
 Linear transformation of data values into angles and radial distances from the origin.
 Uses a linear function (θ = m * x + b) to interpolate across the domain and range
 Generally, one of the axes is assigned to the discrete input data
 Domain and Range:
 Domain: [0, maxCoordinate]
 Range: [0, 2π]
 Use: for cyclic data
https://clauswilke.com/dataviz/coordinate-systems-axes.html
Continuous data to continuous
Color

https://clauswilke.com/dataviz/coordinate-systems-axes.html
Scale Functions
 Scales with discrete input and discrete output
 Map discrete values (specified by an array) to discrete values (also specified by an array). The domain array specifies all possible input values and the range specifies all possible output values. The range array will repeat if it is shorter than the domain array.
 Ex: colorScale
 Domain: list/array of categorical values
 Range: list/array of positions, colors, …
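A minimal Python sketch of such a discrete-to-discrete scale (the function name and example values are illustrative); note how the range repeats when it is shorter than the domain.

def make_ordinal_scale(domain, range_):
    # map each domain value to a range value, repeating the range if it is shorter
    mapping = {d: range_[i % len(range_)] for i, d in enumerate(domain)}
    return lambda value: mapping[value]

color_scale = make_ordinal_scale(
    ["setosa", "versicolor", "virginica", "other"],
    ["red", "green", "blue"],        # shorter than the domain, so it repeats
)

print(color_scale("virginica"))   # blue
print(color_scale("other"))       # red  (the range wrapped around)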
Discrete Input to Discrete Output
 A discrete set of categorical values maps to equally spaced points along the specified range.

Discrete Input to Distinguishable Color Output: Categorical Data
https://clauswilke.com/dataviz/coordinate-systems-axes.html
Summary of Data and types

Main Components of Data Visualization
 Data Component
 Geometry/Graphic Component
 ex: lines, bars, symbols.
 Aesthetics Component
 Visual cues to represent the information provided by the dataset. Ex:
 Position: the two most important cues in a plot are the point positions on the x-axis and y-axis.
 Color, Shape, Size, …: other important visual cues.
Scale Functions
 Scale functions map a dimension of abstract data to its visual representation.
 They take an input in a certain interval called the domain: a data dimension
 a number, date or category
 and return an output in another interval: a visual property
 a position coordinate
 a length or a radius
 a color
 Consider 2 types of data:
 Continuous: quantitative
 Discrete: an explicit set of values; ordinal, categorical
Continuous Input, Continuous Output: Linear Scale
 The most suitable scale for transforming data values into positions and lengths
 Uses a linear function (y = m * x + b) to interpolate across the domain and range
 Domain and Range:
 Domain: [minCoordinate, maxCoordinate]
 Range: [0, width]
Continuous Input, Continuous Output: Log Scale
 The most suitable scale for transforming data values when the data spans a large range, such as an exponential growth rate.
 Ex: population of countries, revenue of companies.
 Uses a log function (y = m * log(x) + b) to interpolate across the domain and range
 Domain and Range:
 Positive (non-zero) Domain: [minCoordinate, maxCoordinate]
 Range: [0, width]
Continuous Input, Continuous Output: Sqrt Scale
 The most suitable scale for transforming data values into the diameter of a dot.
 Uses a sqrt function (y = m * sqrt(x) + b) to interpolate across the domain and range
 Domain and Range:
 Non-negative Domain: [minCoordinate, maxCoordinate]
 Range: [0, width]
Continuous Input, Continuous Output
 Linear Scale: uses a linear function (y = m * x + b)
 SQRT Scale: uses a function linear in sqrt of x (y = m * sqrt(x) + b)
 Log Scale: uses a function linear in log of x (y = m * log(x) + b)
https://clauswilke.com/dataviz/coordinate-systems-axes.html
Gapminder Data Source:
https://plotly.com/python/plotly-express/
Continuous Input, Continuous Output: Polar (Curvilinear) Scale
 Linear transformation of data values into angles and radial distances from the origin.
 Uses a linear function (θ = m * x + b) to interpolate across the domain and range
 Generally, one of the axes is assigned to the discrete input data
 Domain and Range:
 Domain: [0, maxCoordinate]
 Range: [0, 2π]
 Use: for cyclic data
https://clauswilke.com/dataviz/coordinate-systems-axes.html
Examples: Nightingale’s Rose plot, Radial Bar Plot, Star plot (Spider plot)

Sources: https://support.apple.com/en-in/guide/watch-ultra/apde9218b440/watchos and https://datavizcatalogue.com/methods/radar_chart.html
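A small sketch of a radar/star plot on a polar coordinate system with Plotly Express (the wind sample dataset and its column names come from px.data.wind()):

import plotly.express as px

# Cyclic data (wind direction) on a polar coordinate system:
# 'direction' becomes the angle, 'frequency' the radial distance.
df = px.data.wind()
fig = px.line_polar(df, r="frequency", theta="direction",
                    color="strength", line_close=True)
fig.show()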
Continuous data to continuous
Color

https://clauswilke.com/dataviz/coordinate-systems-axes.html
Continuous data to
continuous Color

Source: The scale and drivers of carbon footprints in households, cities and regions across India
January 2021 Global Environmental Change 66(11):102205
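A sketch of mapping a continuous variable to a continuous color scale with Plotly Express (iris column names from px.data.iris()):

import plotly.express as px

# Petal length (continuous) is mapped to a continuous color scale,
# so color encodes a quantitative value rather than a category.
df = px.data.iris()
fig = px.scatter(df, x="sepal_width", y="sepal_length",
                 color="petal_length",
                 color_continuous_scale="Viridis")
fig.show()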
Scale Functions

 Scales with discrete input and discrete output


 maps discrete values (specified by an array) to discrete values (also
specified by an array). The domain array specifies all possible input
values and the range specifies all possible output values. The range
array will repeat if it’s shorter than the domain array.
 Ex: colorScale
 Domain: list/array of categorical values
 Range: List/array of positions, colors, …
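A minimal sketch of such a discrete-to-discrete scale in plain Python (the cycling behavior described above; the function name make_ordinal_scale is illustrative):

# Discrete input -> discrete output: each domain category is paired with an
# output value; the range repeats (cycles) if it is shorter than the domain.
def make_ordinal_scale(domain, range_):
    mapping = {d: range_[i % len(range_)] for i, d in enumerate(domain)}
    return lambda value: mapping[value]

color_scale = make_ordinal_scale(
    ["setosa", "versicolor", "virginica"],
    ["#1f77b4", "#ff7f0e"],            # shorter range -> colors repeat
)
print(color_scale("virginica"))        # '#1f77b4' (range wrapped around)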
Discrete Input to Discrete Output

 Maps a discrete set of categorical values to equally spaced points along
the specified range.
Discrete Input to Distinguishable
Color Output: Categorical Data

https://clauswilke.com/dataviz/coordinate-systems-axes.html
Gapminder plot

Source: https://www.gapminder.org/tag/chart/
Summary of Data Types
Common, Effective
Data Visualization
Techniques
1D DATA
1D Data

Iris Data Set

import plotly.express as px
iris = px.data.iris()
Visualizing 1D Data

 Tabulation in order
 Nominal Data: Alphabetic order
 Ordinal and Quantitative data:
Could be increasing or decreasing
Visualizing 1D Data

 nominal data
 Tabulate
 Sorted Tabulate
 Total Count of unique category

# Count occurrences of each unique category
import plotly.express as px
tips = px.data.tips()
tips["day"].value_counts()
Visualizing 1D Data

 nominal data
 Tabulate
 Sorted Tabulate
 Total Count of unique category
 Bar plot (Histogram plot)
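A sketch of tabulating and bar-plotting a nominal column with pandas and Plotly Express (tips column names from px.data.tips()):

import plotly.express as px

tips = px.data.tips()

# Tabulate: count of each unique category, sorted by count
counts = tips["day"].value_counts()
print(counts)

# Bar plot of the category counts
fig = px.bar(x=counts.index, y=counts.values,
             labels={"x": "day", "y": "count"})
fig.show()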
1D Quantitative Data

 iris dataset

iris = px.data.iris()
1D Quantitative Data

Iris Data Set


1D Quantitative Data

Old Faithful
Data Set
Distribution  Examining sets of quantitative values:
 How are the values distributed from
Analysis lowest to highest?

Ref: Chapter 6. “Distribution” in Better Data Visualization


1D Quantitative Data

 iris dataset
 Distribution analysis

Distribution Analysis
 3 Key Characteristics of Distribution
 Spread: The difference between the maximum value and the minimum value (the range of the data).
 Wide or narrow
 Center: The mean/median of the data
 Mean: Mathematical average
 Median: One half of the data lie on one side of the median and the other half on the other side.
 Shape:
 Symmetric/uniform/skewed
 Unimodal/multimodal
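These characteristics can be computed directly with pandas; a small sketch on the iris sample bundled with Plotly Express:

import plotly.express as px

iris = px.data.iris()
x = iris["sepal_length"]

print("spread (range):", x.max() - x.min())   # max - min
print("mean:  ", x.mean())                    # center: mathematical average
print("median:", x.median())                  # center: half the data on each side
print(x.describe())                           # count, mean, std, quartiles, ...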
Data
Distribution
 Spread
 Range of Data
 Difference between the
largest and smallest
point

 Interquartile range:
 Measure of difference
between upper (75%)
and lower quartile (25%)
 Where the majority of values lie.

Source: https://www.onlinemathlearning.com/quartile.html
Box plot: Distribution visualization
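A sketch of a box plot for a single quantitative column with Plotly Express:

import plotly.express as px

iris = px.data.iris()
# Box plot of one quantitative variable: the box shows the quartiles,
# the whiskers the rest of the distribution, and dots the outliers.
fig = px.box(iris, y="sepal_length")
fig.show()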
Histogram: Distribution Visualization

 Histogram:
 A better way to visualize the
distribution of 1D data. Often used
in statistical analysis.
 shows the number of data points
(frequency) that lie within intervals,
called bins
 visualized as a collection of
rectangles. The frequency and the
width of the bin interval represent
the height and width of a
rectangle.
Histogram: Distribution Visualization

 Y-axis can represent:


 Count
 Percent
 Probability Density
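A sketch with Plotly Express showing the bin count and the different y-axis normalizations (the histnorm values are Plotly's own: "percent", "probability density", or omitted for raw counts):

import plotly.express as px

iris = px.data.iris()

# Default y-axis: count of values falling in each bin
px.histogram(iris, x="sepal_length", nbins=20).show()

# y-axis as percent of all observations
px.histogram(iris, x="sepal_length", nbins=20, histnorm="percent").show()

# y-axis as probability density (bar areas sum to 1)
px.histogram(iris, x="sepal_length", nbins=20,
             histnorm="probability density").show()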
Histogram: Distribution Visualization

 Statistical parameters
shown on the histogram
Histogram

 Best Practices
 Keep interval constant
 Select best interval

Source: https://statistics.laerd.com/statistical-guides/understanding-histograms.php
Histogram: Distribution Visualization

 Histogram with a continuous density plot overlaid.
Density Plot

 A Density Plot visualizes the distribution of data over a
continuous interval or time period.
 Uses kernel smoothing to plot
values, allowing for smoother
distributions by smoothing out the
noise.
 The peaks of a Density Plot help
display where values are
concentrated over the interval.
Visualizing 1D Quantitative Data

 Density Plot: a smoothed histogram.
Figure: Density plot over histogram of eruption durations from the Old Faithful dataset.
 Resources:
 Wiki page on Kernel Density Estimation: https://en.wikipedia.org/wiki/Kernel_density_estimation
 https://mathisonian.github.io/kde/
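A sketch of a density plot overlaid on a histogram using Plotly's figure factory (requires scipy for the kernel density estimate); the iris sepal length stands in here for the Old Faithful data:

import plotly.express as px
import plotly.figure_factory as ff

iris = px.data.iris()
data = [iris["sepal_length"].tolist()]

# create_distplot draws a histogram, a KDE curve and a rug plot together
fig = ff.create_distplot(data, group_labels=["sepal_length"], bin_size=0.2)
fig.show()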
Density plot of a Normal distribution

 Spread
 Range of Data
 Difference between the largest and
smallest point

 Interquartile range:
 Measure of difference between
upper (75%) and lower quartile (25%)
 Where the majority of values lie.

 Mean and Standard deviation:


 +/- 1 SD: 68%
 +/- 2 SD: 95%
 +/- 3 SD: 99.7%

Source: https://www.onlinemathlearning.com/quartile.html
Data
Distribution
 Shape
 Symmetric
 Uniform
 Skewed

Source: https://towardsdatascience.com/intro-to-descriptive-statistics-252e9c464ac9
Data
Distribution
 Shape
 Symmetric
 Uniform
 Skewed
 Multi-modal

Source: https://towardsdatascience.com/intro-to-descriptive-statistics-252e9c464ac9
Distribution
Display
 Single distribution display
 Histogram
 Frequency polygon:
 A line graph that
exclusively draws
attention to the
distribution’s shape with
minimal distraction

 Strip plot
 Stem-and-leaf plot
Frequency Polygon

 uses the same two axes as a histogram,
 is constructed by placing a point
at the center of each interval such
that the height of the point is
equal to the frequency or relative
frequency associated with that
interval.
 Points may also be placed on the
horizontal axis at the midpoints of
the empty intervals just before the first
and after the last bin, so that the
polygon begins and ends on the axis.
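A sketch of a frequency polygon built from histogram bin counts (numpy for the binning, Plotly for the line):

import numpy as np
import plotly.express as px
import plotly.graph_objects as go

iris = px.data.iris()
counts, edges = np.histogram(iris["sepal_length"], bins=15)
midpoints = (edges[:-1] + edges[1:]) / 2      # one point per bin, at its center

# A line through the bin midpoints is the frequency polygon
fig = go.Figure(go.Scatter(x=midpoints, y=counts, mode="lines+markers"))
fig.update_layout(xaxis_title="sepal_length", yaxis_title="frequency")
fig.show()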
Distribution
Display
 Single distribution display
 Histogram
 Frequency polygon:
 Strip plot
 1-D scatter plot

 Stem-and-leaf plot
Visualizing 1D Quantitative Data

 Strip Plot:
 the viewer gets an idea
about the range(s) of
values that are more
frequent and those that are
less frequent

Point plot (top) and Jitter plot (bottom) of Iris Sepal length.
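A sketch of a strip (jitter) plot with Plotly Express:

import plotly.express as px

iris = px.data.iris()
# px.strip jitters the points so overlapping values stay visible
fig = px.strip(iris, x="sepal_length")
fig.show()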
Distribution
Display
 Single distribution display
 Histogram
 Frequency polygon:
 Strip plot
 Stem-and-leaf plot
 Similar to histogram
 Data is sorted and
divided into stem intervals
at appropriate places
 Works well for small data
set.
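There is no standard Plotly chart for stem-and-leaf displays; a minimal plain-Python sketch (stems = tens digit, leaves = units digit, on a small made-up list of values):

from collections import defaultdict

# Small illustrative data set (hypothetical scores, not from the course data)
data = [44, 46, 47, 49, 63, 64, 66, 68, 68, 72, 72, 75, 76, 81, 84, 88]

stems = defaultdict(list)
for value in sorted(data):
    stems[value // 10].append(value % 10)     # stem = tens, leaf = units

for stem in sorted(stems):
    print(f"{stem} | {' '.join(str(leaf) for leaf in stems[stem])}")
# 4 | 4 6 7 9
# 6 | 3 4 6 8 8
# ...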
Violin Plot for 1D Data Visualization

 Violin plots are similar to box plots,
except that they also show the
probability density of the data at
different values.
Anatomy of
Violin Plot
 a combination of a Box Plot and
a Density Plot.
 The white dot in the middle is the
median value and the thick black
bar in the center represents the
interquartile range.
 Sometimes the violin is clipped at
the ends of the thin box-plot line (the whiskers).

Ref: How to Interpret Violin Plot


Violin plot vs Box plot
 Box Plots are limited in their display of the data, as their visual simplicity tends to hide significant details about how values in the data are distributed.
 For example, with Box Plots, you can't see if the distribution is bimodal or multimodal.
 While Violin Plots display more information, they can be noisier than a Box Plot.

https://datavizcatalogue.com/methods/violin_plot.html
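A sketch of a violin plot with Plotly Express, with the inner box plot and individual points switched on:

import plotly.express as px

iris = px.data.iris()
fig = px.violin(iris, y="sepal_length",
                box=True,        # draw the box plot inside the violin
                points="all")    # also show the individual observations
fig.show()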
Multiple Distribution Display

Iris Data Set


Distribution
Display
 Multiple distribution display
 Box Plots
 Violin Plots
 Multiple Strip plots
 Frequency Polygons

Source: https://datavizcatalogue.com/methods/box_plot.html
Distribution
Display
 Multiple distribution display
 Box Plots
 Violin Plots
 Multiple Strip plots
 Frequency Polygons

Source: https://blogs.sas.com/content/graphicallyspeaking/2013/03/24/custom-box-plot
Distribution
Display
 Multiple distribution display
 Box Plots
 Violin Plots
 Multiple Strip plots
 Frequency Polygons

https://datavizcatalogue.com/methods/violin_plot.html
Distribution Display

 Multiple distribution display


 Box Plots
 Violin Plots
 Multiple Strip plots
 Frequency Polygons
Distribution Display

 Multiple distribution display


 Box Plots
 Violin Plots
 Multiple Strip plots
 With jitter

 Frequency Polygons
Distribution
Display
 Multiple distribution display
 Box Plots
 Violin Plots
 Multiple Strip plots
 Frequency Polygons

Comparing City Mileage levels of 4 types of car


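A sketch of comparing distributions across groups (iris species stand in here for the car types in the figure); the same pattern works with px.violin and px.strip:

import plotly.express as px

iris = px.data.iris()
# One box per category: the x aesthetic splits the data into groups
fig = px.box(iris, x="species", y="sepal_length", points="all")
fig.show()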
Population Pyramid

 A Population Pyramid
is a pair of back-to-
back Histograms (for
each sex) that
displays the
distribution of a
population in all age
groups and in both
sexes.
Population Pyramid

 The X-axis is used to plot population numbers and the Y-axis
lists all age groups.
 Population Pyramids are ideal for
detecting changes or differences
in population patterns.
 Multiple Population Pyramids can
be used to compare patterns
across nations or selected
population groups or across
time,…

https://datavizcatalogue.com/methods/population_pyramid.html
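A minimal sketch of a population pyramid with Plotly graph objects (the age groups and counts below are made-up illustrative numbers, not real census data):

import plotly.graph_objects as go

# Hypothetical illustrative numbers (not real data)
ages = ["0-14", "15-29", "30-44", "45-59", "60-74", "75+"]
male = [120, 140, 150, 130, 90, 40]
female = [115, 135, 155, 140, 110, 60]

fig = go.Figure()
# Male counts are negated so their bars extend to the left of the axis
fig.add_trace(go.Bar(y=ages, x=[-v for v in male],
                     name="Male", orientation="h"))
fig.add_trace(go.Bar(y=ages, x=female, name="Female", orientation="h"))
fig.update_layout(barmode="relative",
                  xaxis_title="Population", yaxis_title="Age group")
fig.show()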