
CS F441
Data Visualization
SUMANTA PATTANAIK
 Sumanta Pattanaik: Visiting Professor
 Email: [email protected] (…pilani.ac.in)

Instructors
 Tathagat Ray: Professor, Hyderabad Campus
 Vinayak Naik: Professor, Goa Campus
 Sundaresan Raman: Professor, Pilani Campus
Course Particulars
 What will be covered?
 Principles and techniques for interactive data visualization that are useful for presenting and analyzing information associated with the data.
 Algorithmic aspects of developing interactive visualization.
 Students will receive practical experience of creating visualizations using
 Python-based tools for data analysis, and Plotly's Python graphing library and Matplotlib for data visualization.
Course Particulars
 Text book: Fundamentals of Data Visualization by Claus O. Wilke
 Online link: https://clauswilke.com/dataviz/

Prerequisite
 Programming Knowledge: Expertise in an object-based language is essential.
 Computer Programming and Data Structures, OO Programming.
 NOTE: This class has a strong programming component. All assignments and the project will be done using Python, Plotly and Matplotlib.
General Policies
 BITS Policies
 Students are expected to familiarize themselves with and follow the BITS Rules of Conduct.
 Talk to your local faculty instructors.
 Plagiarism in projects is discouraged.
 Penalties can include a failing grade in the course, or suspension, or … as per BITS disciplinary rules.
General Policies
 Evaluation Policies
 Regular Visualization assignments [20%]: Data cleaning and Visualization
 Regular (weekly) Quizzes [20%]
 Mid Term [20%]
 Comprehensive [30%]
 Final Project [10%]
General Policies
 Assignment Policies
 Assignments must be turned in by the deadline, mostly set to 11:55 pm of the due date.
 Programming assignments must be carried out using Jupyter notebook.
 Assignment submission is through Google Classroom (submit the assignment_xx.ipynb file).
 Programming in Python; visualization using Plotly and Matplotlib.
 Compute Platform: any computer/OS running Python and Jupyter.
 Students will work on the assignments independently.
General Policies
 Quizzes: 20 points (open book)
 Will be conducted weekly, online, during class hours.
 Questions will be from the material covered the previous week.
 Midterm: 20 points (closed book)
 Check your campus calendar for the date/time.
 Comprehensive: 30 points (closed book)
 Check your campus calendar for the date/time.
 Final Project: 10 points (open book)
 Final Project can be a group project.
 The project can be chosen in discussion with the local instructors and is presented by the last class day.
Topics
 Data visualization overview
 Data Manipulation and Data Processing
 Data visualization and Visual Analysis
 Perceptual and design principles for effective data visualization
 Interaction: Concepts and techniques
 High dimensional data visualization
Rough plan for the next few classes
 Lectures
 Overview
 Getting started with Python on Jupyter
 Getting started with Plotly, Matplotlib/Seaborn
 Assignments will be mostly on getting the class familiarized with
 Python, NumPy and Pandas
 Plotly and Matplotlib for visualization
Get Started
 Python setup: https://www.python.org/about/gettingstarted/
 Working with Jupyter Notebook: https://realpython.com/jupyter-notebook-introduction
Questions?
Overview
What is Data Visualization?
 Visualization is the process of understanding something by creating its mental picture.
 Data Visualization: understanding the data by creating its mental picture.
 Any technique that helps in creating that mental picture can be called data visualization.
Table for Visualization
 Example: CECS Graduate Student Diversity
 If you want to show the exact amount of every value in your data, a table might be your best solution.

Anatomy of a Table
 10 primary components.
Source: “Tables”, Chapter 11, Better Data Visualization.
Table for Visualization
 What about the following data?

What is Data Visualization?
 The human brain is better at understanding visual imagery than numbers or words.
Brain and Visual Information
 Anecdotal evidence: People remember:
 [https://blog.hubspot.com/agency/science-brains-crave-infographics]
 Verified evidence:
 50% of the brain is devoted to visual processing.
 A majority of sensory receptors are in our eyes.
 The brain can identify images seen for as little as 13 milliseconds.
 [https://news.mit.edu/2014/in-the-blink-of-an-eye-0116]
 Visualization through visual imagery has been an effective way to communicate both abstract and concrete ideas since the dawn of humanity. [wiki]
What is data visualization?
 Data visualization is the process of creating a visual representation of the data for accelerating its understanding.
 Hence, the process of creating the visual form has become synonymous with data visualization.

Data Visualization
 “The greatest value of a visual (picture) is when it forces us to notice what we never expected to see.”
 [John W. Tukey, author of Exploratory Data Analysis]
Value of Data Visualization
Anscombe’s Quartet
 Sampled data sets from “Graphs in statistical analysis” by F. J. Anscombe, American Statistician, 27, 17–21 (1973).
 See: Chapter 1, Better Data Visualization.

Value of Data Visualization
Anscombe’s Quartet
 Visualization of the sampled data sets from “Graphs in statistical analysis” by F. J. Anscombe, American Statistician, 27, 17–21 (1973).
 See more from: https://dabblingwithdata.wordpress.com/2017/05/03/the-datasaurus-a-monstrous-anscombe-for-the-21st-century/
RECAP: Data Visualization
 Presenting data in ways that allow us to visually observe patterns, exceptions, and the possible stories behind the raw data.

RECAP: Anscombe’s Quartet
 Sampled data sets from “Graphs in statistical analysis” by F. J. Anscombe, American Statistician, 27, 17–21 (1973).
 See: Chapter 1, Better Data Visualization.

Data Visualization
 Presenting data in ways that allow us to visually observe patterns, exceptions, and the possible stories behind the raw data.
 It is much easier to discover and confirm the presence (or even absence) of patterns, relationships, and physical characteristics (such as outliers) through visualization.
Data Visualization
 Addresses a variety of needs:
 to evaluate data.
 to communicate to peers.
 to convince the board/reviewers.
 to present to clients.
 to report to a regulatory committee.
History of Data Visualization
 Dates back to the 2nd century: Egyptians used maps for earthly and heavenly positions.
 10th century: for time series plots of celestial bodies.
 14th century: for plotting mathematical functions.

History of Data Visualization
 William Playfair (1759-1823) is credited to be the most influential person in data visualization.
 Inventor of statistical plots such as the
 line chart
 bar chart
 pie chart
Value of Data Visualization
 Record information
 Blueprints, photographs, seismographs, …
 Analyze data to support reasoning
 Develop and assess hypotheses
 Discover errors in data
 Expand memory
 Find patterns
 Communicate information to others
 Share and persuade
 Collaborate and revise

Value of Data Visualization: Record Information
 Record information
 Blueprints, photographs, seismographs, …
 Analyze data to support reasoning
 Develop and assess hypotheses
 Discover errors in data
 Expand memory
 Find patterns
 Communicate information to others
 Share and persuade
 Collaborate and revise
 Napoleon’s disastrous Russian campaign of 1812 (the French invasion of Russia).
A cartographic depiction of numerical data on a map of Napoleon's disastrous losses suffered during the Russian campaign of 1812. The illustration depicts Napoleon's army departing the Polish-Russian border.
Minard's interest lay with the painful efforts and sacrifices of the soldiers.
 Napoleon’s disastrous Russian campaign of 1812.
The graphic is notable for its representation in two dimensions of six types of data:
1. the size of Napoleon's troops;
2. distance;
3. temperature;
4. the latitude and longitude;
5. direction of travel; and
6. location relative to specific dates.
See: https://www.britannica.com/event/French-invasion-of-Russia
 Yet another visualization of Napoleon’s Russian campaign of 1812.
Sankey Chart / Sankey Diagram
 A type of flow diagram in which the width of the arrows is proportional to the flow rate of the depicted property.
 Named after Irish Captain Matthew Henry Phineas Riall Sankey, who used this type of diagram in 1898.
 Minard's map is a flow map, overlaying a Sankey diagram onto a geographical map.
 It was created in 1869, predating Sankey's first Sankey diagram of 1898.
Value of Data Visualization: Analyze data to support reasoning
 Record information
 Blueprints, photographs, seismographs, …
 Analyze data to support reasoning
 Develop and assess hypotheses
 Discover errors in data
 Expand memory
 Find patterns
 Communicate information to others
 Share and persuade
 Collaborate and revise
Analyze data to support reasoning
 John Snow’s famous study of the 1854 Cholera outbreak in Broad Street, London.
 Physician John Snow, a supporter of the germ theory and of the hypothesis that cholera spreads through contaminated water, used visualization to show that contaminated water, not air, was the source of cholera.
 This finding came to influence public health and the construction of improved sanitation facilities beginning in the mid-19th century.
Src:
• https://en.wikipedia.org/wiki/1854_Broad_Street_cholera_outbreak
• https://www.arcgis.com/home/item.html?id=d7deb67f810d46dfacb80ff80ac224e9
Analyze data to support reasoning
 John Snow’s famous study of the 1854 Cholera outbreak in Broad Street, London.
https://www.arcgis.com/apps/PublicInformation/index.html?appid=d7deb67f810d46dfacb80ff80ac224e9
Analyze data to support reasoning
 John Snow’s famous study of the 1854 Cholera outbreak in Broad Street, London.
Snow used his map to convince local authorities to remove the handle of the Broad Street pump, which prevented many deaths.
The removal of the Broad Street pump handle has become the stuff of legend. At the Centers for Disease Control (CDC) in Atlanta, when scientists look for simple answers to questions about epidemics, they sometimes ask each other, “Where is the handle to this pump?”
Value of Data Visualization: Communicate information to others
 Record information
 Blueprints, photographs, seismographs, …
 Analyze data to support reasoning
 Develop and assess hypotheses
 Discover errors in data
 Expand memory
 Find patterns
 Communicate information to others
 Share and persuade
 Collaborate and revise
Communicate information to others
 Nightingale’s Coxcomb Graph: to communicate and convince
 Nightingale (the lady of the lamp), a British nurse and social reformer, developed this chart type in 1858 to graphically represent the causes of death of British soldiers during the Crimean War (how people had died during the period from July 1854 through the end of the following year).
 Less well known is that she was a leading statistician and a pioneer in the visual presentation of information and statistical graphics.
The Crimean War was fought between Russia and an ultimately victorious alliance of the Ottoman Empire, France, the United Kingdom and Piedmont-Sardinia.
Communicate information to others
 The circle is divided into twelve equal "slices" representing each month of the year. Months with more deaths are shown with longer wedges, so that the area of each wedge represents the number of deaths in that month from disease (blue), wounds (red), or other causes (black). In the second year of the war (shown in the left image), deaths from disease were greatly reduced, showing the effect of the improved hygiene in the camps and hospitals starting in March 1855.
Communicate information to others
 Once you see Nightingale's graph, the terrible picture is clear. The Russians were a minor enemy. The real enemies were cholera, typhus, and dysentery. The chart indeed resulted in the modernization of the British army hospital system.
Interactive Visualization
 Interactive visualization allows better exploration of the data.
 For a visualization to be considered interactive it must satisfy two criteria:
 Human input: control of some aspect of the visual representation of information, or of the information being represented, and
 Response time: changes made through input must be incorporated into the visualization in a timely manner. In general, interactive visualization is considered a soft real-time task.
Interactive Visualization: An Example
 Disappearing Shorelines
 Source: https://archive.nytimes.com/www.nytimes.com/interactive/2012/11/24/opinion/sunday/what-could-disappear.html
CS F441 Course Goal: Exploratory Data Analysis
 Learn how to examine data and relationships among variables through visualization and statistical tools, with a goal towards
 Building insight into the data and the process that generated the data.
 Finding out what may be interesting.
 Determining which variables have the most predictive power.
 Assessing and validating assumptions on which future inference will be based.
Visual Analysis of Old Faithful Data
 The famous geyser in Yellowstone National Park, Wyoming, USA. See https://www.wonderopolis.org/wonder/why-is-it-called-old-faithful/
Visual Analysis of Old Faithful data
 Old Faithful data has eruption time and waiting time to the next eruption (both in minutes).
 Data is an array (table) of 272 elements (observations) of 2 variables:
 eruptions: 3.6 1.8 3.33 2.28 4.53 ...
 waiting: 79 54 74 62 85 55 88 85 51 85 ...
 The names of the columns in the data table: “eruptions” and “waiting”
Visual Analysis: Histogram
 A histogram is used to graphically summarize and display the distribution of a 1D data set.
 It divides the range into bins and counts the number of events in each bin (also called the frequency).
 The “y” axis in a histogram is usually the count or frequency of measurements in the corresponding bin.
Waiting time
vs Eruption time
Data Visualization: Art and Science
 Effective data visualization is an art and a science.
 Art: Requires design and communication skills that appeal to viewers at an aesthetic level.
 Science: Requires an understanding of cognitive science and visual perception, that is, of how the eye and brain work together to process the information in visual signals.
Designing Visualization
 Human visual functions are extremely fast and efficient.
 Cognitive functions are much slower and less efficient.
 Design visualizations that take advantage of the strengths of the visual functions and help the cognitive functions.
Main Components of Data Visualization
 Data Component
 Geometry/Graphic Component
 ex: scatter plot, barplot, histogram, smooth densities, boxplot.
 Aesthetics Component
 Visual cues to represent the information provided by the dataset. Ex:
 Position: the two most important cues in a plot are the point positions on the x-axis and y-axis.
 Color, Shape, Size, …: other important visual cues.
Data Types
 Understanding the differences in data types is critical:
 they determine which statistical analyses will be valid for that data
 and what types of plots are appropriate.
Data Types
 Three different data types:
 Nominal Data: categories or labels or descriptions
 Ordinal Data: has an associated order/rank.
 Quantitative Data: any numbers that express amount or quantity.
Nominal Data
 Examples include names, gender, hair color, product types, …
 A discrete classification of data:
 data are neither measured nor ordered
 but are merely allocated to distinct categories.
 Cannot compare such data:
 operations: equal, not equal
 For example:
 a record of students' course choices constitutes nominal data.
 Male (M), Female (F)
 Hair color: Brown, Black, Blonde, Gray, Other
Quantitative data
 Data that can be quantified.
 Data that answer questions such as “How many?”, “How often?”, “How much?”.
 In general, 2 categories:
 Continuous: can take any numeric value
 Discrete: ex: counts
Ordinal Data
 Categorical, statistical data type where the variables have natural, ordered categories.
 There is an order in the values (operations: equal, not equal; less/more).
 first place, second place, third place; size S, M, L.
 Note: the distances between the categories are not known (e.g. a scale ranging from happy to indifferent to unhappy).
 The ordinal scale is distinguished
 from the nominal scale by having ordered categories.
 from continuous scales by not having category widths that represent equal increments of the underlying attribute. [Wiki]
 Examples:
 Likert Scale
 Answers to the survey question "Is your general health poor, reasonable, good, or excellent?". The answers may be coded as 1, 2, 3, and 4.
 Individuals' income might be grouped into the income categories $0-$19,999, $20,000-$39,999, $40,000-$59,999, …
 socioeconomic status, military ranks.
 letter grades for coursework.
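A minimal pandas sketch of ordinal data (the survey answers and category labels below are illustrative): an ordered categorical records the natural order of the categories, so comparisons and sorting respect the ranking rather than alphabetical order.

import pandas as pd

# hypothetical answers to the health survey question above
answers = pd.Series(["good", "poor", "excellent", "reasonable", "good"])

# declare the natural order of the categories
health = pd.Categorical(answers,
                        categories=["poor", "reasonable", "good", "excellent"],
                        ordered=True)

print(health.min(), health.max())                 # poor excellent
print(pd.Series(health).sort_values().tolist())   # sorted by rank, not alphabetically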
A Typical DataSet
Source: https://datacatalog.worldbank.org/dataset/world-development-indicators

A Typical Dataset
 A table of records (rows)
 elements in the same row are related to each other in the sense that they are all measures from the same observation, or measures of the same item.
 Each record has a number of observations (columns)
 called Items, Dimensions, or Variables
 elements in the same column are related to each other in the sense that they are all measures of the same metric
 Dependent or Independent
 A dependent variable is affected by the variation of some other variable. Ex: date-dependent temperature.
 We may not know which variables are dependent or independent.
Sample Data Sources
 https://data.gov.in/: Open Govt. Data Platform India
 https://data.worldbank.org/country/india
 https://data.oecd.org/india.htm
 https://www.indiastat.com/
 https://ourworldindata.org/country/india
 https://www.imf.org/external/datamapper/profile/IND
 https://www.kaggle.com/datasets?tags=3023-India
 …
Recap: Nightingale’s Rose Chart
 Both the plots use the same scale.
How much of Data Science & Python is required for CS F441?
 To some extent
 Data Science & Python
 General: W3Schools Python Tutorial
 Python for beginners
Quick Python Review


Language Basics

 Python is an interpreted language.
 Python is a strongly-typed and dynamically-typed language.
 Strongly-typed: the interpreter always “respects” the type of each variable.
 Dynamically-typed: a variable is simply a value bound to a name.
Primitive Types
 4 primitive data types:
 Numbers
 Integers: 5
 Floats: 5.2
 Booleans: True, False
 Strings: 'xyz', "xyz", '''xyz'''
 '''This
is a
multiline string.'''
 Built-in values: True, False, None
Common Operations
>>> x = 10
>>> y = 3
>>> x + y
13
>>> x - y
7
>>> str(x) + str(y)
'103'
>>> x * y
30
>>> x / y
3.3333333333333335
>>> x // y
3
>>> x % y
1
>>> x ** y
1000
string facts
>>> s = 'help'
>>> s * 3
'helphelphelp'

 f-string: Formatted string. Allows expressions in a string literal.
>>> A = f'Ask for {s}.'
>>> print(A)
Ask for help.
f-strings are easier to read than concatenating strings and variables with the + sign, or using string formatting operations.

 r-string: Raw string. Retains escape sequences as text.
>>> s = r'Hi\nProf'
>>> print(s)
Hi\nProf
Multiple assignment
 x, y = 10, 5
 Can use multiple assignment to swap variables!
 y, x = x, y
Language Basics
 Python is a strongly-typed and dynamically-typed language.
 Strongly-typed: 1 + '1' → Error!
 Dynamically-typed:
 a = 1     : a is an integer variable
 a = 'str' : a is a string variable.
Collection types
 4 collection types:
 List
 Tuple
 Dictionary
 Set
Collection: List
 Lists are mutable (changeable) arrays
 Ex: names = ['Zach', 'Jay']   # note the square brackets
 Indexing is used to access individual elements

>>> names = ['Jack', 'Jill']
>>> names[0]
'Jack'
>>> print(names)
['Jack', 'Jill']
>>> len(names)
2
>>> emptyList = []        # or: emptyList = list()
>>> len(emptyList)
0
>>> names.append('Rick')
>>> len(names)
3
>>> print(names)
['Jack', 'Jill', 'Rick']
>>> names.extend(['Kevin', 'Adrian'])
>>> print(names)
['Jack', 'Jill', 'Rick', 'Kevin', 'Adrian']
List Slicing
 A subset of list elements can be accessed in convenient ways.
 Basic format:
 some_list[start_index:end_index]

numbers = [0, 1, 2, 3, 4, 5, 6]
numbers[0:3] == [0, 1, 2]
numbers[:3]  == [0, 1, 2]
numbers[5:]  == [5, 6]
numbers[5:7] == [5, 6]
numbers[:]   == [0, 1, 2, 3, 4, 5, 6]
Collection: Tuple
 Tuples are an ordered, immutable (unchangeable) collection.
 example: coordinates = (2., 5., 1.)   # note the parentheses (round brackets)
 Element access is by indexing, like in a list.   # coordinates[0], coordinates[1], …
 len(coordinates)   # returns 3
 emptyTuple = ()   # or: emptyTuple = tuple()
 oneElemTuple = (1.5, )   # the comma matters!
Collection: Dictionary
 A dictionary may be considered as an unordered list of key-value pairs.
 ex:
course = {
    "number": "ISC4551",
    "name": "Data Graphics and Visualization",
    "classSize": 35
}
 Element access is done by using the key as the index.
 ex: course["number"]
 Dictionary methods:
>>> course.keys()
dict_keys(['number', 'name', 'classSize'])
>>> course.values()
dict_values(['ISC4551', 'Data Graphics and Visualization', 35])
>>> print(course.items())
dict_items([('number', 'ISC4551'), ('name', 'Data Graphics and Visualization'), ('classSize', 35)])
Collection: Set
 An unordered and unindexed collection.
>>> thisset = {"apple", "banana", "cherry"}
>>> print(thisset)
{'apple', 'banana', 'cherry'}
 No duplicate items: sets cannot have two items with the same value.
>>> thisset = {"apple", "banana", "cherry", "apple"}
>>> print(thisset)
{'banana', 'apple', 'cherry'}
 Set membership test
>>> print("banana" in thisset)
True
>>> print("pineapple" in thisset)
False
 Add to set
>>> thisset.add("pineapple")
>>> print("pineapple" in thisset)
True
 Allows union, intersection, difference, …
ref: https://www.w3schools.com/python/python_sets.asp
Additional Collections
 Available via external packages (or libraries):
 numpy
 pandas
 Must import these packages before using the collections
>>> import numpy
>>> import pandas
 Note: these packages are not natively available with Python. They must be installed independently prior to importing.
 pip install numpy
 pip install pandas
numpy
 NumPy is a Python package used for working with arrays.
>>> arrays = numpy.array([1, 2, 3, 4, 5])
 Differences between list and numpy.array:
 numpy.array
 fixed length
 homogeneity of array elements
 occupies contiguous memory
 efficient
numpy.array
>>> A = [2,3,4]
>>> nA = numpy.array([2,3,4])
>>> type(nA)
<class 'numpy.ndarray'>
>>> type(A)
<class 'list'>
>>> A.append(5)
>>> print(A)
[2, 3, 4, 5]
>>> print(nA)
[2 3 4]
>>> nA.append(5)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'numpy.ndarray' object has no attribute 'append'
>>> A + 5
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can only concatenate list (not "int") to list
>>> nA + 5
array([7, 8, 9])

More on numpy: https://numpy.org/devdocs/user/quickstart.html
Random number
 NumPy offers the random module to work with random numbers.
>>> from numpy import random
>>> print(random.randint(100))      # integer random number in the range [0, 100)
50
>>> random.randint(100, size=(5))   # integer random 1D array of size 5
array([14, 10, 85, 85, 11])
>>> print(random.rand())            # float random number in the range [0, 1)
0.8974755964397841
>>> print(random.rand(5))           # float random array of size 5
[0.0204437  0.18452146 0.32198496 0.4379359  0.70774701]
Random number (continued)
>>> random.uniform(low=0, high=2, size=5)   # uniform distribution in [0, 2)
array([0.1142441 , 0.48921499, 0.52053535, 1.39136392, 1.85754155])
>>> random.normal(loc=0, scale=1, size=5)   # normal distribution
array([ 0.15476279,  0.89379376,  0.80774181, -0.74517664, -0.20817956])
note: loc is the mean and scale is the standard deviation
>>> print(random.choice([3, 5, 7, 9], p=[0.1, 0.3, 0.6, 0.0], size=5))
[7 7 5 5 7]
>>> random.permutation([1,2,3,4,5])
array([5, 4, 1, 2, 3])
Pandas and Data Frame
 Pandas is a Python library used for working with data sets.
 The name comes from "panel data", which refers to multidimensional structured data sets that are commonly used in econometrics and statistics.
 It has functions for analyzing, cleaning, exploring, and manipulating data.
 Used for a wide range of data analysis tasks, such as data cleaning, transformation, exploration, and modeling.
 Pandas provides data structures called
 DataFrame (a two-dimensional table-like data structure, like a 2D array, or a table with rows and columns), and
 Series (a one-dimensional labeled array)
 that make it easier to work with and analyze data in Python.
Ref: https://www.w3schools.com/python/pandas/pandas_intro.asp
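A minimal sketch of a Series (the values are illustrative, echoing the Old Faithful waiting times shown earlier):

import pandas as pd

# a one-dimensional labeled array
waiting = pd.Series([79, 54, 74, 62, 85], name="waiting")

print(waiting.mean())   # 70.8
print(waiting[0])       # 79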
DataFrame
>>> import pandas
>>> student1 = {
...     "courses": ["F441", "F311"],
...     "grades": ["A", "B+"]
... }
>>> df = pandas.DataFrame(student1)
>>> print(df)
  courses grades
0    F441      A
1    F311     B+
>>> print(student1)
{'courses': ['F441', 'F311'], 'grades': ['A', 'B+']}
DataFrame
>>> import pandas as pd
>>> student1 = {
...     "courses": ["F441", "F311"],
...     "grades": ["A", "B+"]
... }
>>> df = pd.DataFrame(student1)
>>> print(df)
  courses grades
0    F441      A
1    F311     B+
>>> print(student1)
{'courses': ['F441', 'F311'], 'grades': ['A', 'B+']}
DataFrame (contd…)
Accessing columns and rows of a data frame.
>>> student1 = {
...     "courses": ["F441", "F311"],
...     "grades": ["A", "B+"]
... }
>>> df = pd.DataFrame(student1)
>>> print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   courses  2 non-null      object
 1   grades   2 non-null      object
dtypes: object(2)
memory usage: 160.0+ bytes
None

Pandas uses the loc attribute to return one or more specified row(s):
>>> print(df.loc[0])
courses    F441
grades        A
Name: 0, dtype: object
pandas.read_csv()

>>> df = pandas.read_csv("data.csv")
>>> print(df.head(10))
Duration Pulse Maxpulse Calories
0 60 110 130 409.1
1 60 117 145 479.0
2 60 103 135 340.0
3 45 109 175 282.4
4 45 117 148 406.0
5 60 102 127 300.0
6 60 110 136 374.0
7 45 104 134 253.3
8 30 109 133 195.1
9 60 98 124 269.0
DataFrame (contd…)
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Duration  169 non-null    int64
 1   Pulse     169 non-null    int64
 2   Maxpulse  169 non-null    int64
 3   Calories  164 non-null    float64
dtypes: float64(1), int64(3)
memory usage: 5.4 KB
>>> df.describe()
         Duration       Pulse    Maxpulse     Calories
count  169.000000  169.000000  169.000000   164.000000
mean    63.846154  107.461538  134.047337   375.790244
std     42.299949   14.510259   16.450434   266.379919
min     15.000000   80.000000  100.000000    50.300000
25%     45.000000  100.000000  124.000000   250.925000
50%     60.000000  105.000000  131.000000   318.600000
75%     60.000000  111.000000  141.000000   387.600000
max    300.000000  159.000000  184.000000  1860.400000

Percentiles: how many of the values are less than the given percentile?
Flow Control

 if-then-else
 for loop
 while loop
Indents
 Code blocks are created using indents.
 Indentation is done using spaces: say 2 or 4 spaces, but it should be consistent throughout the file.

def fib(n):
    # Indent level 1: function body
    if n <= 1:
        # Indent level 2: if statement body
        return 1
    else:
        # Indent level 2: else statement body
        return fib(n-1) + fib(n-2)
Flow control: if-then-else
 In Python, there are three forms of the if...else statement.
 if statement
 if...else statement
 if...elif...else statement

python if statement
 Syntax of if statement:
if condition:
    # body of if statement
 example:
>>> number = 10
>>> if number > 0:
...     print('Number is positive.')
Number is positive.
 Syntax of if-else statement:
if condition:
    # block of code if condition is True
else:
    # block of code if condition is False
 example:
>>> number = 10
>>> if number > 0:
...     print('Positive number')
... else:
...     print('Negative number')
Positive number
if...elif...else statement
 Syntax:
if condition1:
    # code block 1
elif condition2:
    # code block 2
else:
    # code block 3
 Example:
>>> number = 0
>>> if number > 0:
...     print("Positive number")
... elif number == 0:
...     print('Zero')
... else:
...     print('Negative number')
Zero
for loop
 A for loop is used to run a block of code a certain number of times.
 It is used to iterate over any sequence such as a range, list, tuple, string, etc.
 Syntax:
for val in sequence:
    # statement(s)
 Example:
>>> for i in range(5):
...     print(i)
0
1
2
3
4
 The range() function returns a sequence of numbers, starting from 0 by default, incrementing by 1 (by default), and stopping before a specified number.
 syntax: range(start, stop, step)
for loop: unpacking tuples
>>> aList = [(1,10), (2,20), (3,30)]
>>> for x, y in aList:
...     print(x, y)
1 10
2 20
3 30
while loop
 A while loop is used to run a block of code until a certain condition is met.
 The syntax of the while loop is
while condition:
    # body of while loop
 Example:
>>> i = 0
>>> while i < 5:
...     i += 1
...     print(i)
1
2
3
4
5
Python Function
 A function is a block of code which only runs when it is called.
 A function can be called with data, known as parameters (or arguments).
 A function can optionally return data as a result.
 syntax:
def function_name(arguments):
    # function body
    return

>>> def square(num):
...     return num * num

>>> for i in [1, 2, 3]:
...     # function call
...     result = square(i)
...     print(f'Square of {i} = {result}')
Square of 1 = 1
Square of 2 = 4
Square of 3 = 9
Python lambda function
 A lambda function is a small anonymous function.
 A lambda function can take any number of arguments but can only have one expression.
 syntax: lambda arguments : expression

>>> x = lambda a : a + 10
>>> print(x(5))
15
>>> x = lambda a, b, c : a + b + c
>>> print(x(5, 6, 2))
13
CS F441
Data Visualization
SUMANTA PATTANAIK
WEEK 2
Preparing and Familiarizing with Data
 Acquisition:
 Download, or
 Manually gather, or
 Extract from a database, from a website or some document, and consolidate
 Examination:
 Metadata
 Completeness
 Quality (errors, unusual data, …)
 Transforming for quality: cleaning the data
 Transforming for analysis
 Dimension reduction
Data Examination: Metadata
 Information regarding a data set of interest.
 Provides information that can help in its interpretation:
 the format of individual fields within the data records,
 the base reference point from which some of the data fields are measured,
 the units used in the measurements,
 the symbol or number used to indicate a missing value,
 and the resolution at which measurements were acquired.
 Important in selecting the appropriate preprocessing operations, and in setting their parameters.
A Sample Metadata
 IMAGE DATA: Image files within this directory contain 2 dimensional views of a male cadaver, as collected for the National Library of Medicine's Visible Human Program.
 Anatomical Area: TOTAL BODY. Three sets (t1, t2, pd) of MRI male images.
 Type image: GE MRI, Signa v5.2. Frame size: 256, 256. Specifies the image size (width, height) in pixels.
 Pixel size: SEE FILE HEADER. Specifies the pixel size (width, height, separation) in millimeters.
 Image format: GE 16 BITS, Compressed Unix compressed, use "uncompress [filename]" to restore.
 Header size: 7900. The header block size in bytes.
 Coordinate offset: NONE, NONE. If image files are cropped to remove empty pixels, these offsets are provided, in pixels, relative to a fixed coordinate plane.
 …
Data Quality
 High-quality data needs to pass a set of quality criteria:
 Validity
 Accuracy
 Completeness
 Consistency
 Uniformity
See: https://en.wikipedia.org/wiki/Data_cleansing#Data_quality
Data Quality
 Validity: The degree to which the data conform to defined domain rules or constraints.
 Data-Type Constraints: values in a particular column must be of a particular datatype, e.g., boolean, numeric, date, etc.
 Range Constraints: typically, numbers or dates should fall within a certain range.
 Mandatory Constraints: certain columns cannot be empty.
 Unique Constraints: a field, or a combination of fields, must be unique across a dataset.
 Set-Membership Constraints: values of a column come from a set of discrete values, e.g. enum values. For example, a person’s gender may be male or female.
 Regular expression patterns: text fields that have to be in a certain pattern. For example, phone numbers may be required to have the pattern (999) 999–9999.
 Cross-field validation: certain conditions that span across multiple fields must hold. For example, a patient’s date of discharge from the hospital cannot be earlier than the date of admission.
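A minimal pandas sketch of a few such validity checks (the DataFrame, column names, allowed range and phone pattern below are illustrative assumptions, not part of the course data):

import pandas as pd

patients = pd.DataFrame({
    "age": [34, -2, 67],
    "gender": ["male", "female", "unknown"],
    "phone": ["(999) 123-4567", "12345", "(888) 555-0100"],
})

# Range constraint: ages must lie in [0, 120]
bad_age = ~patients["age"].between(0, 120)

# Set-membership constraint: gender must come from a fixed set of values
bad_gender = ~patients["gender"].isin(["male", "female"])

# Regular-expression constraint on the phone number format
bad_phone = ~patients["phone"].str.match(r"^\(\d{3}\) \d{3}-\d{4}$")

# rows violating at least one constraint
print(patients[bad_age | bad_gender | bad_phone])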
Data Quality
 Accuracy: The degree of conformity of a measure to a standard or a true value.
 ex: 6 digit PIN code in an address
 Accuracy is very hard to achieve in the general case, because it requires accessing an external source of data that contains the true value: such "gold standard" data is often unavailable.
Data Quality
 Completeness: The degree to which all required measures are known.
 Incompleteness is almost impossible to fix: one cannot infer facts that were not captured when the data in question was initially recorded.
 Ex: User response
Data Quality
 Consistency: The degree to which the data is consistent, within the same data set or across multiple data sets.
 Inconsistency occurs when two values in the data set contradict each other.
 Ex: Age and Marital status
Data Quality
 Uniformity: The degree to which a set of data measures are specified using the same units of measure in all systems.
 ex: Weight or height data related to an international event (say the Olympics)
Data Cleaning
 Data Cleaning (also called data wrangling): the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database
 identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and
 replacing, modifying, or deleting the dirty data
Data Cleaning
 Main steps of Data Cleaning:
 Inspection: Detect unexpected, incorrect, and inconsistent data.
 Cleaning: Fix or remove the anomalies discovered.
 Reporting: A report about the changes made and the quality of the currently stored data is recorded.
 Verify and Repeat
Source: https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4
Data Cleaning
 Inspection: Detect unexpected, incorrect, and inconsistent data.
 Data profiling: generate summary statistics about the data (a pandas sketch follows below)
 Is the data column recorded as a string or a number?
 How many values are missing?
 How many unique values are in a column, and what is their distribution?
 Statistical analysis and visualization of the distribution
 mean, standard deviation, range, or quantiles.
Source: https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4
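A minimal data-profiling sketch with pandas (reusing the hypothetical data.csv file from the read_csv example earlier; the column name is an assumption):

import pandas as pd

df = pd.read_csv("data.csv")        # hypothetical file

print(df.dtypes)                    # is each column stored as a string or a number?
print(df.isna().sum())              # how many values are missing per column?
print(df["Pulse"].nunique())        # how many unique values in a column?
print(df["Pulse"].value_counts())   # and their distribution
print(df.describe())                # mean, std, min/max, quartiles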
Data Cleaning
 Cleaning: Fix or remove the anomalies discovered.
 Missing Values:
 Drop row: missing values in a column rarely happen and occur at random
 Drop column: most of the column’s values are missing, and occur at random
 Assign value (impute): mean/median value, or prediction using linear regression
 Do nothing but flag
Source: https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4
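A minimal sketch of these options in pandas (continuing with the hypothetical data.csv example; the Calories column is an assumption):

import pandas as pd

df = pd.read_csv("data.csv")                 # hypothetical file

rows_dropped = df.dropna()                   # drop rows that contain any missing value
cols_dropped = df.dropna(axis="columns")     # drop columns that contain any missing value

imputed = df.copy()                          # impute missing values with the median
imputed["Calories"] = imputed["Calories"].fillna(imputed["Calories"].median())

flagged = df.assign(calories_missing=df["Calories"].isna())   # do nothing but flag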
Data Cleaning
 Cleaning: Fix or remove the anomalies discovered. (continued)
 Remove irrelevant/duplicate data
 Convert data types: ex: “Make sure numbers are stored as numerical data types”
 Massage string data: fix typos, remove extra white space, capitalize, etc.
 Standardize data: same unit of measurement (ex: m or cm or mm), European or USA convention (date format)
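A minimal sketch of these fixes with pandas (the toy DataFrame below is illustrative):

import pandas as pd

df = pd.DataFrame({
    "city":   [" Pilani", "pilani ", "Goa"],
    "amount": ["10", "10", "25"],
    "date":   ["21/08/2023", "21/08/2023", "22/08/2023"],
})

df["amount"] = df["amount"].astype(int)                      # store numbers as a numerical data type
df["city"] = df["city"].str.strip().str.title()              # trim extra white space, fix capitalization
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y")   # standardize the date format
df = df.drop_duplicates()                                    # remove duplicate rows

print(df)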
Data Cleaning
 Reporting: A report about the changes made and the quality of the currently stored data is recorded.
Source: https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4
Data Cleaning Tools:
 Drag-and-Drop Tools
 Script Based

Drag-and-Drop app for Data Cleaning
 Trifacta Wrangler: "Messy Data Accepted" https://www.trifacta.com/products/wrangler/
 Originally the Stanford/Berkeley Data Wrangler project
Drag-and-Drop App for Data Cleaning
 OpenRefine: http://openrefine.org/
 Originally Google Refine.
 Links to other free/commercial data cleaning tools can be found at https://en.wikipedia.org/wiki/Data_cleansing
Scripting Languages
 Functions in Python Pandas
 Javascript Array functions and additional tools in D3
 Functions in R Tidyverse
Two Categories of Visualization Tools
 Drag-and-Drop Tools
 Based on Scripting Languages
Two Categories of Tools
 Drag-and-Drop Tools:
 Rely on a Graphical User Interface
 Make assumptions about what you may like to do
 Ex: You may drag SALES to the Y-axis and DATE to the X-axis. The tool assumes that you are interested in graphing total sales per month.
 Examples: Tableau, Power BI
 Based on Scripting Languages
 You decide what and how you want to create the visualization.
Tableau
 #1 most-used Business Intelligence software tool
 Pros: Simple and easy for a beginner user
 connect to your data source, whether an Excel file, a database connection, or any of the dozens of other connection options
 drag the variable names you want onto a graph object (a “sheet”) and customize as you see fit.
 combine sheets into a dashboard in whatever configuration you like, and get creative with parameters, filters, or other customization options.
 Allows public and private hosting
 Free license for students
 Cons: “A minute to learn, a lifetime to master!”
 Less flexible
Microsoft Power BI
 Growing to be the most successful analytics and business intelligence platform
 Pros: Like Tableau, simple and easy for a beginner user
 connect to your data source, whether an Excel file, a database connection, or any of the dozens of other connection options
 drag the variable names you want onto a graph object (a “sheet”) and customize as you see fit.
 Easy integration with analysis
 Cons:
 Development time can be long
 Expensive
Two Categories of Tools
 Drag-and-Drop Tools
 Based on Scripting Languages
 Ex: matplotlib and Plotly in Python; D3, Observable Plot, Plotly in Javascript; ggplot, Plotly in R, …
 Better control over the result, but you have to be explicit about what you want.
This class will use Scripting Languages
 Python: scripting language
 Plotly and matplotlib: for data visualization
Matplotlib
 Matplotlib is a popular Python library for creating visualizations.
 matplotlib.pyplot: the primary module used for creating visualizations. It provides a simple interface for creating plots and charts.
 Pros:
 Versatile and customizable
 Wide adoption: being one of the oldest and most popular plotting libraries in Python, Matplotlib is widely adopted and often considered the starting point for many data visualization tasks.
 Cons:
 Steep learning curve.
 Limited interactivity.
 Its default styles might not always produce the most aesthetically pleasing plots compared to some other libraries.
 As in any scripting-based visualization tool, a lot of code to write.
 We will mostly use Pandas.plot and Seaborn: libraries developed on top of Matplotlib that reduce the coding load.
https://matplotlib.org/
Plotly
 Open Source graphics library for creating interactive, publication-quality graphs. It has concise and (hopefully) memorable functions to foster fluency.
 Pros:
 Interface is available in Python, R, Javascript, Matlab, Julia
 Great support for interaction
 Beautiful visualizations
 Cons:
 As in any scripting-based visualization tool, a lot of code to write.
 We will mostly use Plotly Express (plotly.express): a library developed on top of Plotly that reduces the coding load.
https://plotly.com/graphing-libraries/
Course Goal: Exploratory Data Analysis
 Learn how to examine data and relationships among variables through visualization and statistical tools, with a goal towards
 Building insight into the data and the process that generated the data.
 Finding out what may be interesting.
 Determining which variables have the most predictive power.

Plotly and Matplotlib for Pandas
Grammar of Graphics
 Independently specify building blocks and combine them to create just about any kind of graphical display you want.
 Building blocks of a graph:
 data
 geometric shapes
 coordinate system
 aesthetic mapping: mapping of data dimensions to visual dimensions
 scales
 statistical transformations
 position adjustments
 faceting
Example
import pandas as pd
alphabets = pd.read_csv("english-letter.csv")

Example
Columns / Channels / Data Dimensions
import pandas as pd
alphabets = pd.read_csv("english-letter.csv")
Data Array / Data Frame / Data Table
Plotly Example
import pandas as pd
import plotly.express as px
alphabets = pd.read_csv("english-letter.csv")

px.bar(alphabets,
       x="letter",
       y="frequency"
)

Geometric shape: bar. Data Array: alphabets. Data Channels: x="letter", y="frequency".
Pandas Matplotlib Example
import pandas as pd
alphabets = pd.read_csv("english-letter.csv")
alphabets.plot.bar(x="letter", y="frequency",
                   figsize=(10, 5), rot=0, width=0.8)

Geometric shape: bar. Data Array: alphabets. Data Channels: x="letter", y="frequency".
Plotly Example
import plotly.express as px
iris = px.data.iris()

Datasets: https://plotly.com/python-api-reference/generated/plotly.data.html
Plotly Example
import plotly.express as px
iris = px.data.iris()

fig = px.scatter(
    iris,
    x="sepal_width",
    y="sepal_length")
fig.show()

Geometric shape: scatter points. Data Channels: x, y.
Plotly Example
import plotly.express as px
iris = px.data.iris()

fig = px.scatter(iris,
                 x="sepal_width",
                 y="sepal_length",
                 color="species"
)
fig.show()

Geometric shape: scatter points. Data Dimensions: sepal_width, sepal_length, species. Visual Attributes / Visual Dimensions: x, y, color.
Matplotlib (Seaborn) Example
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
iris = px.data.iris()

fig, ax = plt.subplots(
    figsize=(10, 5)
)
sns.scatterplot(data=iris,
                x="sepal_width",
                y="sepal_length",
                hue="species",
                ax=ax)
Plotly Example
import plotly.express as px
iris = px.data.iris()
fig = px.scatter(iris,
                 x="sepal_width",
                 y="sepal_length",
                 color="species",
                 color_discrete_sequence=px.colors.qualitative.Light24
)
fig.show()

color_discrete_sequence: the mapping function or scaling function.
Data Graphics types
Both Plotly and Matplotlib support a large number of data
graphics types.
 Commonly used ones are
 Basic charts
 line chart
 bar chart
 scatter plot, bubble chart
 pie chart
 Statistical charts
 histogram
 box plot
 violin plot
 error plot
 distribution plots
 geo plots
 choropleths
 geobubble plots
 …
Main Components of Data Visualization
 Data Component
 Geometry/Graphic Component
 ex: scatter plot, barplot, histogram, smooth densities, boxplot.
 Aesthetics Component
 Visual cues to represent the information provided by the dataset. Ex:
 Position: the two most important cues in a plot are the point positions on the x-axis and y-axis.
 Color, Shape, Size, …: other important visual cues.
Scales
 Scales map data onto aesthetics
 a scale must be one-to-one
Location and Coordinate System
 Cartesian Coordinates:
 2D Cartesian coordinates are the most widely used in data visualization
 Axes are orthogonal
 Represent both positive and negative real numbers.
Example

MTCar Dataset:
Motor Trend magazine 1974
Effectiveness of Various Visual Properties for Data
 Some graphical properties are more effective than others when it comes to conveying information.

Effectiveness of Various Visual Properties for Quantitative Data
 Order of effectiveness in mapping:
 Position on the axis
 Length/width of line
 Area of shape
 Color saturation

Effectiveness of Various Visual Properties for Ordinal Data
 Order of effectiveness in mapping:
 Position on the axis
 Color saturation
 Color hue / texture
 Shape
Source: Mackinlay et al. 1986

Effectiveness of Various Visual Properties for Nominal Data
 Order of effectiveness in mapping:
 Position on the axis
 Color hue
 Texture
 Shape
Source: Mackinlay et al. 1986
Mapping and Scales
Aesthetics
 Visualization of all types of data needs a graphic element
 Dot, line, box, text, pie, …
 Aesthetics describe various aspects of graphic elements
 See examples
 Some can represent continuous data, some can represent discrete data, and some can represent both.
Scale Functions
 Scale functions map a dimension of abstract data to its visual representation.
 They take an input in a certain interval called the domain: a data dimension
 a number, date or category
 and return an output in another interval: a visual property
 a position coordinate
 a length or a radius
 a color
 Consider 2 types of data:
 Continuous: quantitative
 Discrete: an explicit set of values; ordinal, categorical
Continuous Input, Continuous Output
 Linear Scale
 The most suitable scale for transforming data values into positions and lengths
 Uses a linear function (y = m * x + b) to interpolate across the domain and range
 Domain and Range:
 Domain: [minCoordinate, maxCoordinate]
 Range: [0, width]
Continuous Input, Continuous Output
 Linear Scale: uses a linear function (y = m * x + b)
 SQRT Scale: uses a function linear in sqrt of x (y = m * sqrt(x) + b)
 Log Scale: uses a function linear in log of x (y = m * log(x) + b)
https://clauswilke.com/dataviz/coordinate-systems-axes.html
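A minimal Python sketch of such scale functions (the helper name and the example domain/range are illustrative): each scale maps a data value to a pixel position by interpolating linearly in x, sqrt(x) or log(x).

import math

def make_scale(domain, range_, transform=lambda v: v):
    # map transform(value) linearly from the domain interval to the range interval
    d0, d1 = transform(domain[0]), transform(domain[1])
    r0, r1 = range_
    def scale(value):
        t = (transform(value) - d0) / (d1 - d0)   # normalized position within the domain
        return r0 + t * (r1 - r0)                 # interpolated position within the range
    return scale

linear = make_scale((0, 100), (0, 500))
sqrt_s = make_scale((0, 100), (0, 500), math.sqrt)
log_s  = make_scale((1, 100), (0, 500), math.log)   # log scale needs a positive, non-zero domain

print(linear(50), sqrt_s(50), log_s(50))   # 250.0, ~353.6, ~424.7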
Continuous Input, Continuous Output
 Polar (Curvilinear) Scale
 Linear transformation of data values into angles and radial distances from the origin.
 Uses a linear function (θ = m * x + b) to interpolate across the domain and range
 Generally, one of the axes is assigned to the discrete input data
 Domain and Range:
 Domain: [0, maxCoordinate]
 Range: [0, 2π]
 Use: for cyclic data
https://clauswilke.com/dataviz/coordinate-systems-axes.html
Continuous data to continuous
Color

https://clauswilke.com/dataviz/coordinate-systems-axes.html
Scale Functions
 Scales with discrete input and discrete output
 Map discrete values (specified by an array) to discrete values (also specified by an array). The domain array specifies all possible input values and the range specifies all possible output values. The range array will repeat if it is shorter than the domain array.
 Ex: colorScale
 Domain: list/array of categorical values
 Range: list/array of positions, colors, …
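A minimal Python sketch of such a discrete-to-discrete scale (the function name and example values are illustrative); note how the range repeats when it is shorter than the domain.

def make_ordinal_scale(domain, range_):
    # map each domain value to a range value, repeating the range if it is shorter
    mapping = {d: range_[i % len(range_)] for i, d in enumerate(domain)}
    return lambda value: mapping[value]

color_scale = make_ordinal_scale(
    ["setosa", "versicolor", "virginica", "other"],
    ["red", "green", "blue"],        # shorter than the domain, so it repeats
)

print(color_scale("virginica"))   # blue
print(color_scale("other"))       # red  (the range wrapped around)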
Discrete Input to Discrete Output
 A discrete set of categorical values maps to equally spaced points along the specified range.

Discrete Input to Distinguishable Color Output: Categorical Data
https://clauswilke.com/dataviz/coordinate-systems-axes.html
Summary of Data and types

Main Components of Data Visualization
 Data Component
 Geometry/Graphic Component
 ex: lines, bars, symbols.
 Aesthetics Component
 Visual cues to represent the information provided by the dataset. Ex:
 Position: the two most important cues in a plot are the point positions on the x-axis and y-axis.
 Color, Shape, Size, …: other important visual cues.
Scale Functions
 Scale functions map a dimension of abstract data to its visual representation.
 They take an input in a certain interval called the domain: a data dimension
 a number, date or category
 and return an output in another interval: a visual property
 a position coordinate
 a length or a radius
 a color
 Consider 2 types of data:
 Continuous: quantitative
 Discrete: an explicit set of values; ordinal, categorical
Continuous Input, Continuous Output: Linear Scale
 The most suitable scale for transforming data values into positions and lengths
 Uses a linear function (y = m * x + b) to interpolate across the domain and range
 Domain and Range:
 Domain: [minCoordinate, maxCoordinate]
 Range: [0, width]
Continuous Input, Continuous Output: Log Scale
 The most suitable scale for transforming data values when the data spans a large range, such as an exponential growth rate.
 Ex: population of countries, revenue of companies.
 Uses a log function (y = m * log(x) + b) to interpolate across the domain and range
 Domain and Range:
 Positive (non-zero) Domain: [minCoordinate, maxCoordinate]
 Range: [0, width]
Continuous Input, Continuous Output: Sqrt Scale
 The most suitable scale for transforming data values into the diameter of a dot.
 Uses a sqrt function (y = m * sqrt(x) + b) to interpolate across the domain and range
 Domain and Range:
 Non-negative Domain: [minCoordinate, maxCoordinate]
 Range: [0, width]
Continuous Input, Continuous Output
 Linear Scale: uses a linear function (y = m * x + b)
 SQRT Scale: uses a function linear in sqrt of x (y = m * sqrt(x) + b)
 Log Scale: uses a function linear in log of x (y = m * log(x) + b)
https://clauswilke.com/dataviz/coordinate-systems-axes.html
Gapminder Data Source:
https://plotly.com/python/plotly-express/
Continuous Input, Continuous Output: Polar (Curvilinear) Scale
 Linear transformation of data values into angles and radial distances from the origin.
 Uses a linear function (θ = m * x + b) to interpolate across the domain and range
 Generally, one of the axes is assigned to the discrete input data
 Domain and Range:
 Domain: [0, maxCoordinate]
 Range: [0, 2π]
 Use: for cyclic data
https://clauswilke.com/dataviz/coordinate-systems-axes.html
Examples: Nightingale’s Rose plot, Radial Bar Plot, Star plot (Spider plot)

Sources: https://support.apple.com/en-in/guide/watch-ultra/apde9218b440/watchos and https://datavizcatalogue.com/methods/radar_chart.html
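A small sketch of a radar/star plot on a polar coordinate system with Plotly Express (the wind sample dataset and its column names come from px.data.wind()):

import plotly.express as px

# Cyclic data (wind direction) on a polar coordinate system:
# 'direction' becomes the angle, 'frequency' the radial distance.
df = px.data.wind()
fig = px.line_polar(df, r="frequency", theta="direction",
                    color="strength", line_close=True)
fig.show()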
Continuous data to continuous
Color

https://clauswilke.com/dataviz/coordinate-systems-axes.html
Continuous data to
continuous Color

Source: The scale and drivers of carbon footprints in households, cities and regions across India
January 2021 Global Environmental Change 66(11):102205
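A sketch of mapping a continuous variable to a continuous color scale with Plotly Express (iris column names from px.data.iris()):

import plotly.express as px

# Petal length (continuous) is mapped to a continuous color scale,
# so color encodes a quantitative value rather than a category.
df = px.data.iris()
fig = px.scatter(df, x="sepal_width", y="sepal_length",
                 color="petal_length",
                 color_continuous_scale="Viridis")
fig.show()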
Scale Functions

 Scales with discrete input and discrete output


 maps discrete values (specified by an array) to discrete values (also
specified by an array). The domain array specifies all possible input
values and the range specifies all possible output values. The range
array will repeat if it’s shorter than the domain array.
 Ex: colorScale
 Domain: list/array of categorical values
 Range: List/array of positions, colors, …
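A minimal sketch of such a discrete-to-discrete scale in plain Python (the cycling behavior described above; the function name make_ordinal_scale is illustrative):

# Discrete input -> discrete output: each domain category is paired with an
# output value; the range repeats (cycles) if it is shorter than the domain.
def make_ordinal_scale(domain, range_):
    mapping = {d: range_[i % len(range_)] for i, d in enumerate(domain)}
    return lambda value: mapping[value]

color_scale = make_ordinal_scale(
    ["setosa", "versicolor", "virginica"],
    ["#1f77b4", "#ff7f0e"],            # shorter range -> colors repeat
)
print(color_scale("virginica"))        # '#1f77b4' (range wrapped around)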
Discrete Input to Discrete Output

 Maps a discrete set of categorical values to equally spaced points along
the specified range.
Discrete Input to Distinguishable
Color Output: Categorical Data

https://clauswilke.com/dataviz/coordinate-systems-axes.html
Gapminder plot

Source: https://www.gapminder.org/tag/chart/
Summary of Data Types
Common, Effective
Data Visualization
Techniques
1D DATA
1D Data

Iris Data Set

import plotly.express as px
iris = px.data.iris()
Visualizing 1D Data

 Tabulation in order
 Nominal Data: Alphabetic order
 Ordinal and Quantitative data:
Could be increasing or decreasing
Visualizing 1D Data

 nominal data
 Tabulate
 Sorted Tabulate
 Total Count of unique category

# Count occurrences of each unique category
import plotly.express as px
tips = px.data.tips()
tips["day"].value_counts()
Visualizing 1D Data

 nominal data
 Tabulate
 Sorted Tabulate
 Total Count of unique category
 Bar plot (Histogram plot)
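A sketch of tabulating and bar-plotting a nominal column with pandas and Plotly Express (tips column names from px.data.tips()):

import plotly.express as px

tips = px.data.tips()

# Tabulate: count of each unique category, sorted by count
counts = tips["day"].value_counts()
print(counts)

# Bar plot of the category counts
fig = px.bar(x=counts.index, y=counts.values,
             labels={"x": "day", "y": "count"})
fig.show()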
1D Quantitative Data

 iris dataset

iris = px.data.iris()
1D Quantitative Data

Iris Data Set


1D Quantitative Data

Old Faithful
Data Set
Distribution  Examining sets of quantitative values:
 How are the values distributed from
Analysis lowest to highest?

Ref: Chapter 6. “Distribution” in Better Data Visualization


1D Quantitative Data

 iris dataset
 Distribution analysis

Distribution Analysis
 3 Key Characteristics of Distribution
 Spread: The difference between the maximum value and the minimum value (the range of the data).
 Wide or narrow
 Center: The mean/median of the data
 Mean: Mathematical average
 Median: One half of the data lie on one side of the median and the other half on the other side.
 Shape:
 Symmetric/uniform/skewed
 Unimodal/multimodal
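These characteristics can be computed directly with pandas; a small sketch on the iris sample bundled with Plotly Express:

import plotly.express as px

iris = px.data.iris()
x = iris["sepal_length"]

print("spread (range):", x.max() - x.min())   # max - min
print("mean:  ", x.mean())                    # center: mathematical average
print("median:", x.median())                  # center: half the data on each side
print(x.describe())                           # count, mean, std, quartiles, ...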
Data
Distribution
 Spread
 Range of Data
 Difference between the
largest and smallest
point

 Interquartile range:
 Measure of difference
between upper (75%)
and lower quartile (25%)
 Where the majority of values lie.

Source: https://www.onlinemathlearning.com/quartile.html
Box plot: Distribution visualization
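A sketch of a box plot for a single quantitative column with Plotly Express:

import plotly.express as px

iris = px.data.iris()
# Box plot of one quantitative variable: the box shows the quartiles,
# the whiskers the rest of the distribution, and dots the outliers.
fig = px.box(iris, y="sepal_length")
fig.show()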
Histogram: Distribution Visualization

 Histogram:
 A better way to visualize the
distribution of 1D data. Often used
in statistical analysis.
 shows the number of data points
(frequency) that lie within intervals,
called bins
 visualized as a collection of
rectangles. The frequency and the
width of the bin interval represent
the height and width of a
rectangle.
Histogram: Distribution Visualization

 Y-axis can represent:


 Count
 Percent
 Probability Density
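A sketch with Plotly Express showing the bin count and the different y-axis normalizations (the histnorm values are Plotly's own: "percent", "probability density", or omitted for raw counts):

import plotly.express as px

iris = px.data.iris()

# Default y-axis: count of values falling in each bin
px.histogram(iris, x="sepal_length", nbins=20).show()

# y-axis as percent of all observations
px.histogram(iris, x="sepal_length", nbins=20, histnorm="percent").show()

# y-axis as probability density (bar areas sum to 1)
px.histogram(iris, x="sepal_length", nbins=20,
             histnorm="probability density").show()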
Histogram: Distribution Visualization

 Statistical parameters
shown on the histogram
Histogram

 Best Practices
 Keep interval constant
 Select best interval

Source: https://statistics.laerd.com/statistical-guides/understanding-histograms.php
Histogram: Distribution Visualization

 Histogram with a continuous density plot overlaid.
Density Plot

 A Density Plot visualizes the distribution of data over a
continuous interval or time period.
 Uses kernel smoothing to plot
values, allowing for smoother
distributions by smoothing out the
noise.
 The peaks of a Density Plot help
display where values are
concentrated over the interval.
Visualizing 1D Quantitative Data

 Density Plot: a smoothed histogram.
Figure: Density plot over histogram of eruption durations from the Old Faithful dataset.
 Resources:
 Wiki page on Kernel Density Estimation: https://en.wikipedia.org/wiki/Kernel_density_estimation
 https://mathisonian.github.io/kde/
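A sketch of a density plot overlaid on a histogram using Plotly's figure factory (requires scipy for the kernel density estimate); the iris sepal length stands in here for the Old Faithful data:

import plotly.express as px
import plotly.figure_factory as ff

iris = px.data.iris()
data = [iris["sepal_length"].tolist()]

# create_distplot draws a histogram, a KDE curve and a rug plot together
fig = ff.create_distplot(data, group_labels=["sepal_length"], bin_size=0.2)
fig.show()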
Density plot of a Normal distribution

 Spread
 Range of Data
 Difference between the largest and
smallest point

 Interquartile range:
 Measure of difference between
upper (75%) and lower quartile (25%)
 Where the majority of values lie.

 Mean and Standard deviation:


 +/- 1 SD: 68%
 +/- 2 SD: 95%
 +/- 3 SD: 99.7%

Source: https://www.onlinemathlearning.com/quartile.html
Data
Distribution
 Shape
 Symmetric
 Uniform
 Skewed

Source: https://towardsdatascience.com/intro-to-descriptive-statistics-252e9c464ac9
Data
Distribution
 Shape
 Symmetric
 Uniform
 Skewed
 Multi-modal

Source: https://towardsdatascience.com/intro-to-descriptive-statistics-252e9c464ac9
Distribution
Display
 Single distribution display
 Histogram
 Frequency polygon:
 A line graph that
exclusively draws
attention to the
distribution’s shape with
minimal distraction

 Strip plot
 Stem-and-leaf plot
Frequency Polygon

 uses the same two axes as a histogram,
 is constructed by placing a point
at the center of each interval such
that the height of the point is
equal to the frequency or relative
frequency associated with that
interval.
 Points may also be placed on the
horizontal axis at the midpoints of
the empty intervals just before the first
and after the last bin, so that the
polygon begins and ends on the axis.
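A sketch of a frequency polygon built from histogram bin counts (numpy for the binning, Plotly for the line):

import numpy as np
import plotly.express as px
import plotly.graph_objects as go

iris = px.data.iris()
counts, edges = np.histogram(iris["sepal_length"], bins=15)
midpoints = (edges[:-1] + edges[1:]) / 2      # one point per bin, at its center

# A line through the bin midpoints is the frequency polygon
fig = go.Figure(go.Scatter(x=midpoints, y=counts, mode="lines+markers"))
fig.update_layout(xaxis_title="sepal_length", yaxis_title="frequency")
fig.show()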
Distribution
Display
 Single distribution display
 Histogram
 Frequency polygon:
 Strip plot
 1-D scatter plot

 Stem-and-leaf plot
Visualizing 1D Quantitative Data

 Strip Plot:
 the viewer gets an idea
about the range(s) of
values that are more
frequent and those that are
less frequent

Point plot (top) and Jitter plot (bottom) of Iris Sepal length.
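A sketch of a strip (jitter) plot with Plotly Express:

import plotly.express as px

iris = px.data.iris()
# px.strip jitters the points so overlapping values stay visible
fig = px.strip(iris, x="sepal_length")
fig.show()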
Distribution
Display
 Single distribution display
 Histogram
 Frequency polygon:
 Strip plot
 Stem-and-leaf plot
 Similar to histogram
 Data is sorted and
divided into stem intervals
at appropriate places
 Works well for small data
set.
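There is no standard Plotly chart for stem-and-leaf displays; a minimal plain-Python sketch (stems = tens digit, leaves = units digit, on a small made-up list of values):

from collections import defaultdict

# Small illustrative data set (hypothetical scores, not from the course data)
data = [44, 46, 47, 49, 63, 64, 66, 68, 68, 72, 72, 75, 76, 81, 84, 88]

stems = defaultdict(list)
for value in sorted(data):
    stems[value // 10].append(value % 10)     # stem = tens, leaf = units

for stem in sorted(stems):
    print(f"{stem} | {' '.join(str(leaf) for leaf in stems[stem])}")
# 4 | 4 6 7 9
# 6 | 3 4 6 8 8
# ...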
Violin Plot for 1D Data Visualization

 Violin plots are similar to box plots,
except that they also show the
probability density of the data at
different values.
Anatomy of
Violin Plot
 a combination of a Box Plot and
a Density Plot.
 The white dot in the middle is the
median value and the thick black
bar in the center represents the
interquartile range.
 Sometimes the violin is clipped at
the ends of the thin box-plot line (the whiskers).

Ref: How to Interpret Violin Plot


Violin plot vs Box plot
 Box Plots are limited in their display of the data, as their visual simplicity tends to hide significant details about how values in the data are distributed.
 For example, with Box Plots, you can't see if the distribution is bimodal or multimodal.
 While Violin Plots display more information, they can be noisier than a Box Plot.

https://datavizcatalogue.com/methods/violin_plot.html
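A sketch of a violin plot with Plotly Express, with the inner box plot and individual points switched on:

import plotly.express as px

iris = px.data.iris()
fig = px.violin(iris, y="sepal_length",
                box=True,        # draw the box plot inside the violin
                points="all")    # also show the individual observations
fig.show()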
Multiple Distribution Display

Iris Data Set


Distribution
Display
 Multiple distribution display
 Box Plots
 Violin Plots
 Multiple Strip plots
 Frequency Polygons

Source: https://datavizcatalogue.com/methods/box_plot.html
Distribution
Display
 Multiple distribution display
 Box Plots
 Violin Plots
 Multiple Strip plots
 Frequency Polygons

Source: https://blogs.sas.com/content/graphicallyspeaking/2013/03/24/custom-box-plot
Distribution
Display
 Multiple distribution display
 Box Plots
 Violin Plots
 Multiple Strip plots
 Frequency Polygons

https://datavizcatalogue.com/methods/violin_plot.html
Distribution Display

 Multiple distribution display


 Box Plots
 Violin Plots
 Multiple Strip plots
 Frequency Polygons
Distribution Display

 Multiple distribution display


 Box Plots
 Violin Plots
 Multiple Strip plots
 With jitter

 Frequency Polygons
Distribution
Display
 Multiple distribution display
 Box Plots
 Violin Plots
 Multiple Strip plots
 Frequency Polygons

Comparing City Mileage levels of 4 types of car


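A sketch of comparing distributions across groups (iris species stand in here for the car types in the figure); the same pattern works with px.violin and px.strip:

import plotly.express as px

iris = px.data.iris()
# One box per category: the x aesthetic splits the data into groups
fig = px.box(iris, x="species", y="sepal_length", points="all")
fig.show()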
Population Pyramid

 A Population Pyramid
is a pair of back-to-
back Histograms (for
each sex) that
displays the
distribution of a
population in all age
groups and in both
sexes.
Population Pyramid

 The X-axis is used to plot population numbers and the Y-axis
lists all age groups.
 Population Pyramids are ideal for
detecting changes or differences
in population patterns.
 Multiple Population Pyramids can
be used to compare patterns
across nations or selected
population groups or across
time,…

https://datavizcatalogue.com/methods/population_pyramid.html
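A minimal sketch of a population pyramid with Plotly graph objects (the age groups and counts below are made-up illustrative numbers, not real census data):

import plotly.graph_objects as go

# Hypothetical illustrative numbers (not real data)
ages = ["0-14", "15-29", "30-44", "45-59", "60-74", "75+"]
male = [120, 140, 150, 130, 90, 40]
female = [115, 135, 155, 140, 110, 60]

fig = go.Figure()
# Male counts are negated so their bars extend to the left of the axis
fig.add_trace(go.Bar(y=ages, x=[-v for v in male],
                     name="Male", orientation="h"))
fig.add_trace(go.Bar(y=ages, x=female, name="Female", orientation="h"))
fig.update_layout(barmode="relative",
                  xaxis_title="Population", yaxis_title="Age group")
fig.show()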