IDV-02-Data Foundations

Interactive Data Visualization

02
Data Foundations

IDV 2019/2020
Notice

! Author

" João Moura Pires ([email protected])

! This material can be freely used for personal or academic purposes without

any previous authorization from the author, provided that this notice is kept

with it.

! For commercial purposes the use of any part of this material requires the

previous authorization from the author.

Data Foundations - 2
Bibliography

! Many examples are extracted and adapted from

" Interactive Data Visualization: Foundations, Techniques, and Applications,

Matthew O. Ward, Georges Grinstein, Daniel Keim, 2015

" Visualization Analysis & Design,

Tamara Munzner, 2015

Data Foundations - 3
Table of Contents

! Introduction

! Data by Matthew O. Ward et al.

! Data by Tamara Munzner

! Structure within and between records

! Data Preprocessing

Data Foundations - 4
Interactive Data Visualization

Some practical Information

Data Foundations - 5
Evaluation rules

! Two mid-term written individual tests (25% each)

! One project (for team of 3 students), with several phases:

! Specification

! Paper (20%)

! Code/implementation (30%)

! (*) an oral discussion will be required to validate the project components

! Course approval requires the following minimal grades:

! (mean (Test1; Test2) >= 10) AND (Test1 >= 8) AND (Test2 >= 8)

! (mean(Paper;Code&Implementation) >= 10) AND

! Final exam may replace mean (Test1; Test2) if project is approved.

Data Foundations - 6
Important dates

! Team registration - March 20th

! Select datasets for your project - March 25th - April 24th

" Discuss in the lab sessions the viability

" Evaluate the selected datasets

" Define and get an approval of your research questions

" Survey the state of the art

! Paper - May 15th

Data Foundations - 7
Team Registration

! Access the shared Google Sheet

" Enter the 3 students in one available slot, using only the yellow cells.

! You will receive (later) access to a shared folder for the team: VID-19-20-GNN

" Use this folder to share the information inside the group

" And with the teacher

! You will receive (later) an invitation to Tableau Online

Data Foundations - 8
Interactive Data Visualization

Recap from previous lecture

Data Foundations - 9
What is the Goal of Data Visualization?

The (ultimate) goal of DV

“Data visualization is not just about seeing data!

Is about UNDERSTANDING data,

and being able to make decisions based on the data”

by John C. Hart

Introduction to Data Visualization - 10


What is the core idea of Interactive Data Visualization?

Mapping data to Visual Variables

Interactivity

Question(s) / Task

Introduction to Data Visualization - 11


What you should know
! What is Data Visualization.

" Understanding the data => take decisions

! Data Visualization can be extremely powerful

" Uncover new patterns; confirm hypotheses

! Why Visualization is important.

" Stats not enough; communication needs; exploratory needs

! Key aspects of today's visualizations.

" Interactions; visual abstractions; multiple (linked) visualizations.

! The general steps of a Visualization Process

" Raw data -> data -> viz structures -> images -> perception + feedback

! The role of Perception.

" The role and the importance of the user.

Data Foundations - 12
Interactive Data Visualization

Introduction to Data Foundations

Data Foundations - 13
Visualization Process: visualization pipeline

! For visualization the stages are:


! Modeling: the data to be visualized
! Data Selection: similar to clipping
! Data to visual mappings: the heart of the visualization is mapping data values to
graphical entities or their attributes; may involve scaling, shifting, filtering,
interpolating, or subsampling.
! Scene parameter setting: (ex: color mapping)
! Rendering or generation of the visualization
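
The stages above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the records, the function names (`select`, `linear_scale`, `render`) and the text-bar "rendering" are hypothetical and not part of the course material. `max_width` plays the role of a scene parameter.

```python
# Modeling: hypothetical records (city, month, temperature in °C).
# The values are made up for illustration only.
records = [
    ("Lisbon", 1, 11.6),
    ("Lisbon", 7, 23.9),
    ("Porto", 1, 9.5),
    ("Porto", 7, 20.6),
]


def select(rows, city):
    """Data selection: keep only the rows of interest (similar to clipping)."""
    return [r for r in rows if r[0] == city]


def linear_scale(value, domain, out_range):
    """Data-to-visual mapping: map a value from a data domain to a visual range."""
    d0, d1 = domain
    r0, r1 = out_range
    return r0 + (value - d0) / (d1 - d0) * (r1 - r0)


def render(rows, max_width=40):
    """Rendering: one text bar per record; max_width is a scene parameter."""
    lines = []
    for city, month, temp in rows:
        width = round(linear_scale(temp, (0.0, 30.0), (0, max_width)))
        lines.append(f"{city} {month:02d} " + "#" * width)
    return lines


bars = render(select(records, "Lisbon"))
print("\n".join(bars))
```

Changing the domain or range passed to `linear_scale` changes the picture without touching the data, which is why the mapping stage is treated as the heart of the pipeline.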

Data Foundations - 14
Data: Sources

! Sources

" Sensors;

" Surveys;

" Simulations;

" Computations;

" Log of human and machine activity

! Raw versus Processed data

" Raw data (untreated)

" Processed: smoothing, noise removal, scaling, interpolation, aggregation

Data Foundations - 15
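
Three of the processing operations named on the slide above (smoothing, scaling, aggregation) can be sketched as follows. The function names and the sample readings are assumptions made for this illustration; real projects would normally rely on a library for these steps.

```python
def moving_average(values, window=3):
    """Smoothing: mean over a centered window (the window shrinks at the edges)."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def min_max_scale(values):
    """Scaling: map values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def aggregate_mean(values, group_size):
    """Aggregation: mean of consecutive groups of readings."""
    groups = [values[i:i + group_size] for i in range(0, len(values), group_size)]
    return [sum(g) / len(g) for g in groups]


raw = [10.0, 12.0, 50.0, 11.0, 13.0, 12.0]  # made-up readings with one spike
smoothed = moving_average(raw)
scaled = min_max_scale(raw)
pairs = aggregate_mean(raw, 2)
```

Note that each operation discards some information from the raw data, so it is worth keeping the raw values available next to the processed ones.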
Data: typical data set in visualization

! List of n records

! (r1, r2, …, rn )

! a record ri consists of m (one or more) observations or variables

( v1, v2, …, vm )

! one observation may be:

• a single number / symbol / string

• a more complex structure

! A variable may be classified as:

• independent: whose value is not controlled or affected by another variable

• dependent: whose value is affected by the variation in one or more associated

independent variables

Data Foundations - 16
Data: typical data set in visualization

! A record r consists of mi independent variables and md dependent variables

r = ( iv1, iv2, …, ivmi , dv1, dv2, …, dvmd )

! We may not know which variables are dependent and which are independent.

! In general a data set will not contain an exhaustive list of all possible

combinations of values for the independent variables

! A data set can be seen as a function

F: Domain of independent variables → Range of dependent variables
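
A small sketch of the "data set as a function" view, assuming the independent variables come first in each record. The records and the names `as_function` and `N_INDEPENDENT` are hypothetical. Because a data set rarely lists every combination of independent values, the function is partial: some lookups find nothing.

```python
# Hypothetical data set: independent variables (x, y), dependent variable temperature.
records = [
    (0, 0, 14.2),
    (0, 1, 15.0),
    (1, 0, 13.8),
]

N_INDEPENDENT = 2  # m_i: assumed number of leading independent variables


def as_function(rows, n_independent):
    """View the data set as a partial function F: independent -> dependent values."""
    return {tuple(r[:n_independent]): tuple(r[n_independent:]) for r in rows}


F = as_function(records, N_INDEPENDENT)

print(F[(0, 1)])      # dependent values recorded at x=0, y=1
print(F.get((1, 1)))  # None: this combination is not in the data set
```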

Data Foundations - 17
Interactive Data Visualization

Data
(Matthew O. Ward et al.)

Data Foundations - 18
Interactive Data Visualization

Data Types

Data Foundations - 19
Types of data. Numeric versus Non-Numeric

! In its simplest form each variable of a record has a single piece of

information (scalar values)

! Numeric (ordinal):

! binary: assuming only the values 0 and 1;

! discrete: integer values or values from a specific subset (e.g., {2, 4, 6, 8, 10});

! continuous: representing real values (e.g., [0, 100]).

! Non-numeric (nominal):

! categorical: finite (normally short) list of values (e.g., red, green, blue);

! ranked: a categorical variable that has an implied order (e.g., small, medium, large);

! arbitrary: potentially infinite range of values (e.g., names, addresses).
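
As a rough illustration of this classification, the sketch below guesses the type of a variable from its observed values. It is a heuristic under assumptions: the function name and the `max_categories` threshold are invented here, and a ranked variable cannot be told apart from a categorical one without knowing the intended order.

```python
def classify_variable(values, max_categories=10):
    """Guess the type of a variable from its observed values (heuristic only)."""
    distinct = set(values)
    numeric = all(isinstance(v, (int, float)) for v in distinct)
    if numeric:
        if distinct <= {0, 1}:
            return "binary"
        if all(float(v).is_integer() for v in distinct):
            return "discrete"
        return "continuous"
    # Non-numeric: a short list of values suggests a categorical variable.
    if len(distinct) <= max_categories:
        return "categorical"
    return "arbitrary"


print(classify_variable([0, 1, 1, 0]))
print(classify_variable([2, 4, 6, 8, 10]))
print(classify_variable([0.5, 99.9]))
print(classify_variable(["red", "green", "blue"]))
```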

Data Foundations - 20
Types of data. Type of scale

! Properties of scales of measurement:

! Identity. Each value on the measurement scale has a unique meaning.

! Magnitude. Values on the measurement scale have an ordered relationship to one

another. That is, some values are larger and some are smaller.

! Equal intervals. Scale units along the scale are equal to one another. This means,

for example, that the difference between 1 and 2 would be equal to the difference

between 19 and 20. This is also known as a distance metric.

! A minimum value of zero. The scale has a true zero point, below which no values

exist. When a scale has an absolute zero then it makes sense to apply all the

mathematical operations (+, -, *, /).

Data Foundations - 21
Types of data. Type of scale

! Nominal Scale of Measurement:

! Only satisfies the identity property of measurement

! Categorical and Arbitrary(*)

! Ordinal Scale of Measurement:

" Has the property of both identity and magnitude

" Ranked (and all the numeric)

! Interval Scale of Measurement

" Has the properties of identity, magnitude, and equal intervals.

" Discrete. e.g., Fahrenheit (or centigrade) scale to measure temperature

! Ratio Scale of Measurement

" Satisfies identity, magnitude, equal intervals, and a minimum value of zero.

" Continuous. e.g., weight, distance, etc. Can apply operations of / and *.

Data Foundations - 22
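
The four scales on the slide above can be summarized as sets of properties, and an operation is meaningful only when the scale has the property it relies on. The sketch below is one possible encoding; the dictionary names and the choice of which operation needs which property are assumptions for illustration. The temperature example shows why ratios are not meaningful on an interval scale: the ratio changes when the (arbitrary) zero point changes.

```python
SCALES = {
    "nominal": {"identity"},
    "ordinal": {"identity", "magnitude"},
    "interval": {"identity", "magnitude", "equal intervals"},
    "ratio": {"identity", "magnitude", "equal intervals", "true zero"},
}

# Property each operation relies on (an assumed, simplified mapping).
REQUIRED = {
    "==": "identity",
    "<": "magnitude",
    "-": "equal intervals",
    "/": "true zero",
}


def is_meaningful(operation, scale):
    """True if the scale has the property the operation depends on."""
    return REQUIRED[operation] in SCALES[scale]


def celsius_to_kelvin(c):
    return c + 273.15


# 20 °C looks like "twice" 10 °C, but Celsius has no true zero.
ratio_celsius = 20.0 / 10.0
ratio_kelvin = celsius_to_kelvin(20.0) / celsius_to_kelvin(10.0)

print(is_meaningful("/", "interval"))  # False
print(ratio_celsius, round(ratio_kelvin, 3))
```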
Interactive Data Visualization

Structure within and between records

Data Foundations - 23
Data sets structure

! The structure of a data set defines:

! Syntactical rules

! The relationships between the components within a record

! The relationship between records

Data Foundations - 24
Scalar, Vector and Tensor

! Scalar: individual value in a data record.

! e.g.: Age; Color; Weight

! Vector: multiple variables in a single record can represent a single item

! e.g.: Position coordinates (2D or 3D); Color using RGB(Red, Green, Blue)

components, Phone number (Country code, area code and local number), etc.

! each component (of the vector) can be considered individually, but it is usually

more appropriate to treat the vector as a whole.

! Tensor: a tensor is defined by its rank and its dimensionality. A scalar is a tensor of

rank 0; a vector with D components is a tensor of rank 1 and D dimensionality. A

tensor of rank 2 and 3 dimensions can be represented as a Matrix 3 x 3.

More info about tensors -> https://www.youtube.com/watch?v=fu-eMNi_aag
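
To make rank and dimensionality concrete, the sketch below represents tensors as nested Python lists and derives the rank from the nesting depth. The helper names `shape` and `rank` are assumptions for this example, and the code assumes regular nesting (every sub-list at a given level has the same length).

```python
def shape(value):
    """Shape of a regularly nested list; a bare number has shape ()."""
    if not isinstance(value, list):
        return ()
    if not value:
        return (0,)
    return (len(value),) + shape(value[0])


def rank(value):
    """Rank = number of indices needed to reach a single component."""
    return len(shape(value))


scalar = 42.0                      # rank 0
vector = [0.2, 0.5, 0.9]           # rank 1, 3 components (e.g. an RGB color)
matrix = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0]]         # rank 2, 3 dimensions: a 3 x 3 matrix

for t in (scalar, vector, matrix):
    print(rank(t), shape(t))
```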


Data Foundations - 25
Geometry and Grids

! Geometry via explicit coordinates for each record in the data set.

! Data set about fires in Portugal: each fire is associated with the coordinates of

its starting point.

! Data set of temperature readings from sensors: each reading is associated with

the sensor's coordinates.

! Data set describing a 3D world: geometry makes up most of the data.

! Census data set that associates the data with administrative regions.

! Geometric structure is implied, assuming some form of grid. Successive data

records are located at successive positions. This requires specifying the starting

point, the direction, and the step size for each dimension.

! Satellite images.
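
With implied geometry, the coordinates of a record are never stored; they are recomputed from the record's position in the sequence. A minimal sketch, assuming a regular grid stored in row-major order (last dimension varies fastest). The function name and the example values are hypothetical; the sign of each step encodes the direction.

```python
def grid_position(index, shape, origin, step):
    """Coordinates of the record at a flat index in a regular, row-major grid."""
    coords = []
    # Walk the dimensions from fastest-varying (last) to slowest (first).
    for size, start, delta in zip(reversed(shape), reversed(origin), reversed(step)):
        index, offset = divmod(index, size)
        coords.append(start + offset * delta)
    return tuple(reversed(coords))


# Hypothetical 3 x 4 image: rows start at 0.0 with step 0.5,
# columns start at 10.0 with step 0.25.
shape_ = (3, 4)
origin = (0.0, 10.0)
step = (0.5, 0.25)

print(grid_position(0, shape_, origin, step))
print(grid_position(5, shape_, origin, step))
print(grid_position(11, shape_, origin, step))
```

Only the starting point, the steps and the grid shape need to be stored, which is why image-like data such as satellite images can omit per-record coordinates.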

Data Foundations - 26
Other forms of structure

! Time (check http://www.timeviz.net to see the many visualization techniques for time-oriented data)

! Present in many data sets

! Uniformly spaced versus non-uniformly spaced

! Relative versus absolute

! Local versus Universal time

! Seen as linear versus as cyclic

! Topology

! How the records are connected.

! Geometry and space (spatial neighbors)

! Hierarchy and graphs

! This form of structure can be explicitly included in the data record or as an auxiliary data

structure
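
Some of the time distinctions listed on this slide can be checked or computed directly. The sketch below, with hypothetical function names and made-up timestamps, tests whether readings are uniformly spaced and shows the same timestamp under a linear view (relative time elapsed since the first reading) and a cyclic view (angle on a 24-hour circle).

```python
import math
from datetime import datetime


def is_uniformly_spaced(timestamps):
    """True if all consecutive gaps are identical."""
    gaps = {b - a for a, b in zip(timestamps, timestamps[1:])}
    return len(gaps) <= 1


def elapsed_hours(ts, start):
    """Linear, relative view: hours since a chosen starting point."""
    return (ts - start).total_seconds() / 3600


def hour_to_angle(ts):
    """Cyclic view: time of day as an angle on a 24-hour circle, in radians."""
    hours = ts.hour + ts.minute / 60
    return 2 * math.pi * hours / 24


readings = [datetime(2020, 3, 20, h) for h in (0, 6, 12, 18)]

print(is_uniformly_spaced(readings))
print(elapsed_hours(readings[3], readings[0]))
print(hour_to_angle(readings[1]))
```

In the cyclic view, 23:00 and 01:00 end up close together, which a linear axis would not show.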

Data Foundations - 27
Examples

Source: Interactive Data Visualization: Foundations, Techniques, and Applications, Matthew O. Ward, Georges Grinstein, Daniel Keim, 2015

Data Foundations - 28
Interactive Data Visualization

Data
(Tamara Munzner)

Data Foundations - 29
Data Types and Dataset Types

! Data Types

Figure 2.2. The five basic data types: items, attributes, links, positions, and grids.

" An item is an individual entity that is discrete, such as a row in a simple table or a node in a network.

" An attribute is some specific property that can be measured, observed, or logged. For example, attributes could be salary, price, number of sales, protein expression levels, or temperature.

" A link is a relationship between items, typically within a network.

" A position is spatial data, providing a location in two-dimensional (2D) or three-dimensional (3D) space.

" A grid specifies the strategy for sampling continuous data in terms of both geometric and topological relationships between its cells.
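
One way to picture the five data types is as small Python structures. This is only a sketch: the class and field names are assumptions chosen for this example, the attribute values are made up, and the `Grid` class covers only the simplest case of a regular grid.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Item:
    """An individual, discrete entity (a table row or a network node)."""
    key: str
    attributes: dict = field(default_factory=dict)  # attribute name -> value
    position: Optional[tuple] = None                # (x, y) or (x, y, z)


@dataclass
class Link:
    """A relationship between two items, identified by their keys."""
    source: str
    target: str


@dataclass
class Grid:
    """Regular sampling of continuous data: origin, step and cell counts."""
    origin: tuple
    step: tuple
    shape: tuple


# Illustrative values only.
lisbon = Item("Lisbon", {"temperature": 23.9}, position=(38.72, -9.14))
porto = Item("Porto", {"temperature": 20.6}, position=(41.15, -8.61))
link = Link(lisbon.key, porto.key)
grid = Grid(origin=(0.0, 10.0), step=(0.5, 0.25), shape=(3, 4))
```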

Data Foundations - 30
Data Types and Dataset Types

! Dataset Types

" A dataset is any collection of information that is the target of analysis.

Dataset type             Data types
Tables                   Items, Attributes
Networks & Trees         Items (nodes), Links, Attributes
Fields                   Grids, Positions, Attributes
Geometry                 Items, Positions
Clusters, Sets, Lists    Items

Figure 2.3. The four basic dataset types are tables, networks, fields, and geometry; other possible collections of items are clusters, sets, and lists. These datasets are made up of five core data types: items, attributes, links, positions, and grids.

Figure 2.4 shows the internal structure of the four basic dataset types in detail. Tables have cells indexed by items and attributes, for either the simple flat case or the more complex multidimensional case. In a network, items are usually called nodes, and they are connected with links; a special case of networks is trees. Continuous fields have grids based on spatial positions where cells contain attributes. Spatial geometry has only position information.

" Other ways to group items together include clusters, sets, and lists.

" In real-world situations, complex combinations of these basic types are common.

Data Foundations - 31
Data Types and Dataset Types

[Figure 2.4. The detailed structure of the four basic dataset types. Tables: items (rows) and attributes (columns), with a value in each cell; multidimensional tables; networks: nodes (items) joined by links; trees; fields (continuous): a grid of positions whose cells contain attribute values; geometry (spatial): positions.]

Data Foundations - 32


Dataset Types: Table

[Figure 2.4. The detailed structure of the four basic dataset types.]

Figure 2.5. In a simple table of orders, a row represents an item, a column represents an attribute, and their intersection is the cell containing the value for that pairwise combination.

" A multidimensional table has a more complex structure for indexing into a cell, with multiple keys.

! A synonym for networks is graphs. The word graph is also deeply overloaded in vis.

Data Foundations - 33
Data Types and Dataset Types

[Figure 2.4 (detail). The detailed structure of the four basic dataset types: Networks, Trees, Fields (Continuous), Geometry (Spatial).]

! The field dataset type also contains attribute values associated with cells.

! Each cell in a field contains measurements or calculations from a continuous domain.

! Continuous data requires careful treatment that takes into account the mathematical questions of sampling and interpolation.

scientific visualization

Data Foundations - 34
Data Types and Dataset Types

[Figure 2.4 (detail). The detailed structure of the four basic dataset types: Networks, Trees, Fields (Continuous), Geometry (Spatial).]

! The problem of how to create images from a geometric description of a scene falls into another domain: computer graphics.

! Simply showing a geometric dataset is not an interesting problem from the point of view of a vis designer.

Data Foundations - 35
Attribute Types

Attributes

Attribute Types
  Categorical
  Ordered
    Ordinal
    Quantitative

Ordering Direction
  Sequential
  Diverging
  Cyclic

Figure 2.7. Attribute types are categorical, ordinal, or quantitative.

Data Foundations - 36
What? (summary figure, Tamara Munzner)

Datasets
  Data Types: Items, Attributes, Links, Positions, Grids
  Data and Dataset Types:
    Tables: Items, Attributes
    Networks & Trees: Items (nodes), Links, Attributes
    Fields: Grids, Positions, Attributes
    Geometry: Items, Positions
    Clusters, Sets, Lists: Items
  Dataset Types: Tables, Multidimensional Table, Networks, Trees, Fields (Continuous), Geometry (Spatial)
  Dataset Availability: Static, Dynamic

Attributes
  Attribute Types: Categorical; Ordered (Ordinal, Quantitative)
  Ordering Direction: Sequential, Diverging, Cyclic

What? Why? How?
Interactive Data Visualization

Data Preprocessing

Data Foundations - 39
Data Preprocessing

! Metadata

! Basic statistics about the (scalar) data

! Missing Values and Data Cleansing

! Normalization

! Dimension reduction

! Mapping Nominal Dimensions to Numbers

! Other data processing topics

Data Foundations - 40
Metadata

! Sample from the cars data set

! Except for the first column (vehicle name), we need more information to interpret the values!

! With the column names it is much better, but it is still not enough!

Data Foundations - 41
Metadata

! Associated Metadata

+ Extended variable names and their meaning
+ Units used
+ Special values
+ How missing values are denoted

Data Foundations - 42
Metadata

! Metadata provides:

" Source of data

" Information that facilitates the interpretation of the data set

" Units

" Symbol to indicate a missing value

" Reference point for some measurements

" Resolution at which the measurements were acquired

Data Foundations - 43
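A minimal sketch of how such metadata could be kept next to the data and used while reading it. The class and field names, the unit, the missing-value marker "*", and the valid range are all illustrative assumptions; they are not taken from the cars data set itself.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class VariableMetadata:
    """Metadata for one column; all field names here are illustrative."""
    name: str                          # extended variable name
    meaning: str                       # what the variable describes
    unit: str                          # unit of measurement
    missing_marker: str                # symbol that denotes a missing value
    valid_range: Tuple[float, float]   # expected (low, high) range


def parse_value(raw: str, meta: VariableMetadata) -> Optional[float]:
    """Turn a raw cell into a number, or None if it is marked as missing."""
    if raw.strip() == meta.missing_marker:
        return None
    return float(raw)


def in_range(value: float, meta: VariableMetadata) -> bool:
    """Check a parsed value against the range declared in the metadata."""
    low, high = meta.valid_range
    return low <= value <= high


# Hypothetical metadata for the Engine Size column of the cars data set.
engine_size = VariableMetadata(
    name="Engine Size",
    meaning="Engine displacement",
    unit="liters",           # assumed unit
    missing_marker="*",      # assumed marker
    valid_range=(0.5, 9.0),  # assumed plausible range
)

parsed = [parse_value(cell, engine_size) for cell in ["3.5", "*", "2.0"]]
```

Without the declared marker and range, the "*" cell and an implausible value could not be told apart from ordinary data.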
Basic statistics about the (scalar) data

! For simple data types (scalars)

! All data types

" Number of missing values

! Excluding arbitrary non-numeric data (names, addresses, etc.)

" Number of values out of range (if the range of variable is provided)

! For non-continuous values

" Frequency distribution

" Mode

! For numeric variables

" Mean, Variance, etc.

Data Foundations - 44
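The statistics listed on this slide can be sketched with the Python standard library. The sample values, the use of None as the missing-value marker, and the valid range are assumptions made only for illustration.

```python
from collections import Counter
from statistics import mean, pvariance

MISSING = None  # assumed marker for a missing value


def summarize_categorical(values):
    """Missing count, frequency distribution, and mode of a non-continuous variable."""
    present = [v for v in values if v is not MISSING]
    freq = Counter(present)
    mode = freq.most_common(1)[0][0] if freq else None
    return {
        "missing": len(values) - len(present),
        "frequency": dict(freq),
        "mode": mode,
    }


def summarize_numeric(values, low, high):
    """Missing count, out-of-range count, mean, and variance of a numeric variable."""
    present = [v for v in values if v is not MISSING]
    return {
        "missing": len(values) - len(present),
        "out_of_range": sum(1 for v in present if not low <= v <= high),
        "mean": mean(present),          # computed over all present values
        "variance": pvariance(present),  # population variance
    }


# Hypothetical samples of the Class and Engine Size variables.
car_class = ["SUV", "Sedan", "Sedan", None, "Wagon", "Sedan"]
engine_size = [3.5, 2.0, None, 1.8, 42.0, 2.4]

class_stats = summarize_categorical(car_class)
engine_stats = summarize_numeric(engine_size, low=0.5, high=9.0)
```

Note that the out-of-range value 42.0 still enters the mean and the variance here; whether it should is part of the cleansing strategy discussed later.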
Basic statistics about the (scalar) data

! Categorical variable (from Cars data set): Class

Stats:
- mode
- domain cardinality

[Charts: absolute frequency distribution; relative frequency distribution]

Data Foundations - 45
Basic statistics about the (scalar) data

! Numeric (continuous) variable (from Cars data set): Engine Size

Data Foundations - 46
Statistics techniques for getting additional insights

! Outlier detection

! “In statistics, an outlier is an observation point that is distant from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set.”
https://en.wikipedia.org/wiki/Outlier
https://www.siam.org/meetings/sdm10/tutorial3.pdf

! Cluster Analysis

! Can help segment the data into groups with strong similarities.
https://en.wikipedia.org/wiki/Cluster_analysis

! Correlation Analysis

! Can help users eliminate variables that are redundant, or highlight associations between variables.
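
Two of these techniques can be sketched in a few lines of standard-library Python: a z-score rule for flagging outliers and the Pearson coefficient for correlation. The z-score rule is only one simple choice among many outlier detectors, and the data and thresholds below are made up for illustration.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Values whose distance from the mean exceeds `threshold` standard deviations."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Hypothetical measurements with one suspicious value.
city_mpg = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
# In a sample this small a single point cannot reach 3 standard
# deviations, so a lower threshold is used here.
outliers = zscore_outliers(city_mpg, threshold=2.5)

# Two hypothetical variables that carry the same information.
engine_size = [1.0, 2.0, 3.0, 4.0]
cylinders = [2.0, 4.0, 6.0, 8.0]
redundancy = pearson(engine_size, cylinders)
```

A coefficient close to 1 or -1 suggests that one of the two variables could be dropped before visualization.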

Data Foundations - 47
Statistics techniques for getting additional insights

! Correlation Analysis

Data Foundations - 48
Missing Values and Data Cleansing

! Missing data:

! Causes: malfunctioning sensor; blank entry on a survey; omission by the person entering the data; etc.

! It is necessary to define a strategy to deal with missing data. The strategy should depend on the application domain, the number of missing values, and the quality of the other variables.

! Erroneous data

! Causes: human error; malfunctioning sensor; etc.

! May be very hard to detect unless the values are out of range or are obvious outliers.

Data Foundations - 49
Missing Values

! Discard the bad record

! This is the most commonly applied strategy. It implies a loss of information that should be evaluated. Sometimes the records with missing values are the most interesting ones to analyze.

! Assign a sentinel value

! Assign a designated sentinel value for each variable when the real value is in question (missing or erroneous). This value must be handled explicitly in any subsequent processing.

! Assign the average value

! Use the average value for that variable. This keeps the mean of that variable unchanged (although it lowers its variance), but the average may not be a good guess, and it may mask outliers.
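
The three strategies above can be sketched as follows. The record layout, the use of None for a missing value, and the sentinel -1.0 are assumptions chosen for the example.

```python
from statistics import mean

MISSING = None   # assumed representation of a missing value
SENTINEL = -1.0  # assumed to lie outside the valid range of the variable


def discard_records(records, key):
    """Strategy 1: drop every record whose value for `key` is missing."""
    return [r for r in records if r[key] is not MISSING]


def assign_sentinel(records, key, sentinel=SENTINEL):
    """Strategy 2: replace missing values by a designated sentinel."""
    return [{**r, key: sentinel if r[key] is MISSING else r[key]} for r in records]


def assign_average(records, key):
    """Strategy 3: replace missing values by the mean of the present ones."""
    avg = mean(r[key] for r in records if r[key] is not MISSING)
    return [{**r, key: avg if r[key] is MISSING else r[key]} for r in records]


# Hypothetical records from a cars table.
records = [
    {"name": "A", "mpg": 20.0},
    {"name": "B", "mpg": None},
    {"name": "C", "mpg": 30.0},
]

kept = discard_records(records, "mpg")
flagged = assign_sentinel(records, "mpg")
filled = assign_average(records, "mpg")
```

Each function returns new records and leaves the input untouched, so the strategies can be compared side by side on the same data.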

Data Foundations - 50
Missing Values and Data Cleansing

! Assign value based on nearest neighbor

! Estimate the missing value of variable i for a particular record from the value(s) of that variable in the records most similar to it (similarity being measured on the other variables). This assumes that variable i depends on all the other variables, which may not be the case.

! When connectivity information is available (spatial or geo-spatial data, graphs), the nearest neighbors may be determined from the available connections.

! Compute a substitute value

! All the previous methods are ad hoc. More recent statistical approaches propose methods and algorithms that make multiple imputations for the missing values.

! More info: “Multiple imputation for multivariate missing-data problems: a data analyst’s perspective”, by Joseph L. Schafer and Maren K. Olsen
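
A minimal sketch of the nearest-neighbor strategy, assuming records stored as dictionaries, None as the missing marker, and Euclidean distance on feature values that are already on comparable scales. The function name and the sample data are hypothetical. It does not cover multiple imputation.

```python
import math


def impute_nearest(records, target, features, k=1):
    """Fill missing `target` values with the mean target of the k most
    similar complete records, similarity being measured on `features`."""
    complete = [r for r in records if r[target] is not None]
    if not complete:
        raise ValueError("no complete record to borrow a value from")
    result = []
    for r in records:
        filled = dict(r)
        if r[target] is None:
            point = [r[f] for f in features]
            nearest = sorted(
                complete,
                key=lambda c: math.dist(point, [c[f] for f in features]),
            )[:k]
            filled[target] = sum(n[target] for n in nearest) / len(nearest)
        result.append(filled)
    return result


# Hypothetical, already normalized cars records; the last one lacks mpg.
cars = [
    {"weight": 1.0, "engine": 1.0, "mpg": 40.0},
    {"weight": 1.2, "engine": 1.1, "mpg": 38.0},
    {"weight": 3.0, "engine": 3.0, "mpg": 15.0},
    {"weight": 1.1, "engine": 1.0, "mpg": None},
]

imputed = impute_nearest(cars, target="mpg", features=["weight", "engine"], k=2)
```

The two light cars are the nearest neighbors of the incomplete record, so the heavy car has no influence on the estimate.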

Data Foundations - 51
Normalization

! Most normalization methods require a distance metric.

! One purpose is to scale different variables to a comparable range of values.

! Another objective is to redistribute the values if they are concentrated in a small part of the available scale.

! Examples of normalization functions:

• d_{normalized} = \frac{d_{original} - d_{min}}{d_{max} - d_{min}}

• d_{sqrt\_normalized} = \frac{\sqrt{d_{original}} - \sqrt{d_{min}}}{\sqrt{d_{max}} - \sqrt{d_{min}}}

• d_{log\_normalized} = \frac{\log d_{original} - \log d_{min}}{\log d_{max} - \log d_{min}}

• d_{z\text{-}score} = \frac{d_{original} - \mu}{\sigma}

! Replacing Min and Max by the ∂-Quantile and the (1-∂)-Quantile
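
The normalization functions above can be sketched in standard-library Python. The function names are illustrative, the quantile estimate is a deliberately crude one (an element picked from the sorted values), and the inputs are assumed to be non-constant and, for the square root and logarithm, positive.

```python
import math
from statistics import mean, pstdev


def normalize_minmax(values, transform=lambda d: d):
    """(t(d) - t(d_min)) / (t(d_max) - t(d_min)) for a monotonic transform t."""
    t = [transform(d) for d in values]
    lo, hi = min(t), max(t)
    return [(x - lo) / (hi - lo) for x in t]


def normalize_sqrt(values):
    """Min-max normalization applied to the square roots of the values."""
    return normalize_minmax(values, math.sqrt)


def normalize_log(values):
    """Min-max normalization applied to the logarithms of the values."""
    return normalize_minmax(values, math.log)


def z_score(values):
    """(d - mean) / standard deviation, using the population deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(d - mu) / sigma for d in values]


def normalize_quantile(values, delta=0.05):
    """Min-max with min and max replaced by the delta and (1 - delta)
    quantiles; values outside that interval are clamped to 0 or 1."""
    s = sorted(values)
    n = len(s)
    lo = s[round(delta * (n - 1))]
    hi = s[round((1 - delta) * (n - 1))]
    return [min(1.0, max(0.0, (d - lo) / (hi - lo))) for d in values]


linear = normalize_minmax([10, 15, 20, 25, 60])
rooted = normalize_sqrt([4, 9, 16])
logged = normalize_log([1, 10, 100])
standard = z_score([2, 4, 4, 4, 5, 5, 7, 9])
robust = normalize_quantile(list(range(21)), delta=0.05)
```

With the quantile variant, a few extreme values no longer compress the rest of the data into a small part of the scale.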

Data Foundations - 52
Normalization

! Data from 414 cars (from 2004); Variable: City Miles Per Gallon (City MPG)

[Figure: four histograms of City MPG, with counts from 0 to 120. City-MPG shows the original values (12 to 60); City-MPG-Norm, City-MPG-SQRT-Norm, and City-MPG-LOG-Norm show the normalized values (0 to 1).]

Data Foundations - 53
Normalization

! Data from 414 cars (from 2004); Variable: City Miles Per Gallon (City MPG)
[Figure: histogram of City-MPG (values 12 to 60, counts 0 to 120); histogram of City-MPG-Z-Score (counts 0 to 120); line chart "Normalization Maps" plotting the normalized value (0 to 1) against City MPG (10 to 60) for four mappings: Normalize min-max, Normalize SQRT, Normalize LOG, Normalize Percentile.]

Data Foundations - 54
Dimension reduction

! In situations where the dimensionality of the data exceeds the capabilities of

the visualization technique.


Example of Scatter Plot

Bertini DataScience showcase (2014)


Data Foundations - 55
Dimension reduction

! When the dimensionality of the data exceeds the capabilities of the visualization technique, it is necessary to investigate ways to reduce the data dimensionality while preserving, as much as possible, the information contained within.

! Principal Component Analysis (PCA)

! Multidimensional Scaling (MDS)

! Non-linear dimension reduction techniques:

" Self-organizing Maps (SOMs)

" Locally Linear Embedding (LLE)

Data Foundations - 56
Dimension reduction - Principal Component Analysis (PCA)

! PCA computes new dimensions/attributes which are linear combinations of the original data attributes.

! The advantage of the new dimensions is that they can be sorted according to their contribution in explaining the variance of the data.

! By selecting the most relevant new dimensions, a subspace of variables is obtained that minimizes the average error of lost information.
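
A minimal sketch of the idea, restricted to the first principal component and written in plain Python. It uses power iteration on the covariance matrix, which is an assumption made to keep the example free of dependencies; numerical libraries usually rely on an eigendecomposition or SVD instead. The data points are made up.

```python
def center(rows):
    """Subtract the mean of every attribute (column)."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    return [[r[j] - means[j] for j in range(m)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already centered rows."""
    n, m = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(m)]
            for i in range(m)]


def first_component(cov, iterations=200):
    """Dominant eigenvector and eigenvalue of cov, by power iteration."""
    m = len(cov)
    v = [1.0] * m  # start vector; it must not be orthogonal to the result
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    cv = [sum(cov[i][j] * v[j] for j in range(m)) for i in range(m)]
    eigenvalue = sum(v[i] * cv[i] for i in range(m))
    return v, eigenvalue


# Hypothetical two-attribute data lying close to the line y = 2x.
rows = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2], [5.0, 9.8]]
centered = center(rows)
cov = covariance(centered)
direction, variance_explained = first_component(cov)

# Share of the total variance captured by the first new dimension.
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained_ratio = variance_explained / total_variance

# Coordinates of every record along the new dimension.
scores = [sum(r[j] * direction[j] for j in range(len(direction))) for r in centered]
```

Here `direction` holds the weights of the linear combination, and `explained_ratio` is the quantity by which the new dimensions are sorted.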

Data Foundations - 57
Dimension reduction - Principal Component Analysis (PCA)

Iris flower data set

Iris versicolor
Iris setosa

Iris virginica

Data Foundations - 58
Dimension reduction - Principal Component Analysis (PCA)

! Figure 2.4 from Interactive Data Visualization: Foundations, Techniques, and Applications, Matthew O. Ward,
Georges Grinstein, Daniel Keim, 2010
Iris flower data set

4 Variables

2 Variables

Data Foundations - 59
Mapping Nominal Dimensions to Numbers

! How to visualize nominal dimensions? It depends on:

! How many nominal dimensions exist?

! How many distinct values can each variable take on?

! Is an ordering or distance relation available, or can one be derived?

! Warning:

Find a mapping of the data to a graphical entity or attribute that does not introduce artificial relationships that do not exist in the data.

! Ranked nominal values can be mapped to numbers, and so can easily be mapped to many graphical attributes.

! Non-ranked nominal values have to be managed carefully.

Data Foundations - 60
Mapping Nominal Dimensions to Numbers

! Non-ranked nominal values have to be managed carefully

! Variables with only a modest number of different values:

• Map to graphical attributes like color or shape.

! A single nominal variable:

• Use this variable as the label for the graphical elements being displayed, when the number of records to be displayed is modest.

• Otherwise, show random subsets of labels, changing the labeled points on a regular basis, and show labels only on objects near the cursor.
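
A small sketch of the two safe cases: ranked values mapped to their rank, and non-ranked values with a modest number of distinct values mapped to distinct colors. The palette (arbitrary hex colors), the function names, and the sample values are assumptions made for illustration.

```python
# Arbitrary, visually distinct hex colors; any qualitative palette would do.
PALETTE = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", "#8c564b"]


def rank_map(ordered_levels):
    """Ranked nominal values: map each level to its position in the ranking."""
    return {level: rank for rank, level in enumerate(ordered_levels)}


def color_map(values, palette=PALETTE):
    """Non-ranked nominal values: one distinct color per distinct value.

    Colors carry no order or distance, so no artificial relationship is
    introduced. Refuses variables with more values than colors."""
    distinct = sorted(set(values))
    if len(distinct) > len(palette):
        raise ValueError("too many distinct values for a color mapping")
    return dict(zip(distinct, palette))


sizes = rank_map(["small", "medium", "large"])           # ranked
colors = color_map(["SUV", "Sedan", "SUV", "Wagon"])     # non-ranked
```

Mapping the non-ranked values to numbers such as 0, 1, 2 instead would suggest that one class lies "between" the other two, which is exactly the artificial relationship the warning is about.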

Data Foundations - 61
Mapping Nominal Dimensions to Numbers

! Mapping to numbers by looking at similarities between the numeric variables associated with a pair of nominal values

! If the statistical properties of the records associated with one nominal value are sufficiently similar to those of a different value, then these two values should likely be mapped to similar numeric values.

! Conversely, if the properties differ sufficiently, the two values should likely be mapped to quite distinct numbers.

! Given all the pairwise similarities, we could use correspondence analysis to map the different nominal values to positions in one dimension. Applying this to all nominal dimensions of the data set is called multiple correspondence analysis.

See more: https://en.wikipedia.org/wiki/Correspondence_analysis
http://www.mathematica-journal.com/2010/09/an-introduction-to-correspondence-analysis/

Data Foundations - 62
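The idea of placing nominal values according to the numeric data associated with them can be sketched in a few lines. Note that this is NOT correspondence analysis: it is a much simpler stand-in that positions each nominal value at the mean of a single numeric variable, rescaled to [0, 1]. The function name `nominal_to_numeric` and the `cars` records are hypothetical and used only for illustration.

```python
from collections import defaultdict
from statistics import mean


def nominal_to_numeric(records, nominal_key, numeric_key):
    """Map each nominal value to the mean of one numeric variable, rescaled to [0, 1].

    Simplified stand-in for correspondence analysis (assumption, not the
    method from the slides): values with similar means get similar numbers.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec[nominal_key]].append(rec[numeric_key])
    means = {value: mean(nums) for value, nums in groups.items()}
    lo, hi = min(means.values()), max(means.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all means coincide
    return {value: (m - lo) / span for value, m in means.items()}


# Hypothetical toy records: vehicle type (nominal) and fuel economy (numeric).
cars = [
    {"type": "sedan", "mpg": 30}, {"type": "sedan", "mpg": 32},
    {"type": "wagon", "mpg": 29}, {"type": "wagon", "mpg": 31},
    {"type": "truck", "mpg": 15}, {"type": "truck", "mpg": 17},
]
positions = nominal_to_numeric(cars, "type", "mpg")
```

In this toy data, sedans and wagons have similar fuel economy and end up close together, while trucks are mapped far away from both.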
Interactive Data Visualization

Other data processing topics

Data Foundations - 63
Segmentation

! In many situations, the data can be separated into contiguous regions, where each region corresponds to a particular classification of the data.

! Simple segmentation can be performed by mapping disjoint ranges of the data values to specific categories.

! It is often important to also look at the classification of neighboring points to improve the confidence of the classification, or even to perform a probabilistic segmentation, where each data point is assigned a probability of belonging to each of the available classes.

! Segmentation is common for image data and geospatial data (e.g., satellite images).

Data Foundations - 64
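A minimal sketch of the two steps described above, for a 1D sequence of samples: first map disjoint value ranges to categories, then use the neighboring points to clean up isolated misclassifications. The function names, thresholds, and labels are assumptions made for this example; the majority vote is one simple way of using neighbors, not the only one.

```python
from bisect import bisect_right
from collections import Counter


def segment_by_range(values, thresholds, labels):
    """Assign each value the label of the disjoint range it falls in.

    thresholds must be sorted; labels needs len(thresholds) + 1 entries.
    """
    return [labels[bisect_right(thresholds, v)] for v in values]


def smooth_labels(labels, radius=1):
    """Relabel each point with the majority label of its neighborhood.

    On a tie, the point keeps its original label.
    """
    smoothed = []
    for i in range(len(labels)):
        window = labels[max(0, i - radius): i + radius + 1]
        counts = Counter(window)
        best, best_n = counts.most_common(1)[0]
        smoothed.append(labels[i] if counts[labels[i]] == best_n else best)
    return smoothed


# Hypothetical elevation samples along a transect.
elevations = [2, 3, 12, 4, 40, 45, 50]
raw = segment_by_range(elevations, thresholds=[10, 30],
                       labels=["water", "land", "mountain"])
clean = smooth_labels(raw)
```

Here the single "land" sample surrounded by "water" is treated as noise and relabeled, which yields two contiguous regions.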
Sampling and subsetting

! Resampling transforms a data set with one spatial resolution into another data set with a different spatial resolution. For example, we might have an image we would like to shrink or expand, or we might have only a small sampling of data points and wish to fill in values for locations between our samples (assuming that the data is a discrete sampling of a continuous phenomenon).

! Interpolation is a commonly used resampling method in many fields, including visualization:

! Linear interpolation

! Bilinear interpolation

! Nonlinear interpolation

Data Foundations - 66
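Linear and bilinear interpolation can be sketched as follows. The helper names `lerp` and `bilinear` are chosen for this example, and the code assumes the sample position lies inside the grid (0 <= x <= width - 1, 0 <= y <= height - 1).

```python
def lerp(a, b, t):
    """Linear interpolation between a and b for t in [0, 1]."""
    return a + (b - a) * t


def bilinear(grid, x, y):
    """Sample a 2D grid (list of rows) at fractional column x and row y.

    Interpolates along x on the two surrounding rows, then along y.
    """
    x0, y0 = int(x), int(y)
    x1 = min(x0 + 1, len(grid[0]) - 1)  # clamp at the right edge
    y1 = min(y0 + 1, len(grid) - 1)     # clamp at the bottom edge
    tx, ty = x - x0, y - y0
    top = lerp(grid[y0][x0], grid[y0][x1], tx)
    bottom = lerp(grid[y1][x0], grid[y1][x1], tx)
    return lerp(top, bottom, ty)


grid = [[0.0, 10.0],
        [20.0, 30.0]]
quarter = lerp(0.0, 10.0, 0.25)    # a quarter of the way from 0 to 10
center = bilinear(grid, 0.5, 0.5)  # value at the middle of the four samples
```

Expanding an image amounts to evaluating `bilinear` at the (fractional) source position of every pixel of the larger target image.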
Sampling and subsetting

! Data subsetting is also a frequently used operation, both prior to and during visualization.

! This is especially helpful for very large data sets, as visualizing the entire data set may lead to substantial visual clutter.

! Subsetting can be done by querying before visualization or interactively during visualization.

Data Foundations - 67
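As a rough illustration of subsetting before visualization, the sketch below first applies a query (a filter predicate) and then draws a reproducible random sample of the result. The function names, the record layout, and the price limit are all assumptions made for this example.

```python
import random


def query(records, predicate):
    """Keep only the records that satisfy the predicate."""
    return [r for r in records if predicate(r)]


def sample(records, k, seed=0):
    """Draw up to k records at random; the seed makes the subset reproducible."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


# Hypothetical data set of 100 records.
records = [{"id": i, "price": 1000 * i} for i in range(1, 101)]
affordable = query(records, lambda r: r["price"] <= 50000)
shown = sample(affordable, 10)
```

Subsetting during visualization would apply the same kind of filter in response to user interaction (for example, brushing or a range slider) instead of ahead of time.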
Aggregation and Summarization

! It is often useful to group data points based on their similarity in value and/or position and to represent each group by some smaller amount of data:

! Data clustering methods

" See more:

− https://en.wikipedia.org/wiki/Cluster_analysis

− http://www.ise.bgu.ac.il/faculty/liorr/hbchap15.pdf

! Displaying the clusters (or their representations)

" Provide sufficient information for the user to decide whether to drill down into the data.

Data Foundations - 68
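A minimal sketch of aggregation by position: points are grouped into square grid cells and each group is represented by its count and centroid. Grid binning is used here as a simple stand-in for a real clustering method; the function name `aggregate_by_cell` and the sample points are hypothetical.

```python
from collections import defaultdict


def aggregate_by_cell(points, cell_size):
    """Group 2D points by grid cell and summarize each group.

    Returns {cell: {"count": n, "centroid": (cx, cy)}}.
    """
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    summary = {}
    for cell, members in cells.items():
        n = len(members)
        summary[cell] = {
            "count": n,
            "centroid": (sum(p[0] for p in members) / n,
                         sum(p[1] for p in members) / n),
        }
    return summary


points = [(1, 1), (2, 3), (3, 2), (11, 12), (13, 14)]
summary = aggregate_by_cell(points, cell_size=10)
```

Displaying one glyph per cell, sized by `count`, gives the user a hint about where a drill-down would reveal more points.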
Smoothing and Filtering

! In statistics and image processing, to smooth a data set is to create an approximating function that attempts to capture important patterns in the data, while leaving out noise or other fine-scale structures/rapid phenomena.

! In smoothing, the data points of a signal are modified so that individual points that are higher than the adjacent points (presumably because of noise) are reduced, and points that are lower than the adjacent points are increased, leading to a smoother signal.

! See more:

! https://en.wikipedia.org/wiki/Smoothing

Data Foundations - 70
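One of the simplest smoothing filters is the moving average, sketched below. It is only one of many possible smoothing methods; the function name and the choice to shrink the window at the borders are assumptions made for this example.

```python
def moving_average(signal, radius=1):
    """Replace each point by the mean of its neighborhood.

    The window shrinks at the borders so the output has the same length.
    """
    smoothed = []
    for i in range(len(signal)):
        window = signal[max(0, i - radius): i + radius + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


noisy = [1, 1, 10, 1, 1]   # a flat signal with one spike
smooth = moving_average(noisy)
```

The spike is reduced and its lower neighbors are raised, which is the behavior described in the second bullet above. A larger `radius` smooths more strongly but also blurs genuine fine-scale structure.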
Raster to vector conversion

! In computer graphics, the usual direction is from vector to raster:

! Vector data (vertices, edges, and triangular or quadrilateral patches) => image (pixel-based)

! The reverse conversion, from raster to vector, can also be important for:

" Compressing the contents for transmission

" Comparing the contents of two or more images

" Transforming the data

" Segmenting the data

! Read more: Interactive Data Visualization: Foundations, Techniques, and Applications, pp. 72-74

Data Foundations - 71
Interactive Data Visualization

Further Reading and Summary

Data Foundations - 72
Further Reading
! Recommended readings:

" pp. 51-76 of Interactive Data Visualization: Foundations, Techniques, and Applications

" pp. 30-40 of Visualization Analysis & Design, Tamara Munzner

! Supplemental readings:

" https://en.wikipedia.org/wiki/Outlier

" https://en.wikipedia.org/wiki/Cluster_analysis

" https://en.wikipedia.org/wiki/Correspondence_analysis

Further Reading and Summary Data Foundations - 73


What you should know
! The concept of a variable (or dimension) and the difference between independent and dependent variables.

" Grokking the data => making decisions

! The various data type taxonomies and the impact of a data type on visualization.

" Numeric vs. non-numeric; ordered vs. non-ordered; types of scale

! The structural aspects of a data set.

" Tables, links, positions, grids, etc.

! Data preprocessing techniques: the goal of each one and the most important ones.

" Outlier detection and handling; normalization; dimensionality reduction; sampling and subsetting; aggregation and summarization

Data Foundations - 74
Recommended Actions
! Install Tableau (desktop version) and activate it with a student license.

! http://www.tableau.com/academic/students

! To get an overview of Tableau, watch the video:

! http://www.tableau.com/learn/tutorials/on-demand/getting-started

! Get familiar with the 2004 Cars and Trucks data set:

! http://www.idvbook.com/teaching-aid/teaching-aid/data-sets/2004-cars-and-trucks-data/

Data Foundations - 75
