Unit 4 - Data Visualization
Introduction to data visualization
Visualization is the graphical representation of data; it can make information easy to
analyze and understand. Data visualization has the power to illustrate complex data
relationships and patterns with the help of simple designs consisting of lines, shapes, and
colors.
Enterprises constantly seek out the latest visual analytics tools so that they can present
their information through simple, precise, illustrative, and attractive diagrams.
Data visualization is considered an evolving blend of art and science that has brought
revolutionary change to the corporate sector and will continue to do so over the next
few years. It plays a major role in decision making in the analytics world.
Question: What are the uses of data visualization?
To summarize, data visualization can help in:
1. Identify outliers in data: An outlier is a data point that lies far away from the other
related data points. Outliers may occur for several reasons, such as measurement error,
data entry error, experimental error, intentional inclusion of outliers, sampling error, or
natural occurrence.
For data analysis, outliers should be excluded from the dataset as far as possible, as they
may mislead the analysis process, producing markedly different, incorrect results and
longer training times. With the help of data visualization, outliers can be easily detected
and removed before further analysis (a small plotting sketch follows this list).
2. Improve response time: Data visualization gives a quick overview of the entire dataset
and, in turn, allows analysts or scientists to quickly identify issues, thereby improving
response time. This is in contrast to huge chunks of information displayed in textual or
tabular format covering multiple lines or records.
3. Greater simplicity: Data, when displayed graphically in a concise format, allows analysts
or scientists to examine the picture easily. The data to be concentrated on gets simplified,
as analysts or scientists interact only with the relevant data.
4. Easier visualization of patterns: Data presented graphically permits analysts to
effortlessly understand the content and identify new patterns and trends that are otherwise
almost impossible to spot. Trend analysis and time-series analysis are in huge demand in
the market for the continuous study of trends in the stock market, companies, or business
sectors.
5. Business analysis made easy: Business analysts can address important decision-making
tasks such as sales prediction, product promotion, and customer behaviour through the use
of correct data visualization techniques.
6. Enhanced collaboration: Advanced visualization tools make it easier for teams to go
through reports collaboratively for instant decision-making.
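As a quick illustration of point 1, here is a minimal sketch of spotting an outlier with a
box plot. The DataFrame and its 'sales' column are hypothetical, and seaborn is just one
reasonable library choice:

# A minimal sketch: spotting an outlier with a box plot
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical data: one extreme value (980) sits far from the rest
df = pd.DataFrame({'sales': [52, 48, 55, 60, 47, 51, 980, 49, 53, 58]})

sns.boxplot(x=df['sales'])  # the outlier appears as a lone point beyond the whisker
plt.show()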
The findings in a visualization graph may be subtle, yet they can have a profound impact
on how easily a data analyst interprets the information. The most challenging part,
however, is learning how data visualization works and which visualization tool best serves
the purpose of analyzing a given piece of information.
What is data visualization?
“A picture speaks a thousand words.”
Similarly, an infographic or visual can help us analyze data and uncover hidden patterns
much more easily.
Why visualize data?
Data visualization is a way to tell a story with your data. When data is complex and
understanding the micro-details is essential, the best way is to analyze it through
visuals.
Question: What are the two purposes of data visualization?
Visuals can be used for two purposes:
1. Exploratory data analysis: This is used by data analysts, statisticians, and data
scientists to better understand data. As the name suggests, it is used to explore hidden
trends and patterns in data.
2. Explanatory data analysis: Once analysts understand the data and find their results, the
best way to convey their ideas and findings is through visuals! This is used to craft a
story that will appeal to the viewer, offering deeper insights.
Consider, for example, a histogram of patient age for the Haberman survival dataset, split
by class (survivors vs. non-survivors). Some observations:
1. A large portion, roughly from 30 to 80 years, overlaps between the two classes.
2. People in the age group 20-40 are more likely to survive, those aged 40-60 are more
likely not to survive, the 60-80 group has roughly equal chances of survival and death,
and those over 80 have a higher chance of not surviving.
3. Age alone cannot tell us whether a person will survive.
Box-Plot
Box plots show the distribution of data and help us scan for outliers. Notice that the
survivors have fewer nodes than those who did not survive. Interesting, isn't it? Also
notice that, even though the number of nodes is the more useful feature, there is still
some overlap between the two classes.
Scatter-Plot
We see from the scattered points that, irrespective of the year, patients with 0 nodes have
tended to be survivors. Does this mean that 0 nodes ensures survival? See the violin plot!
Violin-Plot
From the plot above, we see that there are non-survivors with 0 nodes! Violin plots let us
view the distribution and the box plot in one visual. Useful, isn't it? There is so much we
can learn from visuals. Visualize to understand. Visualize to explain your understanding. I
have compiled a few tips and tools to get you started.
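For reference, here is a minimal sketch of how the three plots above could be produced with
seaborn. The column names (age, year, nodes, status) and the local haberman.csv file are
assumptions based on the standard Haberman survival dataset:

# Sketch of the three plot types discussed above (Haberman survival data assumed)
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

cols = ['age', 'year', 'nodes', 'status']  # status: 1 = survived, 2 = did not
haberman = pd.read_csv('haberman.csv', names=cols)  # hypothetical local file

# Box plot: distribution of nodes per class, outliers drawn as points
sns.boxplot(x='status', y='nodes', data=haberman)
plt.show()

# Scatter plot: nodes vs. year of operation, colored by class
sns.scatterplot(x='year', y='nodes', hue='status', data=haberman)
plt.show()

# Violin plot: distribution and box plot combined in one visual
sns.violinplot(x='status', y='nodes', data=haberman)
plt.show()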
Charts (continued):
Parallel Coordinates Plot
Point & Figure Chart
Population Pyramid
Radar Chart
Radial Bar Chart
Radial Column Chart
Scatterplot
Span Chart
Spiral Plot
Stacked Area Graph
Stacked Bar Graph
Stream Graph
Violin Plot

Diagrams:
Arc Diagram
Brainstorm
Chord Diagram
Flow Chart
Illustration Diagram
Network Diagram
Non-ribbon Chord Diagram
Sankey Diagram
Timeline
Tree Diagram
Venn Diagram

Tables:
Calendar
Gantt Chart
Heatmap
Stem & Leaf Plot
Tally Chart
Time Table

Other:
Circle Packing
Donut Chart
Dot Matrix Chart
Nightingale Rose Chart
Parallel Sets
Pictogram Chart
Pie Chart
Proportional Area Chart
Sunburst Diagram
Treemap
Word Cloud

Maps/Geographical:
Bubble Map
Choropleth Map
Connection Map
Dot Map
Flow Map
Question: What is visual encoding?
Visual encoding is the approach used to map data into visual structures, thereby building
an image on the screen.
There are two types of visual encoding variables: planar and retinal.
Humans are sensitive to the retinal variables: they easily differentiate between various
colors, shapes, sizes, and other properties.
Retinal variables were introduced by Bertin about 40 years ago, and the concept has become
quite popular recently. While there is some critique of the effectiveness of retinal
variables, most specialists find them useful.
Question: What is the important consideration to be made with visual encoding? What do
attribute values signify?
This consideration matters because one visualization can be more effective than another
simply because the information it conveys is perceived more easily. The attribute values
signify important data characteristics, such as numerical data, categorical data, or
ordinal data. Spatiotemporal data contains special attributes such as geographical location
(spatial dimension) and/or time (temporal dimension).
Question: What is the visualization graph supposed to display? List out with the help of a
diagram.
The main question for visual encoding is:
'What do we want to portray with the given data?'
Figure 4.1 illustrates the various ideas a visualization graph may be intended to convey,
which in turn determines the visualization tool used. While simple data comparisons can be
made with a bar chart or column chart, data composition can be expressed with the help of a
pie chart or stacked column chart.
The use of an appropriate visualization graph is a challenging task and should be considered
an important factor for data analysis in data science.
Table 4.1 gives a basic idea of which visualization graph suits which role of the data
provided in a dataset:
Table 4.1: Role of data visualization and its corresponding visualization tool
2. Color hue plays an important role in data visualization: for instance, red signifies
something alarming, blue signifies something serene and peaceful, while yellow signifies
something bright and attractive.
3. Shape, such as circle, oval, diamond, and rectangle, may signify different types of data
and is easily recognized by the eye for its distinctive look.
4. Orientation, such as vertical, horizontal, and slanted, helps signify data trends such
as an upward or a downward trend.
5. Color saturation decides the intensity of color and can be used to differentiate visual
elements from their surroundings by displaying different scales of values.
6. Length decides the proportion of data and is a good visualization parameter for
comparing data of varying values.
7. Angles provide a sense of proportion, and this characteristic can help data analysts or
data scientists make better data comparisons.
8. Texture shows differentiation among data and is mainly used for data comparisons.
Visualization tools should be chosen based on the type of data we need to represent in the
visualization graph. While varying shapes can be used to represent nominal data, various
shadings of a particular color can be used to map data that has a particular ranking or
order (as in the case of ordinal data). For numerical data, such as interval or ratio data,
a change in position or length in the graph best signifies the values or patterns of the
data.
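To make this concrete, here is a small illustrative sketch (not from the original text)
that maps nominal categories to marker shape, ordinal rank to color shading, and
quantitative values to position, using matplotlib:

# Choosing channels by data type: nominal -> shape, ordinal -> shade, quantitative -> position
import matplotlib.pyplot as plt

categories = ['Bugs', 'Stories', 'Tasks']   # nominal: encoded by marker shape
markers = ['o', 's', '^']                   # one shape per category
shades = ['#cccccc', '#888888', '#333333']  # ordinal rank: light -> dark
quantities = [3, 7, 5]                      # quantitative: encoded by position

for x, (q, m, s) in enumerate(zip(quantities, markers, shades)):
    plt.scatter(x, q, marker=m, color=s, s=200)
plt.xticks(range(len(categories)), categories)
plt.ylabel('Quantity')
plt.show()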
These cues are not created equal, however. In the mid-1980s, statisticians William
Cleveland and Robert McGill ran experiments with human volunteers, measuring how accurately
they were able to perceive the quantitative information encoded by different cues. They
found, from most to least accurately perceived: position along a common scale; position
along non-aligned scales; length, direction, and angle; area; volume and curvature; and
finally shading and color saturation.
This perceptual hierarchy of visual cues is important. When making comparisons with
continuous variables, aim to use cues near the top of the scale wherever possible.
Data types
There are three basic types of data: something you can count, something you can order, and
something you can just differentiate. As is often the case, these boil down to three
unintuitive terms: quantitative, ordered, and categorical.
X and Y
Planar variables are known to everybody. If you've studied maths, you've been drawing
graphs across the X- and Y-axis.
Planar variables work for any data type. They work great to present any quantitative data.
It's a pity that we have to deal with flat screens and just two planar variables. Well, we
can try to use the Z-axis, but 3D charts look horrible on screen in 95.8% of cases.
So what should we do then to present three or more variables? We can use the retinal
variables!
Size
We know that size does matter. You can see the difference right away: small is innocuous,
large is perhaps dangerous. Size is a good visualizer for quantitative data.
Texture
Texture is less common. You can't touch it on screen, and it's usually less catchy than color.
So, in theory texture can be used for soft encoding, but in practice it's better to pass on it.
Shape
Round circles ○, edgy stars ☆, solid rectangles █. We can easily distinguish dozens of
shapes. They do work well sometimes for the visual encoding of categories.
Orientation
Orientation is tricky.
While we can clearly tell vertical from horizontal lines, orientation is harder to use
properly for visual encoding.
Color Value
Any color value can be moved over a scale; greyscale is a good example. While we can't say
precisely how much lighter #999 is than #888, it's still a helpful technique for
visualizing ordered data.
Color Hue
Red color is alarming. Green color is calm. Blue color is peaceful. Colors are great to
separate categories.
Color is the most interesting variable, let's dig into some details here. There are three
different scales that we can use with color. We've already mentioned two of them: the
categorical scale (color hue) and the sequential scale (color value).
The diverging scale is somewhat newer. It encodes positive and negative values, e.g.
temperatures in the range of -50 to +50 °C. It would be a mistake to use any other color
scale for that.
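As a sketch, the three scales can be previewed with matplotlib's built-in colormaps; the
colormap names (tab10, Greys, RdBu) are one reasonable choice, not prescribed by the text:

# Previewing the three color scales with built-in matplotlib colormaps
import numpy as np
import matplotlib.pyplot as plt

gradient = np.linspace(0, 1, 256).reshape(1, -1)
scales = {'Categorical (hue)': 'tab10',
          'Sequential (value)': 'Greys',
          'Diverging (+/-)': 'RdBu'}

fig, axes = plt.subplots(len(scales), 1)
for ax, (label, cmap) in zip(axes, scales.items()):
    ax.imshow(gradient, aspect='auto', cmap=cmap)  # draw the colormap as a strip
    ax.set_title(label)
    ax.set_axis_off()
plt.tight_layout()
plt.show()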
The general rule of thumb is that you can use no more than a dozen colors to encode
categories effectively. With more than that, it becomes hard to differentiate between
categories quickly. (A palette of the most commonly used category colors was illustrated
here.)
“Avoiding catastrophe becomes the first principle in bringing color to information: Above
all, do no harm.”—Tufte
The next obvious question is:
Question: How to Apply the Retinal Variables to Data?
It is quite clear that we can't use every variable for every data type. For example, it is
wrong to use color to represent numbers (1, 2, 3).
And it is bad to use size to represent various currencies (€, £, ¥): why on Earth should
small circles stand for euros and large circles for pounds?
Here's the retinal variables usage summary (reconstructed from the discussion above):

Retinal variable | Categorical | Ordered | Quantitative
Size | | yes | yes
Texture | weak | |
Shape | yes | |
Orientation | limited | |
Color value | | yes |
Color hue | yes | |

Note that planar variables can be applied to all the data types.
Indeed, we can use the X-axis for categories, ordered variables or numbers.
OK, now let's try some techniques to visualize real data. The sample data is very simple;
we just want to visualize the quantity of items:
We have just two variables: Item Types (categorical) and Items Quantity (well,
quantitative). All the possible choices are based on the table above:
In theory, you can mix these variables as you wish. I'm going to try four combinations.
Shape + Value
Hmm, looks like a puzzle. Value doesn't seem to work for quantitative data. Let's try
something else!
Color + Size
Well, slightly better. The color coding works for entity types; for example, in
TargetProcess we've got green Features, red Bugs, and blue User Stories. Still not very
good.
A very simple rule in visualization is to never map scalar data to circle radii. Humans do
better at comparing relative areas, so if you want to map data to a shape, you have to map
it to its area.
Texture + Y
Almost great. But why do we need this legend for the texture? Can we just remove it? Yes!
Let's use the X and Y planar variables.
X+Y
Now we have the best result! It turns out that X+Y works great for a simple data set with
just two variables, so there's no need to use retinal variables at all.
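For instance, a plain bar chart realizes this X+Y encoding; the item types and quantities
below are illustrative, since the sample table itself is not reproduced here:

# X + Y alone for the two-variable case: a plain bar chart
import matplotlib.pyplot as plt

item_types = ['Features', 'Bugs', 'User Stories']  # X: categories
quantities = [12, 30, 18]                          # Y: quantities

plt.bar(item_types, quantities)
plt.ylabel('Items Quantity')
plt.show()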
Retinal variables should be used if you need to present three or more variables.
Three is quite trivial, so we'll take four variables. Say we have bugs, user stories, and
features, and we want to visualize some properties of these entities:
Types
Priority
Average Effort in Points
Average Cycle Time in Days
Type | Priority | Average Effort | Average Cycle Time
Features | Must Have | 30 | 40
Features | Good | 20 | 40
Features | Nice to Have | 15 | 20
Bugs | Fix ASAP | 2 | 2
Bugs | Fix | 2 | 8
Bugs | Fix if Time | 5 | 12
User Stories | Must Have | 8 | 10
User Stories | Good | 5 | 7
User Stories | Nice to Have | 8 | 7
We need to pick encodings for the four variables. Surely there are other choices, but
here's what I've selected:
Now it's easy to draw the chart. The important bugs are shown in deep red, the unimportant
ones in light red. The same pattern applies to features and user stories.
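Here is one plausible way to sketch such a chart in matplotlib. The exact channel
assignment (X = priority, Y = average cycle time, size = average effort, hue = entity type,
saturation = priority) is an assumption, not necessarily the original chart's mapping:

# Four variables at once: position, size, color hue, and color saturation
import matplotlib.pyplot as plt

rows = [  # (type, priority rank, avg effort, avg cycle time) from the table above
    ('Features', 0, 30, 40), ('Features', 1, 20, 40), ('Features', 2, 15, 20),
    ('Bugs', 0, 2, 2), ('Bugs', 1, 2, 8), ('Bugs', 2, 5, 12),
    ('User Stories', 0, 8, 10), ('User Stories', 1, 5, 7), ('User Stories', 2, 8, 7),
]
hue = {'Features': 'green', 'Bugs': 'red', 'User Stories': 'blue'}

for kind, prio, effort, cycle in rows:
    # higher priority -> more saturated color (alpha fades as priority drops)
    plt.scatter(prio, cycle, s=effort * 30, color=hue[kind], alpha=1.0 - 0.3 * prio)
plt.xticks([0, 1, 2], ['Must Have / Fix ASAP', 'Good / Fix', 'Nice to Have / Fix if Time'])
plt.ylabel('Average Cycle Time (days)')
plt.show()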
What can we say about this chart? Here are some useful observations:
- Bugs are usually smaller than user stories, and features are the largest entities.
- Important bugs are small and get fixed quickly.
- Important features are the largest, and it takes more time to release them (interesting
information, by the way!).
- Unimportant bugs are the largest, and it takes longer to fix them.
- There's a good correlation between effort and cycle time: it takes more time to deliver
large entities.
Of course, you can get the same info from the plain table above, but the chart is much more
fun to explore.
Mapping variables to encodings
Ordinal Encoding
We use this encoding technique when the categorical feature has a natural order (for
example, education levels); each label is mapped to an integer that preserves that order.

# Import the libraries
import category_encoders as ce
import pandas as pd

# Hypothetical training data with an ordered 'Degree' column
train_df = pd.DataFrame({'Degree': ['High school', 'Masters', 'Diploma', 'Bachelors', 'phd', 'None']})

# Create object of OrdinalEncoder with an explicit order-preserving mapping
encoder = ce.OrdinalEncoder(cols=['Degree'], return_df=True,
    mapping=[{'col': 'Degree',
              'mapping': {'None': 0, 'High school': 1, 'Diploma': 2, 'Bachelors': 3, 'Masters': 4, 'phd': 5}}])

# Original data
train_df

# Fit and transform: each degree becomes its integer rank
encoder.fit_transform(train_df)
One Hot Encoding
We use this categorical data encoding technique when the features are nominal (do not have
any order). In one-hot encoding, for each level of a categorical feature, we create a new
variable. Each category is mapped to a binary variable containing either 0 or 1, where 0
represents the absence and 1 represents the presence of that category.
These newly created binary features are known as dummy variables. The number of dummy
variables depends on the levels present in the categorical variable. This might sound
complicated, so let us take an example. Suppose we have a dataset with a category Animal,
containing different animals such as Dog, Cat, Sheep, Cow, and Lion, and we have to one-hot
encode this data.
After encoding, in the second table, we have dummy variables, each representing a category
in the feature Animal. For each category that is present, we have 1 in the column of that
category and 0 in the others. Let's see how to implement one-hot encoding in Python.
import category_encoders as ce
import pandas as pd

data = pd.DataFrame({'City': [
    'Delhi', 'Mumbai', 'Hyderabad', 'Chennai', 'Bangalore',
    'Delhi', 'Hyderabad', 'Bangalore', 'Delhi']})

# Create object for one-hot encoding (one new binary column per city)
encoder = ce.OneHotEncoder(cols='City', return_df=True, use_cat_names=True)

# Original Data
data
#Fit and transform Data
data_encoded = encoder.fit_transform(data)
data_encoded
Now let’s move to another very interesting and widely used encoding technique i.e Dummy
encoding.
Dummy Encoding
The dummy coding scheme is similar to one-hot encoding. This categorical data encoding
method transforms the categorical variable into a set of binary variables (also known as
dummy variables). Whereas one-hot encoding uses N binary variables for N categories in a
variable, dummy encoding is a small improvement over it: it uses N-1 features to represent
N labels/categories.
To understand this better, let's see the image below, where the same data is coded using
both one-hot encoding and dummy encoding. While one-hot encoding uses 3 variables to
represent the data, dummy encoding uses only 2 variables to code 3 categories.
Let us implement it in Python.

import pandas as pd

data = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Hyderabad',
                              'Chennai', 'Bangalore', 'Delhi', 'Hyderabad']})

# Original Data
data

# Dummy encoding with pandas; drop_first drops one redundant column
data_encoded = pd.get_dummies(data=data, drop_first=True)
data_encoded
Here, using the drop_first argument, we represent the first label, Bangalore, with all 0s.
Drawbacks of One-Hot and Dummy Encoding
One-hot encoding and dummy encoding are two powerful and effective encoding schemes. They
are also very popular among data scientists, but they may not be as effective when:
1. A large number of levels are present in the data. If a feature variable has many
categories, we need a comparable number of dummy variables to encode it. For example, a
column with 30 different values will require 30 new variables for coding.
2. We have multiple categorical features in the dataset. A similar situation occurs, and we
again end up with several binary features, each representing a categorical feature and its
multiple categories, e.g., a dataset having 10 or more categorical columns.
In both cases, these two encoding schemes introduce sparsity in the dataset, i.e., several
columns contain 0s and few contain 1s. In other words, they create many dummy features
without adding much information.
They may also lead to the dummy variable trap, a phenomenon in which features become highly
correlated, meaning the value of one variable can easily be predicted using the others.
Due to the massive increase in dataset size, such encoding slows down model training and
degrades overall performance, ultimately making the model computationally expensive.
Further, these encodings are not an optimal choice for tree-based models.
Effect Encoding:
This encoding technique is also known as deviation encoding or sum encoding. Effect
encoding is similar to dummy encoding, with a small difference: dummy coding uses 0 and 1,
while effect encoding uses three values, 1, 0, and -1. The row containing only 0s in dummy
encoding is encoded with -1s in effect encoding. In the dummy encoding example, the city
Bangalore at index 4 was encoded as 0000; in effect encoding it is represented by
-1 -1 -1 -1.
Let us see how we implement it in Python:

import category_encoders as ce
import pandas as pd

data = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Hyderabad',
                              'Chennai', 'Bangalore', 'Delhi', 'Hyderabad']})

# Create object for sum (effect) encoding
encoder = ce.sum_coding.SumEncoder(cols='City', verbose=False)
#Original Data
data
encoder.fit_transform(data)
Effect encoding is an advanced technique. In case you are interested to know more about
effect encoding, refer to this interesting paper.
Hash Encoder
To understand hash encoding, it is necessary to know about hashing. Hashing is the
transformation of an arbitrary-size input into a fixed-size value. We use hashing
algorithms to perform hashing operations, i.e., to generate the hash value of an input.
Further, hashing is a one-way process; in other words, one cannot recover the original
input from its hash representation.
Hashing has several applications, such as data retrieval, checking for data corruption, and
data encryption. Multiple hash functions are available, for example Message Digest (MD,
MD2, MD5), Secure Hash Function (SHA0, SHA1, SHA2), and many more.
Just like one-hot encoding, the hash encoder represents categorical features using new
dimensions. Here, the user can fix the number of dimensions after transformation using the
n_components argument. Here is what I mean: a feature with 5 categories can be represented
using N new features; similarly, a feature with 100 categories can also be transformed
using N new features. Doesn't this sound amazing?
By default, the hashing encoder uses the md5 hashing algorithm, but the user can pass any
algorithm of their choice. If you want to explore the md5 algorithm, I suggest this paper.
import category_encoders as ce
import pandas as pd
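Continuing from the imports above, here is a minimal sketch of the hashing encoder in use;
the 'Month' column and n_components=6 are illustrative choices, not fixed by the text:

# Hypothetical data with a higher-cardinality 'Month' column
data = pd.DataFrame({'Month': ['January', 'April', 'March', 'April',
                               'February', 'June', 'July', 'June', 'September']})

# Create object for hash encoder, fixing the output to 6 dimensions
encoder = ce.HashingEncoder(cols=['Month'], n_components=6)

# Fit and transform data: 9 month labels become 6 hashed columns
encoder.fit_transform(data)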
Since hashing transforms the data into fewer dimensions, it may lead to loss of
information. Another issue faced by the hashing encoder is collision: because a large
number of features are mapped into fewer dimensions, multiple values can be represented by
the same hash value.
Despite this, hashing encoders have been very successful in some Kaggle competitions, and
they are well worth trying when the dataset has high-cardinality features.
Binary Encoding
Binary encoding is a combination of hash encoding and one-hot encoding. In this encoding
scheme, the categorical feature is first converted into numerical form using an ordinal
encoder. The numbers are then converted into binary, and the binary digits are split across
separate columns.
Binary encoding works really well when there is a large number of categories, for example
the cities in a country where a company supplies its products.
# Import the libraries
import category_encoders as ce
import pandas as pd

data = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Hyderabad',
                              'Chennai', 'Bangalore', 'Delhi', 'Hyderabad']})

# Create object for binary encoding
encoder = ce.BinaryEncoder(cols=['City'], return_df=True)

# Original Data
data
#Fit and Transform Data
data_encoded=encoder.fit_transform(data)
data_encoded
Binary encoding is a memory-efficient encoding scheme, as it uses fewer features than
one-hot encoding. Further, it reduces the curse of dimensionality for data with high
cardinality.
Base N Encoding
Before diving into BaseN encoding, let's first understand what 'base' means here.
In a numeral system, the base (or radix) is the number of digits, or combination of digits
and letters, used to represent numbers. The most common base we use in everyday life is 10,
the decimal system, where 10 unique digits (0 to 9) represent all numbers. Another widely
used system is binary, with base 2, which uses only the digits 0 and 1 to express all
numbers.
For binary encoding, the base is 2, which means the numerical value of each category is
converted into its binary form. If you want to change the base of the encoding scheme, you
can use the Base N encoder. When there are many categories and binary encoding cannot
handle the dimensionality, we can use a larger base such as 4 or 8.
# Import the libraries
import category_encoders as ce
import pandas as pd

data = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Hyderabad',
                              'Chennai', 'Bangalore', 'Delhi', 'Hyderabad']})

# Create object for BaseN encoding with base 5
encoder = ce.BaseNEncoder(cols=['City'], return_df=True, base=5)

# Original Data
data

# Fit and transform the data
encoder.fit_transform(data)
In the above example, I have used base 5, also known as the quinary system, on data similar
to the binary encoding example. While binary encoding represents the same data with 4 new
features, BaseN encoding uses only 3 new variables.
Hence, the BaseN encoding technique further reduces the number of features required to
represent the data efficiently and improves memory usage. The default base for Base N is 2,
which is equivalent to binary encoding.
Target Encoding
Target encoding is a Bayesian encoding technique: Bayesian encoders use information from
the dependent/target variable to encode the categorical data.
In target encoding, we calculate the mean of the target variable for each category and
replace the category with that mean value. In the case of a categorical target variable,
the posterior probability of the target replaces each category.
# Import the libraries
import pandas as pd
import category_encoders as ce

# Sample data: a categorical 'class' column and a numeric target 'Marks'
data = pd.DataFrame({'class': ['A', 'B', 'C', 'B', 'C', 'A', 'A', 'A'],
                     'Marks': [50, 30, 70, 80, 45, 97, 80, 68]})

# Create object for target encoding on the 'class' column
encoder = ce.TargetEncoder(cols='class')

# Original Data
data
#Fit and Transform Train Data
encoder.fit_transform(data['class'],data['Marks'])
We perform target encoding on the training data only, and encode the test data using the
statistics obtained from the training dataset.
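As a sketch of that discipline (train_df and test_df are hypothetical splits with the same
'class' and 'Marks' columns):

# Fit the encoder on training data only...
encoder = ce.TargetEncoder(cols='class')
encoder.fit(train_df['class'], train_df['Marks'])

# ...then transform both sets with the statistics learned from train
train_encoded = encoder.transform(train_df['class'])
test_encoded = encoder.transform(test_df['class'])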