
Unit-4

Data Visualization
Introduction to data visualization
Visualization is the graphical representation of data that can make information easy to
analyze and understand. Data visualization has the power to illustrate complex data
relationships and patterns with the help of simple designs consisting of lines, shapes, and
colors.
Enterprises constantly seek the latest advanced visual analytics tools so that they can present
their information through simple, precise, illustrative and attractive diagrams.
Data visualization is considered an evolving blend of art and science that has brought
revolutionary change to the corporate sector and will continue to do so over the next few
years. It plays a major role in decision making in the analytics world.
Question: What are the uses of data visualization?
To summarize, data visualization can help to:
1. Identify outliers in data: An outlier is a data point that lies far away from the other
related data points. Outliers may occur for several reasons, such as measurement error,
data entry error, experimental error, intentional inclusion of outliers, sampling error or natural
occurrence of outliers.
For data analysis, outliers should be excluded from the dataset as far as possible, since they
may mislead the analysis process, resulting in abruptly different, incorrect results and
longer training time. With the help of data visualization, outliers in data can be easily
detected and then removed before further analysis.
2. Improve response time: Data visualization gives a quick glance of the entire data and, in
turn, allows analysts or scientists to quickly identify issues, thereby improving response time.
This is in contrast to huge chunks of information displayed in textual or tabular
format covering multiple lines or records.
3. Greater simplicity: Data, when displayed graphically in a concise format, allows analysts or
scientists to examine the picture easily. The data to be concentrated on gets simplified, as
analysts or scientists interact only with relevant data.
4. Easier visualization of patterns: Data presented graphically permits analysts to effortlessly
understand the content for identifying new patterns and trends that are otherwise almost
impossible to analyze. Trend analysis, or time-series analysis, is in huge demand in the
market for the continuous study of trends in the stock market, companies and business sectors.
5. Business analysis made easy: Business analysts can handle important decision-making
tasks such as sales prediction, product promotion, and customer behaviour analysis through
the use of correct data visualization techniques.
6. Enhanced collaboration: Advanced visualization tools make it easier for teams to
collaboratively go through the reports for instant decision-making.

The findings in a visualization graph may be subtle, yet they can have a profound impact,
helping a data analyst interpret the information easily. The most challenging part, however,
is to learn how data visualization works and which visualization tool serves the best purpose
for analyzing a given piece of information.
What is data visualization?
“A picture speaks a thousand words.”
Similarly, an infographic/visual can help us analyze data and uncover hidden patterns much
more easily.
Why visualize data?
Data visualization is a way you can create a story through your data. When data is complex
and understanding the micro-details is essential, the best way is to analyze data through
visuals.
Question: What are the two purposes of data visualization?
Visuals can be used for two purposes:
1. Exploratory data analysis: This is used by data analysts, statisticians, and data scientists to
better understand data. As the name suggests, it is used to explore the hidden trends and
patterns in data.
2. Explanatory data analysis: Once the analysts understand the data and find their results, the
best way to convey their ideas and findings is through visuals! This is used to craft a story
that will appeal to the viewer, offering deeper insights.

Exploratory analysis of Haberman’s Survival Data - Example


The dataset contains cases from a study that was conducted between 1958 and 1970 at the
University of Chicago’s Billings Hospital on the survival of patients who had undergone
surgery for breast cancer.
The attributes include:
1. Age of patient at the time of operation (numerical)
2. Patient’s year of operation (year – 1900, numerical)
3. Number of positive axillary nodes detected (numerical)
4. Survival status (class attribute)
1 = the patient survived 5 years or longer
2 = the patient died within 5 years
Let’s first start by using statistics to understand the data:
We see there are 306 rows and 4 columns. Further, upon examining the attributes, we
understand how the data is distributed. To find out how many examples of each class we
have, we can use a bar chart.
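A minimal sketch of these first steps in Python, assuming the dataset has been saved locally as haberman.csv (the file name and column names are assumptions):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assumed file name and column names for the Haberman dataset
cols = ['age', 'op_year', 'axil_nodes', 'survival_status']
df = pd.read_csv('haberman.csv', names=cols)

print(df.shape)                               # expect (306, 4)
print(df['survival_status'].value_counts())   # examples per class

# Bar chart of class counts to see the imbalance
sns.countplot(x='survival_status', data=df)
plt.show()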
We see that the data is imbalanced with more survivors than those who couldn’t survive. To
further scan the data, let’s see different plots.
Probability Density Function

1. A large portion, from 30 to 80 years, overlaps between the two classes.
2. People in the age group 20-40 are more likely to survive, those aged 40-60 are more
likely not to survive, the 60-80 age group has roughly equal chances of survival and
death, and those above 80 have higher chances of not surviving.
3. Age alone cannot distinguish whether a person will survive or not.
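The density plot behind these observations can be recreated with a kernel density estimate, which approximates the probability density function of each class. A minimal sketch, reusing the df and the assumed column names from above (common_norm requires seaborn 0.11+):

# PDF of age for each survival class (kernel density estimate)
sns.kdeplot(data=df, x='age', hue='survival_status', common_norm=False)
plt.show()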
Box-Plot

Box plots tell us about the distribution of data and help us scan for outliers. Notice that the
survivors have fewer nodes than those who could not survive. Interesting, isn't it? Also notice
that even though the number of nodes is a more useful feature, there is some overlap between
the two classes.
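A hedged sketch of this box plot, with the same assumed df and column names:

# Box plot of positive axillary nodes per class; points beyond the
# whiskers are candidate outliers
sns.boxplot(x='survival_status', y='axil_nodes', data=df)
plt.show()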
Scatter-Plot

We see from the scattered points that, irrespective of the year, patients having 0 nodes appear
to have been survivors (a sketch of this scatter plot follows). Does this mean that 0 nodes
ensure survival? See the violin plot!
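A sketch of the scatter plot above, again under the assumed column names:

# Operation year vs. number of positive nodes, colored by class
sns.scatterplot(data=df, x='op_year', y='axil_nodes', hue='survival_status')
plt.show()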

Violin-Plot

From the plot above, we see that there are non-survivors with 0 nodes! Violin plots enable us
to view the distribution and the box plot in one visual. Useful, isn't it? There is so much we
can learn from the visuals. Visualize to understand. Visualize to explain your understanding.
A sketch of the violin plot follows; after that, I have compiled a few tips and tools to get you
started.
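A hedged sketch of the violin plot, with the same assumed df:

# Violin plot: a box plot combined with the kernel density on either side
sns.violinplot(x='survival_status', y='axil_nodes', data=df)
plt.show()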

Data Visualization Tools


1. Tableau: Simple to use, effective and secure. It’s very popular and used to pre-process
and visualize data effectively. Data sharing is also possible.
2. Microsoft Power BI: Data Visualization platform focused on creating data-driven
solutions for business problems. It is used to pre-process, analyze and share
meaningful insights with ease. Other tools include FusionCharts, Dash, Plotly,
QlikView.
3. MS Excel: This is the most common tool used by analysts to quickly handle data, sort,
visualize and perform preprocessing on data.
Type of Charts/Graphs/Plots/Diagrams/Maps:
1. Graphs/Plots: Area Graph, Bar Chart, Box and Whisker Plot, Bubble Chart, Bullet Graph,
Candlestick Chart, Density Plot, Error Bars, Histogram, Kagi Chart, Line Graph, Marimekko
Chart, Multi-set Bar Chart, OHLC Chart, Parallel Coordinates Plot, Point & Figure Chart,
Population Pyramid, Radar Chart, Radial Bar Chart, Radial Column Chart, Scatterplot, Span
Chart, Spiral Plot, Stacked Area Graph, Stacked Bar Graph, Stream Graph, Violin Plot
2. Diagrams: Arc Diagram, Brainstorm, Chord Diagram, Flow Chart, Illustration Diagram,
Network Diagram, Non-ribbon Chord Diagram, Sankey Diagram, Timeline, Tree Diagram,
Venn Diagram
3. Tables: Calendar, Gantt Chart, Heatmap, Stem & Leaf Plot, Tally Chart, Time Table
4. Other: Circle Packing, Donut Chart, Dot Matrix Chart, Nightingale Rose Chart, Parallel
Sets, Pictogram Chart, Pie Chart, Proportional Area Chart, Sunburst Diagram, Treemap,
Word Cloud
5. Maps/Geographical: Bubble Map, Choropleth Map, Connection Map, Dot Map, Flow Map

Question: What is visual encoding?
Visual encoding: Visual encoding is the approach used to map data into visual structures,
thereby building an image on the screen.
There are two types of visual encoding variables:
i) planar
ii) retinal
Humans are sensitive to the retinal variables. They easily differentiate between various
colors, shapes, sizes and other properties.
Retinal variables were introduced by Bertin about 40 years ago, and the concept has become
quite popular recently. While there is some critique about the effectiveness of retinal
variables, most specialists find them useful.

Question: What is the important consideration to be made with visual encoding? What do
attribute values signify?

This is an important consideration because one visualization can be more effective than
another when the information it conveys is more easily perceived. The attribute values
signify important data characteristics, such as numerical data, categorical data, or ordinal
data. Spatiotemporal data contains special attributes such as geographical location (the
spatial dimension) and/or time (the temporal dimension).

Question: What is the visualization graph supposed to display? List out with the help of a
diagram.

Figure 4.1: Concepts of a visualization graph

The main question for visual encoding is:
‘What do we want to portray with the given data?’
Figure 4.1 illustrates the several concepts that a visualization graph may want to convey,
based on which a particular visualization tool is chosen. While simple data comparisons can
be made with a bar chart or column chart, data composition can be expressed with the help of
a pie chart or stacked column chart.
The use of an appropriate visualization graph is a challenging task and should be considered
an important factor for data analysis in data science.
Table 4.1 gives a basic idea of which visualization graph can be used to portray the accurate
role of data provided in a dataset:
Table 4.1: Role of data visualization and its corresponding visualization tool

Question: What are the retinal variables?


Retinal variables: Mapping of data is based on the visual cues (also called retinal variables)
such as location, size, color value, color hue, color saturation, transparency, shape, structure,
orientation, and so on.
Question: Describe the roles of retinal variables.
To represent data that involves three or more variables, these retinal variables play a major
role. For instance:
1. Size is an important visualizer for quantitative data, as a smaller size indicates a
lower value while a bigger size indicates a higher value.

2. Color hue plays an important role in data visualization: for instance, the red color
signifies something alarming, the blue color signifies something serene and peaceful,
while the yellow color signifies something bright and attractive.
3. Shape, such as circle, oval, diamond or rectangle, may signify different types of data
and is easily recognized by the eye for its distinguished look.
4. Orientation, such as vertical, horizontal or slanted, helps in signifying data trends
such as an upward trend or a downward trend.
5. Color saturation decides the intensity of color and can be used to differentiate visual
elements from their surroundings by displaying different scales of values.
6. Length decides the proportion of data and is a good visualization parameter for
comparing data of varying values.
7. Angles provide a sense of proportion, and this characteristic can help data analysts or
data scientists make better data comparisons.
8. Texture shows differentiation among data and is mainly used for data comparisons.
Visualization tools should be chosen based on the type of data to be represented in the
visualization graph. While, on the one hand, varying shapes can be used to represent nominal
data, on the other hand, various shadings of a particular color can be used to map data that
has a particular ranking or order (as in the case of ordinal data).
For numerical data, such as interval data or ratio data, the change in position or length in the
graph can best signify the values or patterns of data.

Visualization: encoding data using visual cues


Whenever we visualize, we are encoding data using visual cues, or “mapping” data onto
variation in size, shape or color, and so on. There are various ways of doing this, as this
primer illustrates:

These cues are not created equal, however. In the mid-1980s, statisticians William Cleveland
and Robert McGill ran some experiments with human volunteers, measuring how accurately
they were able to perceive the quantitative information encoded by different cues. This is
what they found:

This perceptual hierarchy of visual cues is important. When making comparisons with
continuous variables, aim to use cues near the top of the scale wherever possible.

Data types
There are three basic types of data:
i) something you can count,
ii) something you can order,
iii) and something you can just differentiate.
As is often the case, these types come down to three somewhat unintuitive terms:

1. Quantitative: Anything that has exact numbers.

For example, Effort in points: 0, 1, 2, 3, 5, 8, 13.


Duration in days: 1, 4, 666.

2. Ordered / Qualitative: Anything that can be compared and ordered.

User Story Priority: Must Have, Great, Good, Not Sure.


Bug Severity: Blocking, Average, Who Cares.

3. Categorical: Everything else.

Entity types: Bugs, Stories, Features, Test Cases.


Fruits: Apples, Oranges, Plums.

Planar and Retinal Variables

Now how do we present data? We have several visual encoding variables.

X and Y

Planar variables are known to everybody. If you've studied maths, you've been drawing
graphs across the X- and Y-axes.

Planar variables work for any data type. They work great to present any quantitative data. It's
a pity that we have to deal with flat screens and just two planar variables. Well, we can try to
use the Z-axis, but 3D charts look horrible on screen in 95.8% of cases.

So what should we do then to present three or more variables? We can use the retinal
variables!

Size

We know that size does matter. You can see the difference right away. Small is innocuous;
large is dangerous, perhaps. Size is a good visualizer for quantitative data.

Texture

Texture is less common. You can't touch it on screen, and it's usually less catchy than color.
So, in theory, texture can be used for soft encoding, but in practice it's better to pass on it.

Shape

Round circles ○, edgy stars ☆, solid rectangles █. We can easily distinguish dozens of
shapes. They do work well sometimes for the visual encoding of categories.

Orientation

Orientation is tricky.

While we're able to clearly identify vertical vs. horizontal lines, it is harder to use orientation
properly for visual encoding.

Color Value

Any color value can be moved over a scale. Greyscale is a good example. While we can't be
certain that the #999 color is lighter than #888, it's still a helpful technique to visualize
ordered data.

Color Hue

Red color is alarming. Green color is calm. Blue color is peaceful. Colors are great to
separate categories.

Color in More Detail

Color is the most interesting variable, so let's dig into some details here. There are three
different scales that we can use with color. We've already mentioned two of them: the
categorical scale (color hue) and the sequential scale (color value).

The diverging scale is somewhat new. It encodes positive and negative values, e.g.
temperatures in the range of -50 to +50 °C. It would be a mistake to use any other color scale
for that.

There are six primary colors:

The general rule of thumb is that you can use no more than a dozen colors to encode
categories effectively. If there are more, it'd be hard to differentiate between the categories
quickly.
These are the most commonly used colors:

“Avoiding catastrophe becomes the first principle in bringing color to information: Above
all, do no harm.”—Tufte
The next obvious question is:
Question: How to Apply the Retinal Variables to Data?
It is quite clear that we can't use every variable to present every data type. For example, it is
wrong to use color to represent numbers (1, 2, 3).
And it is bad to use size to represent various currencies (€, £, ¥).
Why on Earth should small circles stand for euros, and large circles for pounds?
Here's the retinal variables usage summary:

Note that planar variables can be applied to all the data types.

Indeed, we can use the X-axis for categories, ordered variables or numbers.

The Basic Example

OK, now let's tap on some techniques to visualize real data. The sample data is very simple;
we just want to visualize the quantity of items:

Item Type      Quantity
Features       3
Bugs           5
User Stories   6

We have just two variables: Item Type (categorical) and Item Quantity (well, quantitative).
All the possible choices are based on the table above:

Variable        Possible Encodings
Item Type       Orientation, Color, Shape, Texture, X (or Y)
Item Quantity   Orientation, Size, Value, X (or Y)

In theory, you can mix these variables as you wish. I'm going to try four combinations.

Shape + Value

Hmm, looks like a puzzle. Value doesn't work for the quantitative data, it seems. Let's try
something else!

Color + Size

Well, slightly better. The color coding works for entity types. For example, in TargetProcess
we've got green Features, red Bugs and blue User Stories. Still not very good.

A very simple rule in visualization is to never map scalar data to circle radii. Humans do
better at comparing relative areas, so if you want to map data to a shape, you have to map it
to its area.

Texture + Y

Almost great. But why this legend with texture? Can we just remove it? Yes! Let's use the X
and Y planar variables.

X+Y

Now we have the best result! It turned out that X+Y works great for a simple data set with
just two variables. So, there's no need to use retinal variables at all.
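A quick sketch of that X+Y chart with matplotlib (values taken from the table above):

import matplotlib.pyplot as plt

# X encodes the category (Item Type), Y encodes the quantity
item_types = ['Features', 'Bugs', 'User Stories']
quantities = [3, 5, 6]
plt.bar(item_types, quantities)
plt.ylabel('Quantity')
plt.show()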

Retinal variables should be used if you need to present three or more data sources.

The Four Variables Example

Three is quite trivial, so we'll take four variables. Say we have bugs, user stories, and
features, and we want to visualize some properties of these entities:

 Type
 Priority
 Average Effort in Points
 Average Cycle Time in Days

Here is our data:

Type           Priority       Average Effort   Average Cycle Time
Features       Must Have      30               40
Features       Good           20               40
Features       Nice to Have   15               20
Bugs           Fix ASAP       2                2
Bugs           Fix            2                8
Bugs           Fix if Time    5                12
User Stories   Must Have      8                10
User Stories   Good           5                7
User Stories   Nice to Have   8                7

We need to pick four variables. Surely there are other choices, but here's what I've selected:

Variable                     Type           Encoding
Entity Type                  Categorical    Color Hue
Priority                     Ordered        Color Value
Average Effort in Points     Quantitative   X
Average Cycle Time in Days   Quantitative   Y

Now it's easy to draw the chart. The important bugs are shown in deep red, the unimportant
ones in light red. The same pattern applies to features and user stories.
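A hedged matplotlib sketch of this chart; the data comes from the table above, while the exact shades are assumptions:

import matplotlib.pyplot as plt

# (type, priority, avg effort, avg cycle time) from the table above
rows = [
    ('Features', 'Must Have', 30, 40), ('Features', 'Good', 20, 40),
    ('Features', 'Nice to Have', 15, 20),
    ('Bugs', 'Fix ASAP', 2, 2), ('Bugs', 'Fix', 2, 8),
    ('Bugs', 'Fix if Time', 5, 12),
    ('User Stories', 'Must Have', 8, 10), ('User Stories', 'Good', 5, 7),
    ('User Stories', 'Nice to Have', 8, 7),
]

# Color hue encodes entity type; color value (dark to light) encodes priority
shades = {
    ('Features', 'Must Have'): '#006400', ('Features', 'Good'): '#32CD32',
    ('Features', 'Nice to Have'): '#98FB98',
    ('Bugs', 'Fix ASAP'): '#8B0000', ('Bugs', 'Fix'): '#FF4040',
    ('Bugs', 'Fix if Time'): '#FFA0A0',
    ('User Stories', 'Must Have'): '#00008B', ('User Stories', 'Good'): '#4169E1',
    ('User Stories', 'Nice to Have'): '#ADD8E6',
}

for t, p, effort, cycle in rows:
    plt.scatter(effort, cycle, color=shades[(t, p)], s=80, label=f'{t} / {p}')

plt.xlabel('Average Effort (points)')    # X: quantitative
plt.ylabel('Average Cycle Time (days)')  # Y: quantitative
plt.legend(fontsize=7)
plt.show()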

What can we say about this chart? Here are some useful observations:

 Bugs are usually smaller than user stories, and features are the largest entities.
 Important bugs are small and get fixed quickly.
 Important features are the largest, and it takes more time to release them (interesting
information, by the way!).
 Unimportant bugs are the largest, and it takes longer to fix them.
 There's a good correlation between effort and cycle time: it takes more time to deliver
large entities.

Of course, you can get the same info from the plain table above, but the chart is much more
fun to explore.

Mapping variables to encodings

 What is Categorical Data?


 Label Encoding or Ordinal Encoding
 One hot Encoding
 Dummy Encoding
 Effect Encoding
 Binary Encoding
 BaseN Encoding
 Hash Encoding
 Target Encoding

What is categorical data?


Since we are going to be working with categorical variables here, this is a quick refresher
with a couple of examples. Categorical variables are usually represented as ‘strings’ or
‘categories’ and are finite in number. Here are a few examples:
1. The city where a person lives: Delhi, Mumbai, Ahmedabad, Bangalore, etc.
2. The department a person works in: Finance, Human resources, IT, Production.
3. The highest degree a person has: High school, Diploma, Bachelors, Masters, PhD.
4. The grades of a student: A+, A, B+, B, B- etc.
In the above examples, the variables only have definite possible values. Further, we can see
there are two kinds of categorical data:
 Ordinal Data: The categories have an inherent order
 Nominal Data: The categories do not have an inherent order
In ordinal data, while encoding, one should retain the information regarding the order in
which the category is provided. As in the above example, the highest degree a person
possesses gives vital information about their qualification, and the degree is an important
feature in deciding whether a person is suitable for a post or not.
While encoding nominal data, we have to consider the presence or absence of a feature. In
such a case, no notion of order is present. For example, take the city a person lives in: it is
important to retain where a person lives, but there is no order or sequence, and living in
Delhi is no 'greater' or 'smaller' than living in Bangalore.
For encoding categorical data, we have the Python package category_encoders. The
following command installs it:
pip install category_encoders

Label Encoding or Ordinal Encoding


We use this categorical data encoding technique when the categorical feature is ordinal. In
this case, retaining the order is important. Hence encoding should reflect the sequence.
In Label encoding, each label is converted into an integer value. We will create a variable that
contains the categories representing the education qualification of a person.
import category_encoders as ce
import pandas as pd
train_df = pd.DataFrame({'Degree': ['High school','Masters','Diploma','Bachelors',
                                    'Bachelors','Masters','Phd','High school',
                                    'High school']})

# create an object of OrdinalEncoder; note the mapping key 'Phd' must match
# the spelling in the data exactly ('phd' would leave that category unmapped)
encoder = ce.OrdinalEncoder(cols=['Degree'], return_df=True,
                            mapping=[{'col': 'Degree',
                                      'mapping': {'None': 0, 'High school': 1,
                                                  'Diploma': 2, 'Bachelors': 3,
                                                  'Masters': 4, 'Phd': 5}}])

#Original data
train_df

#fit and transform train data


df_train_transformed = encoder.fit_transform(train_df)

One Hot Encoding
We use this categorical data encoding technique when the features are nominal (do not have
any order). In one-hot encoding, for each level of a categorical feature, we create a new
variable. Each category is mapped to a binary variable containing either 0 or 1, where 0
represents the absence and 1 represents the presence of that category.
These newly created binary features are known as Dummy variables. The number of dummy
variables depends on the levels present in the categorical variable. This might sound
complicated. Let us take an example to understand this better. Suppose we have a dataset
with a category animal, having different animals like Dog, Cat, Sheep, Cow, Lion. Now we
have to one-hot encode this data.

After encoding, in the second table, we have dummy variables each representing a category
in the feature Animal. Now for each category that is present, we have 1 in the column of that
category and 0 for the others. Let’s see how to implement a one-hot encoding in python.

import category_encoders as ce
import pandas as pd
data = pd.DataFrame({'City': ['Delhi','Mumbai','Hyderabad','Chennai','Bangalore',
                              'Delhi','Hyderabad','Bangalore','Delhi']})

#Create object for one-hot encoding


encoder = ce.OneHotEncoder(cols='City', handle_unknown='return_nan',
                           return_df=True, use_cat_names=True)

#Original Data
data

#Fit and transform Data
data_encoded = encoder.fit_transform(data)
data_encoded

Now let’s move to another very interesting and widely used encoding technique i.e Dummy
encoding.

Dummy Encoding
The dummy coding scheme is similar to one-hot encoding. This categorical data encoding
method transforms the categorical variable into a set of binary variables (also known as
dummy variables). In the case of one-hot encoding, for N categories in a variable, it uses N
binary variables. Dummy encoding is a small improvement over one-hot encoding: it uses
N-1 features to represent N labels/categories.
To understand this better, see the image below. Here we are coding the same data using both
the one-hot encoding and dummy encoding techniques. One-hot uses 3 variables to represent
the data, whereas dummy encoding uses 2 variables to code 3 categories.

Let us implement it in python.
import category_encoders as ce
import pandas as pd
data = pd.DataFrame({'City': ['Delhi','Mumbai','Hyderabad','Chennai','Bangalore',
                              'Delhi','Hyderabad']})

#Original Data
data

#encode the data


data_encoded=pd.get_dummies(data=data,drop_first=True)
data_encoded

Here, using the drop_first argument, the first label (Bangalore) is dropped and represented by
all 0s.
Drawbacks of One-Hot and Dummy Encoding
One-hot encoding and dummy encoding are two powerful and effective encoding schemes.
They are also very popular among data scientists, but they may not be as effective when:
1. A large number of levels are present in the data. If there are many categories in a
feature variable, we need a similar number of dummy variables to encode the data.
For example, a column with 30 different values will require 30 new variables for
coding.
2. If we have multiple categorical features in the dataset, a similar situation will occur
and we will end up with many binary features, each representing a categorical feature
and its multiple categories, e.g. a dataset having 10 or more categorical columns.
In both the above cases, these two encoding schemes introduce sparsity in the dataset, i.e.
several columns full of 0s and only a few 1s. In other words, they create multiple dummy
features in the dataset without adding much information.
Also, they might lead to the dummy variable trap, a phenomenon where features are highly
correlated: using the other variables, we can easily predict the value of a variable.
Due to the massive increase in the size of the dataset, these encodings slow down the
learning of the model and deteriorate its overall performance, ultimately making the model
computationally expensive. Further, when using tree-based models, these encodings are not
an optimal choice.

Effect Encoding:
This encoding technique is also known as Deviation Encoding or Sum Encoding. Effect
encoding is similar to dummy encoding, with a small difference: in dummy coding, we use 0
and 1 to represent the data, but in effect encoding, we use three values, i.e. 1, 0, and -1.
The row containing only 0s in dummy encoding is encoded as all -1s in effect encoding. In
the dummy encoding example, the city Bangalore at index 4 was encoded as 0000; in effect
encoding it is represented by -1 -1 -1 -1.
Let us see how we implement it in Python:
import category_encoders as ce
import pandas as pd
data = pd.DataFrame({'City': ['Delhi','Mumbai','Hyderabad','Chennai','Bangalore',
                              'Delhi','Hyderabad']})
encoder = ce.sum_coding.SumEncoder(cols='City', verbose=False)

#Original Data
data

encoder.fit_transform(data)

Effect encoding is an advanced technique. In case you are interested to know more about
effect encoding, refer to this interesting paper.

Hash Encoder
To understand hash encoding it is necessary to know about hashing. Hashing is the
transformation of an arbitrary-size input into a fixed-size value. We use hashing
algorithms to perform hashing operations, i.e. to generate the hash value of an input. Further,
hashing is a one-way process; in other words, one cannot recover the original input from the
hash representation.
Hashing has several applications, such as data retrieval, checking for data corruption, and
data encryption. Multiple hash functions are available, for example Message Digest
(MD, MD2, MD5), Secure Hash Algorithm (SHA0, SHA1, SHA2), and many more.
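As a quick illustration of the fixed-size, one-way property, using Python's standard hashlib module (illustration only, not the encoder's internal code):

import hashlib

# Any input maps to a fixed-size digest, and the original input cannot
# be recovered from the digest
print(hashlib.md5(b'January').hexdigest())    # 32 hex characters
print(hashlib.md5(b'September').hexdigest())  # also 32 hex characters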
Just like one-hot encoding, the hash encoder represents categorical features using new
dimensions. Here, the user can fix the number of dimensions after transformation using the
n_components argument. Here is what I mean: a feature with 5 categories can be represented
using N new features; similarly, a feature with 100 categories can also be transformed using
N new features. Doesn't this sound amazing?
By default, the hashing encoder uses the md5 hashing algorithm, but a user can pass any
algorithm of their choice. If you want to explore the md5 algorithm, I suggest this paper.
import category_encoders as ce
import pandas as pd

#Create the dataframe


data = pd.DataFrame({'Month': ['January','April','March','April','February','June',
                               'July','June','September']})

#Create object for hash encoder


encoder=ce.HashingEncoder(cols='Month',n_components=6)

#Fit and Transform Data


encoder.fit_transform(data)

Since hashing transforms the data into fewer dimensions, it may lead to loss of information.
Another issue faced by the hashing encoder is collision: since a large number of features are
mapped into fewer dimensions, multiple values can end up represented by the same hash
value. This is known as a collision.
Nevertheless, hashing encoders have been very successful in some Kaggle competitions, and
they are worth trying if the dataset has high-cardinality features.

Binary Encoding
Binary encoding is a combination of hash encoding and one-hot encoding. In this encoding
scheme, the categorical feature is first converted into numerical form using an ordinal
encoder. Then the numbers are converted into binary. After that, each binary digit is split
into a separate column.
Binary encoding works really well when there are a large number of categories, for example
the cities in a country where a company supplies its products.
#Import the libraries
import category_encoders as ce
import pandas as pd

#Create the Dataframe


data = pd.DataFrame({'City': ['Delhi','Mumbai','Hyderabad','Chennai','Bangalore',
                              'Delhi','Hyderabad','Mumbai','Agra']})

#Create object for binary encoding


encoder = ce.BinaryEncoder(cols=['City'], return_df=True)

#Original Data
data

#Fit and Transform Data
data_encoded=encoder.fit_transform(data)
data_encoded

Binary encoding is a memory-efficient encoding scheme as it uses fewer features than one-
hot encoding. Further, it reduces the curse of dimensionality for data with high cardinality.

Base N Encoding
Before diving into BaseN encoding, let's first try to understand what the base is here.
In the numeral system, the base or radix is the number of unique digits (or combinations of
digits and letters) used to represent numbers. The most common base we use in our life is 10,
the decimal system, which uses the 10 unique digits 0 to 9 to represent all numbers.
Another widely used system is binary, i.e. base 2, which uses just the digits 0 and 1 to
express all numbers.
For binary encoding, the base is 2, which means it converts the numerical values of a
category into their respective binary form. If you want to change the base of the encoding
scheme, you can use the BaseN encoder. When there are many categories and binary
encoding cannot handle the dimensionality, we can use a larger base such as 4 or 8.
#Import the libraries
import category_encoders as ce
import pandas as pd

#Create the dataframe


data = pd.DataFrame({'City': ['Delhi','Mumbai','Hyderabad','Chennai','Bangalore',
                              'Delhi','Hyderabad','Mumbai','Agra']})

#Create an object for Base N Encoding


encoder = ce.BaseNEncoder(cols=['City'], return_df=True, base=5)

#Original Data
data

#Fit and Transform Data


data_encoded=encoder.fit_transform(data)
data_encoded

In the above example, I have used base 5, also known as the quinary system. It is similar to
the binary encoding example: while binary encoding represents the same data with 4 new
features, BaseN encoding here uses only 3 new variables.
Hence the BaseN encoding technique further reduces the number of features required to
represent the data efficiently and improves memory usage. The default base for BaseN is 2,
which is equivalent to binary encoding.

Target Encoding
Target encoding is a Bayesian encoding technique.
Bayesian encoders use information from the dependent/target variable to encode the
categorical data.
In target encoding, we calculate the mean of the target variable for each category and replace
the category with that mean value. In the case of a categorical target variable, the posterior
probability of the target replaces each category.
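For a numeric target, this is conceptually just a per-category mean. A hand-rolled sketch of the idea (not the library's implementation):

import pandas as pd

df = pd.DataFrame({'class': ['A','B','C','B','C','A','A','A'],
                   'Marks': [50,30,70,80,45,97,80,68]})
# Replace each category with the mean of 'Marks' within that category
means = df.groupby('class')['Marks'].mean()
df['class_encoded'] = df['class'].map(means)
print(df)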
#import the libraries
import pandas as pd
import category_encoders as ce

#Create the Dataframe


data = pd.DataFrame({'class': ['A','B','C','B','C','A','A','A'],
                     'Marks': [50,30,70,80,45,97,80,68]})

#Create target encoding object


encoder=ce.TargetEncoder(cols='class')

#Original Data
data

#Fit and Transform Train Data
encoder.fit_transform(data['class'],data['Marks'])

We perform target encoding on the training data only, and encode the test data using the
results obtained from the training dataset.

