
Unit-III:

DATA VISUALIZATION
Introduction
Data visualization is a critical aspect of data analysis and communication. It involves the use of graphical
representations to convey information and insights from data. Visualizations make complex datasets more
understandable, revealing patterns, trends, relationships, and anomalies that might not be immediately
apparent in raw data. Here are some key points about data visualization:

Importance of Data Visualization:

1. Clarity and Communication: Visualizations help simplify complex data and make it more
accessible to a wider audience. They facilitate effective communication of findings and insights.

2. Pattern Recognition: Visualizations make it easier to identify patterns, trends, and correlations in
data, which can lead to more informed decision-making.

3. Exploration: Visualizations allow you to explore data from different angles, enabling the
discovery of unexpected insights and relationships.

4. Storytelling: Visualizations can be used to tell a data-driven story, guiding viewers through a
narrative and helping them understand the context and significance of the data.

5. Quick Understanding: Visual representations often allow for quicker comprehension of data
compared to analysing raw numbers or textual descriptions.

Types of Data Visualization


1. Category counts
Examples: pie chart, bar chart, histogram, tree map, etc.
2. Relationships among variables
Examples: scatter plot, line chart, area chart, etc.

More specific examples

1. Bar Chart
2. Pie Chart
3. Donut Chart
4. Half Donut Chart
5. Multi-Layer Pie Chart
6. Line Chart
7. Scatter Plot
8. Cone Chart
9. Pyramid Chart
10. Funnel Chart
11. Radar Triangle
12. Radar Polygon
13. Polar Graph
14. Area Chart
15. Tree Chart
16. Flowchart
17. Table
18. Geographic Map
19. Icon Array
20. Percentage Bar
21. Gauge
22. Radial Wheel
23. Concentric Circles
24. Gantt Chart
25. Circuit Diagram
26. Timeline
27. Venn Diagram
28. Histogram
29. Mind Map
30. Dichotomous Key
31. Pert Chart
32. Choropleth Map

Some of these are explained in detail below:

1. Pie chart
 A pie chart is a graph that shows the relative frequency distribution of a nominal variable.
 A pie chart is a circle that’s divided into one slice for each value. The size of the slices shows their
relative frequency.
 This type of graph can be a good choice when you want to emphasize that one value is especially
frequent or infrequent, or when you want to present the overall composition of a variable.
 A disadvantage of pie charts is that it’s difficult to see small differences between frequencies. As a
result, it’s also not a good option if you want to compare the frequencies of different values.

2. Bar chart
 A bar chart is a graph that shows the frequency or relative frequency distribution of a categorical
variable (nominal or ordinal).
 The y-axis shows the frequencies or relative frequencies, and the x-axis shows the values.
Each value is represented by a bar, and the length or height of the bar shows the frequency of the
value.
 A bar chart is a good choice when you want to compare the frequencies of different values. It’s much
easier to compare the heights of bars than the angles of pie chart slices.
3. Histogram
 A histogram is a graph that shows the frequency or relative frequency distribution of a quantitative
variable. It looks similar to a bar chart.
 The continuous variable is grouped into interval classes, just like in a grouped frequency table. The
y-axis shows the frequencies or relative frequencies, and the x-axis shows the interval classes.
Each interval class is represented by a bar, and the height of the bar shows the frequency or
relative frequency of the interval class.
 A histogram is an effective visual summary of several important characteristics of a variable. At a
glance, you can see a variable’s central tendency and variability, as well as what probability
distribution it appears to follow, such as a normal, Poisson, or uniform distribution.

Although bar charts and histograms look similar, there are important differences: a bar chart shows the frequencies of a categorical variable, so its bars are separated by gaps and can be placed in any order, whereas a histogram shows a quantitative variable grouped into interval classes, so its bars touch and their order is fixed by the number line.

4. Line Graph

A line graph reveals trends or progress over time, and you can use it to show many different categories of
data. You should use it when you chart a continuous data set.
5. Scatter Plot

A scatter plot is a type of data visualization that shows the relationship between different variables. This data
is shown by placing various data points between an x- and y-axis.

Essentially, each of these data points looks “scattered” around the graph, giving this type of data
visualization its name.

Scatter plots can also be known as scatter diagrams or x-y graphs, and the point of using one of these is to
determine if there are patterns or correlations between two variables.

The patterns or correlations found within a scatter plot will have a few different features.

 Linear or Nonlinear: A linear correlation forms a straight line in its data points while a nonlinear
correlation might have a curve or other form within the data points.
 Strong or Weak: A strong correlation will have data points close together while a weak correlation
will have data points that are further apart.
 Positive or Negative: A positive correlation will point up (i.e., the x- and y-values are both
increasing) while a negative correlation will point down (i.e., the x-values are increasing while the
corresponding y-values are decreasing).
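The strength and direction just described can be quantified with the Pearson correlation coefficient. Here is a minimal sketch using NumPy; the sample arrays are invented for illustration:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y_pos = np.array([2, 4, 6, 8, 10])   # rises with x: positive correlation
y_neg = np.array([10, 8, 6, 4, 2])   # falls as x rises: negative correlation

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r
r_pos = np.corrcoef(x, y_pos)[0, 1]
r_neg = np.corrcoef(x, y_neg)[0, 1]
print(r_pos)  # close to +1: a strong positive correlation
print(r_neg)  # close to -1: a strong negative correlation
```

Values of r near 0 indicate a weak or nonexistent linear correlation; note that r only measures linear relationships, so a nonlinear pattern can still exist when r is small.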

6. Area Chart

An area chart or area graph is based on the line chart but is used primarily to communicate the summation of
data rather than individual data values, as in line charts. The area between the axis and the line is usually
emphasized with colours, textures or hatching. The area underneath the line helps depict how data
progressed over time and can be an excellent way to compare values without going too deep.
o Discover Change, Proportion, Rank
o Compare and understand your data without drilling too deep

Types of Area Charts

1. Step area charts: Area charts which use vertical and horizontal lines to connect the data points in a
series forming a step-like progression.

2. Spline area charts: Area charts in which data points are connected by smooth curves instead of
straight lines.

3. Stacked area charts: Area charts which show how much each part contributes to the whole amount.
The category series are each plotted as an area and stacked on top of each other.
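As an illustration, a stacked area chart can be sketched with Matplotlib's `stackplot`; the series values here are invented:

```python
import matplotlib.pyplot as plt

years = [2019, 2020, 2021, 2022]
product_a = [10, 12, 15, 18]   # each series becomes one stacked band
product_b = [5, 8, 9, 11]

plt.stackplot(years, product_a, product_b, labels=['Product A', 'Product B'])
plt.legend(loc='upper left')
plt.xlabel('Year')
plt.ylabel('Sales')
plt.title('Stacked Area Chart')
plt.show()
```

The bands are drawn on top of each other, so the top edge of the chart traces the total of all series.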

Bar Charts and Histograms: Used to display the distribution of categorical or numerical data.

Line Charts: Suitable for showing trends and changes over time, especially for continuous data.

Scatter Plots: Display individual data points as dots, often used to identify relationships between two
variables.

Pie Charts: Show parts of a whole, useful for representing proportions.


Heatmaps: Visualize data using colour intensity to represent values in a matrix.

Box Plots: Display data distribution and identify outliers and quartiles.
Area Charts: Similar to line charts but with the area below the line filled, useful for showing cumulative
data.
Tree Charts: Display hierarchical data as a tree, with stem and leaves representing the workflow.

Tree Maps: Display hierarchical data as nested rectangles, where the size of each rectangle represents a
quantity.

Network Graphs: Illustrate relationships between entities using nodes and edges.

Geospatial Maps: Visualize data on geographical maps to show spatial patterns and distributions.
Pandas Visualization commands:
Pandas, a popular data manipulation library in Python, provides some basic built-in visualization
functionality that is built on top of Matplotlib. This allows you to create simple visualizations directly from
Pandas DataFrames and Series. Here are some common Pandas visualization commands:

1. Line Plot:

import pandas as pd
import matplotlib.pyplot as plt
data = {'x_values': [1, 2, 3, 4, 5], 'y_values': [10, 15, 7, 12, 8]}
df = pd.DataFrame(data)
df.plot(x='x_values', y='y_values', kind='line')
plt.show()

2. Bar Plot:

df.plot(x='x_values', y='y_values', kind='bar')
plt.show()

3. Horizontal Bar Plot:

df.plot(x='x_values', y='y_values', kind='barh')
plt.show()

4. Histogram:

df['y_values'].plot(kind='hist')
plt.show()

5. Scatter Plot:

df.plot(x='x_values', y='y_values', kind='scatter')
plt.show()

6. Box Plot:

df['y_values'].plot(kind='box')
plt.show()

7. Area Plot:

df.plot(x='x_values', y='y_values', kind='area')
plt.show()

8. Pie Chart:

data = {'categories': ['A', 'B', 'C'], 'values': [30, 40, 20]}
df = pd.DataFrame(data)
df.set_index('categories').plot(y='values', kind='pie')
plt.show()
9. Bar Plot with Multiple Columns:

data = {'x_values': [1, 2, 3], 'col_a': [10, 20, 15], 'col_b': [5, 12, 9]}
df = pd.DataFrame(data)
df.plot(x='x_values', kind='bar')
plt.show()
These examples cover some of the basic visualization types you can create using Pandas. Remember that
you can customize the visualizations further by using Matplotlib functions to modify colors, labels, titles,
and more. While Pandas visualization is convenient for quick exploratory data analysis, for more complex
and customized visualizations, you might want to explore other libraries like Matplotlib, Seaborn, Plotly, or
others.

Matplotlib Visualization Commands


Matplotlib is one of the most widely used Python libraries for creating static, interactive, and publication-
quality visualizations. It provides a wide range of customization options and supports various types of plots
and charts. Here are some common Matplotlib visualization commands for creating different types of plots:

1. Line Plot:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [10, 15, 7, 12, 8]
plt.plot(x, y, marker='o')
plt.xlabel('X values')
plt.ylabel('Y values')
plt.title('Line Plot')
plt.show()

2. Scatter Plot:

plt.scatter(x, y, marker='o', color='blue')
plt.xlabel('X values')
plt.ylabel('Y values')
plt.title('Scatter Plot')
plt.show()

3. Bar Plot:

categories = ['A', 'B', 'C']
values = [30, 40, 20]
plt.bar(categories, values)
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Bar Plot')
plt.show()

4. Histogram:

data = [10, 15, 7, 12, 8, 25, 30, 18, 22]
plt.hist(data, bins=10, edgecolor='black')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram')
plt.show()

5. Box Plot:

data = [10, 15, 7, 12, 8, 25, 30, 18, 22]
plt.boxplot(data)
plt.ylabel('Value')
plt.title('Box Plot')
plt.show()

6. Pie Chart:

labels = ['A', 'B', 'C']
sizes = [30, 40, 20]
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=140)
plt.axis('equal')
plt.title('Pie Chart')
plt.show()

7. Area Plot:

plt.fill_between(x, y, color='skyblue', alpha=0.4)
plt.xlabel('X values')
plt.ylabel('Y values')
plt.title('Area Plot')
plt.show()

These are just a few examples of the types of visualizations you can create using Matplotlib. The library
offers extensive customization options, so you can modify colors, styles, labels, legends, and more to tailor
your visualizations to your specific needs.

Seaborn Visualization Commands


Seaborn is a powerful data visualization library built on top of Matplotlib. It provides a higher-level
interface for creating aesthetically pleasing and informative statistical graphics. Seaborn simplifies the
process of creating complex visualizations and supports a variety of plot types. Here are some common
Seaborn visualization commands:

1. Importing Seaborn:

import seaborn as sns
import matplotlib.pyplot as plt

2. Set Style and Context:

sns.set(style="whitegrid")
sns.set_context("notebook")

3. Line Plot:
x = [1, 2, 3, 4, 5]
y = [10, 15, 7, 12, 8]
sns.lineplot(x=x, y=y, marker='o')
plt.xlabel('X values')
plt.ylabel('Y values')
plt.title('Line Plot')
plt.show()

4. Scatter Plot:

sns.scatterplot(x=x, y=y, marker='o', color='blue')
plt.xlabel('X values')
plt.ylabel('Y values')
plt.title('Scatter Plot')
plt.show()

5. Bar Plot:

categories = ['A', 'B', 'C']
values = [30, 40, 20]
sns.barplot(x=categories, y=values)
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Bar Plot')
plt.show()

6. Histogram:

data = [10, 15, 7, 12, 8, 25, 30, 18, 22]
sns.histplot(data, bins=10, kde=True)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram')
plt.show()

7. Box Plot:

data = [10, 15, 7, 12, 8, 25, 30, 18, 22]
sns.boxplot(data=data)
plt.ylabel('Value')
plt.title('Box Plot')
plt.show()

8. Violin Plot:

sns.violinplot(data=data)
plt.ylabel('Value')
plt.title('Violin Plot')
plt.show()

9. Pair Plot:
iris = sns.load_dataset("iris")
sns.pairplot(iris, hue="species")
plt.show()

10. Heatmap:

data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
sns.heatmap(data, annot=True, cmap="YlGnBu")
plt.title('Heatmap')
plt.show()

These examples cover a variety of visualization types you can create using Seaborn. The library provides
extensive functionality for customization, colour palettes, and statistical enhancements, allowing you to
create informative and visually appealing visualizations with ease.

Data for visualization


Data types

o There are three basic types of data: something you can count, something you can order and
something you can just differentiate.
o As is often the case, these boil down to three somewhat unintuitive terms:

Quantitative
Anything that deals with exact numbers is called quantitative data.
For example, Effort in points: 0, 1, 2, 3, 5, 8, 13.
Duration in days: 1, 4, 666.

Ordered / Qualitative
Anything that can be compared and ordered is ordinal (ordered qualitative) data.
User Story Priority: Must Have, Great, Good, Not Sure.
Bug Severity: Blocking, Average, Who Cares.

Categorical
Everything else that can be differentiated and grouped is called categorical.
Entity types: Bugs, Stories, Features, Test Cases.
Fruits: Apples, Oranges, Plums.
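These three data types can be made explicit in pandas; here is a small sketch reusing the examples above:

```python
import pandas as pd

# Categorical: values can only be differentiated, not ordered
fruits = pd.Series(['Apples', 'Oranges', 'Plums'], dtype='category')

# Ordered / qualitative: an explicit ranking is attached to the categories
priority = pd.Categorical(
    ['Good', 'Must Have', 'Great'],
    categories=['Not Sure', 'Good', 'Great', 'Must Have'],
    ordered=True)
print(priority.max())  # 'Must Have' -- comparisons are meaningful

# Quantitative: exact numbers, so arithmetic applies
effort = pd.Series([0, 1, 2, 3, 5, 8, 13])
print(effort.mean())
```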

Different types of Data Visualizations are:


o Plotting Qualitative Variables
Plotting graphs for categorical variables
 The categorical data consists of categorical variables which represent the characteristics
such as a person’s gender, hometown etc. Categorical measurements are expressed in
terms of natural language descriptions, but not in terms of numbers. Sometimes
categorical data can take numerical values, but those numbers do not have mathematical
meaning.
 Some of the examples of the categorical data are as follows:
 Birthdate
 Favourite sport
 School Postcode
 Travel method to school etc.

 When you observe the above examples, birthdate and postcode contain numbers. Even though
they contain numerals, they are considered categorical data. An easy way to determine whether
given data is categorical or numerical is to try to calculate an average.
 If the average is meaningful, the data is numerical; if not, it is categorical. As in the example
above, the average of a birthdate or a postal code has no meaning, so they are treated as
categorical data.
 In statistics, a categorical variable is a variable that contains limited, and usually a fixed
number of possible values. They take values which are normally names or labels.
Examples are:
 The colour of a wall, like red, blue, pink, green, etc.,
 Gender of people, like male, female and transgender
 Blood group of a person: A, B, O, AB, etc.,

 These variables are used to assign each individual or another unit of observation to a
particular group or nominal category based on some qualitative property.
 Generally, each of the potential values of a categorical variable is called a level. The probability
distribution linked with a random categorical variable is known as the categorical distribution.
 Categorical variables are typically visualized using bar graphs and pie charts.

o Plotting Quantitative Variables


Plots graphs for numerical variables.
 Numerical data refers to data that is in the form of numbers, and not in any language or
descriptive form.
 Often referred to as quantitative data, numerical data is collected in number form and stands
apart from other data types because it can be manipulated statistically and arithmetically.
 It doesn't involve any natural-language description; it is quantitative in nature and is used to
measure quantities like a person's height, age, IQ, etc.
 It can be visualized using bar graphs and pie charts as well as line charts, scatter plots,
histograms, and box plots.

Here's an example dataset that you can use for visualization practice. This dataset consists of fictional sales
data for a company over a period of time. You can use this data to create various types of plots and
visualizations.

import pandas as pd
# Sample sales data
data = {
'Date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04',
'2023-01-05'],
'Product': ['A', 'B', 'A', 'C', 'B'],
'Revenue': [1000, 1500, 1200, 800, 2000],
'Profit': [200, 300, 250, 150, 400]
}
df = pd.DataFrame(data)
df['Date'] = pd.to_datetime(df['Date'])
print(df)

 This dataset includes the following columns:

- Date: The date of the sales transaction.
- Product: The product being sold (categorical).
- Revenue: The revenue generated from the sale.
- Profit: The profit generated from the sale.
We can use this dataset to create various visualizations, such as line plots to visualize trends over time, bar
plots to compare revenues and profits for different products, scatter plots to explore the relationship between
revenue and profit, and more.
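For instance, a few of the plots just mentioned can be sketched against this sales data (the DataFrame is re-created here so the snippet is self-contained):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'Date': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03',
                            '2023-01-04', '2023-01-05']),
    'Product': ['A', 'B', 'A', 'C', 'B'],
    'Revenue': [1000, 1500, 1200, 800, 2000],
    'Profit': [200, 300, 250, 150, 400]})

# Line plot: revenue trend over time
df.plot(x='Date', y='Revenue', kind='line', marker='o')
plt.show()

# Bar plot: total revenue per product
df.groupby('Product')['Revenue'].sum().plot(kind='bar')
plt.show()

# Scatter plot: relationship between revenue and profit
df.plot(x='Revenue', y='Profit', kind='scatter')
plt.show()
```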

You can manipulate and expand this dataset to suit your visualization needs and practice different types of
plots and charts.

Choosing the Right Visualization with respect to the type of data:

Selecting the appropriate visualization depends on the nature of the data and the insights you want to
convey:

- Use bar charts for comparisons and categorical data.
- Use line charts for time-based trends.
- Use scatter plots to show relationships and correlations.
- Use pie charts for simple proportional representations.
- Use heatmaps for data matrices with colour-encoded values.
- Use maps for geospatial data and location-based insights.
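As a trivial sketch, these rules of thumb can be written down as a lookup table (the goal names and the function are invented for illustration):

```python
# Hypothetical mapping from analysis goal to a sensible default chart
CHART_FOR_GOAL = {
    'compare categories': 'bar chart',
    'trend over time': 'line chart',
    'relationship between variables': 'scatter plot',
    'parts of a whole': 'pie chart',
    'matrix of values': 'heatmap',
    'spatial pattern': 'map',
}

def suggest_chart(goal):
    """Return a reasonable default chart type for a stated goal."""
    return CHART_FOR_GOAL.get(goal, 'start with a table')

print(suggest_chart('trend over time'))  # line chart
```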

Design Principles:

1. Simplicity: Keep the visualization clean and uncluttered for easy interpretation.
2. Labels and Titles: Clearly label axes, data points, and provide a descriptive title.
3. Colour Use: Use colours effectively to highlight data points or categories, but avoid excessive
colour usage.
4. Consistency: Maintain consistent design elements across different visualizations.
5. Context: Provide context and explanations to help viewers understand the visualization's
significance.

Remember that the choice of visualization should align with your data analysis goals and the audience you're
trying to reach. Effective data visualization enhances understanding and facilitates better decision-making.

Data visualization tools


1. Tableau: Tableau is a widely-used data visualization tool that allows users to create interactive and
shareable dashboards, reports, and charts. It supports a wide range of data sources and offers various
visualization options.

2. Power BI: Microsoft Power BI is another popular choice for creating interactive data visualizations and
reports. It integrates well with Microsoft products and offers extensive customization and sharing options.

3. matplotlib: matplotlib is a widely-used Python library for creating static, animated, and interactive
visualizations in a variety of formats and styles. It's particularly popular for creating plots within Jupyter
notebooks.

4. Seaborn: Seaborn is built on top of matplotlib and provides a higher-level interface for creating attractive
statistical visualizations. It's well-suited for tasks like visualizing distributions, correlations, and categorical
data.

5. ggplot2: ggplot2 is an R package that follows the Grammar of Graphics philosophy. It allows you to
create complex and customizable visualizations by combining simple building blocks.

6. D3.js: D3.js (Data-Driven Documents) is a JavaScript library for creating dynamic and interactive data
visualizations in web browsers. It gives you full control over the visual elements and is widely used for
custom visualizations.

7. Plotly: Plotly is a versatile data visualization library available in Python, R, and JavaScript. It supports a
wide range of chart types, including interactive plots, 3D charts, and maps.

8. QlikView/Qlik Sense: QlikView and Qlik Sense are business intelligence platforms that offer interactive
data visualization and exploration capabilities. They allow users to create dashboards and reports with
intuitive drag-and-drop interfaces.

9. Google Data Studio: Google Data Studio is a free tool that enables users to create interactive and
customizable dashboards and reports using data from various sources, including Google Analytics and
Google Sheets.

10. Periscope Data: Periscope Data is a data analysis and visualization platform that allows users to create
real-time dashboards and charts from various data sources. It's often used by data analysts and data
scientists.

11. Infogram: Infogram is a web-based tool that's particularly useful for creating infographics, charts, and
interactive visual content. It's user-friendly and doesn't require coding skills.

12. Adobe Illustrator / Adobe XD: While not specific to data visualization, Adobe Illustrator and Adobe
XD offer powerful capabilities for creating custom visualizations and infographics, especially if you're
looking for precise control over design.

These tools offer a range of features and capabilities, so the choice of tool depends on your data, your goals,
your familiarity with programming languages, and the level of interactivity you require in your
visualizations.

Python Libraries for Data Visualization


There are several powerful Python libraries for data visualization that offer a wide range of tools to create
various types of charts, graphs, plots, and interactive visualizations. Some of the most popular libraries
include:

1. Matplotlib: This is a fundamental plotting library that provides a wide variety of 2D and 3D plots. It
serves as the foundation for many other visualization libraries and provides a lot of customization options.

2. Seaborn: Built on top of Matplotlib, Seaborn offers a higher-level interface for creating attractive and
informative statistical graphics. It simplifies the creation of complex visualizations like heatmaps, violin
plots, and pair plots.

3. Pandas Visualization: The Pandas library, used for data manipulation and analysis, also provides simple
plotting methods that integrate well with Pandas DataFrames and Series.

4. Plotly: This library allows you to create interactive, web-based visualizations. It supports a variety of
chart types and can be used both as an offline library and in online environments.

5. Bokeh: Another interactive visualization library, Bokeh focuses on creating interactive and dynamic
visualizations in web browsers. It's particularly useful for building interactive dashboards.

6. Altair: Altair is a declarative statistical visualization library that allows you to create a wide range of
visualizations using a concise and intuitive syntax. It's based on the Vega-Lite visualization grammar.

7. ggplot: Inspired by R's ggplot2, this library provides a high-level interface for creating complex and
aesthetically pleasing visualizations. It's built on top of Matplotlib and provides a more structured approach
to plotting.

8. Holoviews: Holoviews is designed to simplify the process of creating complex visualizations by allowing
you to work with data at a higher semantic level. It's particularly useful for creating dynamic visualizations.

9. WordCloud: If you need to visualize textual data, WordCloud is a library that helps you generate visually
appealing word clouds from text data.

10. NetworkX: If you're working with network or graph data, NetworkX is a library that provides tools for
the creation, manipulation, and visualization of complex networks.

These libraries cater to different levels of complexity, interactivity, and customization. Your choice of
library will depend on your specific needs and preferences. It's also worth noting that some of these libraries
can be used in conjunction with each other to achieve specific visualization goals.

The main point we would like to make is that the way variables are visualized should always be adapted to
the variable types; for example, qualitative data should be plotted differently from quantitative data.

 Importing, Summarizing, and Visualizing Data using Python:

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

Before doing the above, we have to install the required libraries from the command prompt:

pip install matplotlib
pip install pandas
pip install numpy

 Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in
Python.
 matplotlib.pyplot is a collection of functions that make matplotlib work like MATLAB. Each pyplot
function makes some change to a figure: e.g., creates a figure, creates a plotting area in a figure, plots
some lines in a plotting area, decorates the plot with labels, etc.

Data encodings
 Encoding is the process of converting data or a given sequence of characters, symbols, letters, etc.,
into a specified format for the secure transmission of data.
 Decoding is the reverse process of encoding which is to extract the information from the converted
format.
 Data encoding refers to the process of converting information from one format or representation into
another format so that it can be easily stored, transmitted, or processed by computer systems.
Encoding is essential in various areas, including computer science, communication, and data storage.
It's used to ensure that data is correctly interpreted and preserved throughout its journey from one
point to another.
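A minimal Python illustration of this round trip, encoding text to bytes and decoding it back:

```python
text = "data"
encoded = text.encode("utf-8")     # str -> bytes, ready for storage/transmission
decoded = encoded.decode("utf-8")  # bytes -> str, the reverse process
print(encoded)          # b'data'
print(decoded == text)  # True
```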

Encoding Techniques

Data encoding is a crucial aspect of data science and machine learning that involves transforming
categorical or textual data into a numerical format that can be effectively used by algorithms for analysis
and modelling. Since many machine learning algorithms require numerical input, encoding is necessary
to handle features that are not inherently numeric, such as categories or textual data.

There are several common techniques for data encoding:

1. Label Encoding: This involves assigning a unique integer value to each category. While it's simple,
it might not be suitable for algorithms that assume ordinal relationships between categories, as the
numerical values don't hold any inherent meaning.
2. One-Hot Encoding: In this method, each category is converted into a binary vector. Each bit in the
vector corresponds to a category, and the bit for the respective category is set to 1 while the others
are set to 0. This avoids ordinal assumptions and works well for nominal categorical data. In this
technique for N categories in a variable, it uses N binary variables.
3. Dummy Encoding: Dummy coding scheme is similar to one-hot encoding. This categorical data
encoding method transforms the categorical variable into a set of binary variables (also known as
dummy variables). The dummy encoding is a small improvement over one-hot-encoding. Dummy
encoding uses N-1 features to represent N labels/categories.
4. Effect Encoding: This encoding technique is also known as Deviation Encoding or Sum Encoding.
Effect encoding is almost similar to dummy encoding, with a little difference. In dummy coding, we
use 0 and 1 to represent the data but in effect encoding, we use three values i.e. 1, 0, and -1. The row
containing only 0s in dummy encoding is encoded as -1 in effect encoding.
5. Ordinal Encoding: This is used when categories have an inherent order. The categories are mapped
to integers in a way that preserves the order. This method is suitable for data where the categories
have a clear ranking, like "low," "medium," and "high."
6. Binary Encoding: This combines aspects of one-hot encoding and binary representation. Each
category is assigned a unique binary pattern, and these patterns are used as features. This can help
manage memory and computational resources when there are many categories. This technique
involves first converting the categories to integers and then encoding those integers into binary code.
Each binary digit forms a separate feature.
7. Hash Encoder
Hash encoders are suitable for categorical variables with a large number of levels. It has a lot of
different compromises, but scales extremely well. The user specifies the number of binary output
columns that they want as output.
The central part of the hashing encoder is the hash function, which maps the value of a category into
a number.

For example, a (bad) hash function might map each letter to its position in the alphabet ("a" = 1, "b" = 2,
and so on) and sum the values of the label's letters:

hash("critic") = 3 + 18 + 9 + 20 + 9 + 3 = 62 # bad hash function

Because we are not memorizing the different levels, it deals with new levels gracefully. If we want
4 binary features, we can write the value in binary and select the lowest 4 bits. For example,
hash("critic") = 62, and in binary 62 = 0b111110; taking the lowest 4 bits gives the
values 1, 1, 1, 0.
8. Frequency Encoding: Categories are replaced with their frequency of occurrence in the dataset. This
can be useful when the frequency of a category might be indicative of its importance.
9. Target Encoding (Mean Encoding): For each category, the target variable's mean or some other
statistical measure is calculated and used as the encoded value. This can be powerful for categorical
features with predictive power.
10. Embedding: Commonly used in natural language processing (NLP), this involves representing
categorical values as dense vectors in a continuous space. Word embeddings like Word2Vec and
GloVe are examples of this technique.
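Several of the techniques above can be sketched with plain pandas and Python; the toy 'size' and 'colour' columns here are invented for illustration, and the letter-sum hash mirrors the deliberately bad hash function described under the hash encoder:

```python
import pandas as pd

df = pd.DataFrame({'size': ['low', 'high', 'medium', 'low'],
                   'colour': ['red', 'blue', 'red', 'green']})

# Label encoding: one integer code per category
df['size_label'] = df['size'].astype('category').cat.codes

# One-hot encoding: N binary columns for N categories
one_hot = pd.get_dummies(df['colour'], prefix='colour')

# Dummy encoding: N-1 columns, dropping the first category
dummy = pd.get_dummies(df['colour'], prefix='colour', drop_first=True)

# Ordinal encoding: map categories to integers that preserve their order
df['size_ordinal'] = df['size'].map({'low': 0, 'medium': 1, 'high': 2})

# Frequency encoding: replace each category by its count in the data
df['colour_freq'] = df['colour'].map(df['colour'].value_counts())

# Target (mean) encoding against a toy 0/1 target column
df['target'] = [1, 0, 1, 0]
df['colour_target'] = df['colour'].map(df.groupby('colour')['target'].mean())

# The deliberately bad letter-sum hash, kept to 4 binary bits
def bad_hash(word, bits=4):
    total = sum(ord(c) - ord('a') + 1 for c in word)  # a=1, b=2, ...
    return [(total >> i) & 1 for i in range(bits - 1, -1, -1)]

print(bad_hash('critic'))  # 62 = 0b111110 -> lowest 4 bits [1, 1, 1, 0]
```

In practice, libraries such as scikit-learn (`LabelEncoder`, `OneHotEncoder`, `OrdinalEncoder`) and the category_encoders package provide production-ready versions of these transforms.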

It's important to note that the choice of encoding method depends on the nature of the data, the algorithm
being used, and the specific problem at hand. Each method has its own advantages and limitations.
Additionally, it's essential to handle encoding consistently across training and testing data to ensure
accurate model performance.

In Python, libraries like scikit-learn, pandas, and TensorFlow provide functions and tools to perform various
types of data encoding efficiently. Always consider the characteristics of your data and the requirements of
your machine learning algorithm when selecting an encoding method.

10 key terms in data visualisation

1. Format

Interactive visualisations allow you to modify, manipulate and explore a computer-based display of data.
The vast majority of interactive visualisations are found on websites but increasingly might also exist
within apps on tablets and smartphones. By contrast, a static visualisation displays a single,
non-interactive display of data, often with the aim for it to be viewed in print as well as on a screen.

2. Chart type

Charts are individual visual representations of data. There are many ways of
representing your data, using different marks, shapes and layouts: these are all
called types of charts. Some chart types you might be familiar with, such as the
bar chart, pie chart or line chart, whilst others may be new to you, like the Sankey diagram, tree map or
choropleth map. See the section called ‘Taking time with visualisation’ for more on chart types.

3. Dataset

A dataset is a collection of data upon which a visualisation is based. It is useful to think of a dataset as
taking the form of a table with rows and columns, usually existing in a spreadsheet or database. The rows
are the records – instances of things – and the columns are the variables – details about the things.
Datasets are visualised in order to ‘see’ the size, patterns and relationships that are otherwise hard to
observe.

4. Data source

When visualizers want to show you where the data or information comes from, they will include it in the
visualisation. Sometimes it appears near the title or the bottom of the page. Other times, if the visualisation
comes with an article, you can find it in the accompanying text.

5. Axis
Many types of chart have axes. These are the lines that go up and down (the
vertical Y axis), or left and right (the horizontal X axis), providing a reference
for reading the height or position of data values. Axes are the place where
you will usually see the scale (see below) providing a stable reference point
against which you form your reading of the chart.

6. Scale
Scales are marks on a visualisation that tell you the range of values of data
that is presented. Scales are often presented as intervals (10, 20, 30 etc.) and
will represent units of measurement, such as prices, distances, years, or
percentages.

7. Legend
Many charts will use different visual properties such as colours, shapes or
sizes to represent different values of data. A legend or key tells you what
these associations mean and therefore helps you to read the meaning from
the chart.

8. Variables
Variables are the different items of data held about a ‘thing’, for
example it might be the name, date of birth, gender and salary of an
employee. There are different types of variables, including
quantitative (e.g. salary), categorical (e.g. gender), others are
qualitative or text-based (e.g. name). A chart plots the relationship
between different variables. For example, the bar chart to the right
might show the number of staff (height of bar), by department
(different clusters) broken down by gender (different colours).

9. Outliers
Outliers are those points of data that are outside the normal range of data
in some way. Visualisations can often help to identify patterns in the data
– in the example on the right, the higher the number on the x axis, the greater the number on the y
axis. Sometimes individual bits of data don’t fit in to the pattern, like the orange dot here; those are
the outliers.

10. Input area


Input areas allow you to enter information into a visualisation, maybe
to search for certain names or places, or to input information about
yourself that will be used in the visualisation.

Data encoding in data visualization refers to the process of mapping the attributes or characteristics of data to visual elements in a way that effectively communicates information. In other words, it's about deciding how to represent the different aspects of your data using visual properties such as position, size, colour, shape, and texture. Proper data encoding ensures that the visual representation accurately conveys the underlying information and insights to the viewer.

Visual encodings
Visual encoding refers to the process of representing data and information using visual elements, such as
shapes, colours, sizes, and positions, in order to create meaningful visual representations that can be easily
interpreted by humans. It's a fundamental concept in data visualization and information design, as it allows
complex and abstract data to be transformed into visual forms that convey insights, patterns, and
relationships.

Effective visual encoding involves choosing appropriate visual variables and mapping data attributes to
these variables in a way that accurately communicates the intended message. Here are some key visual
variables often used in visual encoding:

1. Position: The location of visual elements on a two-dimensional plane can be used to encode data values.
For example, on a bar chart, the height of the bars along an axis corresponds to the data values.

2. Length or Size: The length or size of visual elements, such as bars or circles, can represent data values.
Larger lengths or sizes generally correspond to larger data values.

3. Shape: Different shapes can be used to represent categories or data values. For instance, different types of
symbols can be used on a scatter plot to differentiate between different data points.

4. Colour: Colours can be used to represent categories or to encode quantitative values. Colour intensity,
hue, and saturation can all be manipulated to convey different meanings.

5. Texture or Pattern: Different textures or patterns can be used to distinguish between categories or data
points. However, this method should be used carefully, as it can sometimes introduce confusion or difficulty
in interpretation.

6. Orientation: The orientation of visual elements, such as bars or lines, can also encode data values. For
example, the angle of a line can represent a specific measurement.

7. Opacity or Transparency: Varying the transparency of visual elements can be used to show overlapping
data points or to indicate density in certain regions.
8. Connection or Link: Connecting lines or arrows can be used to show relationships or flows between
different data points or entities.
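As a sketch of how several of these variables can be combined in one chart, the following matplotlib example (with made-up city data) encodes position with the x/y axes, a quantity with marker size, and a category with colour:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Hypothetical dataset: four cities with a population and a region category
cities = ["A", "B", "C", "D"]
x = [1, 2, 3, 4]              # position: longitude-like value
y = [3, 1, 4, 2]              # position: latitude-like value
population = [10, 40, 25, 60]
region = ["north", "south", "north", "south"]

# Retinal channels: size encodes population, colour encodes region
sizes = [p * 5 for p in population]
palette = {"north": "tab:blue", "south": "tab:red"}
colours = [palette[r] for r in region]

fig, ax = plt.subplots()
ax.scatter(x, y, s=sizes, c=colours)
for cx, cy, name in zip(x, y, cities):
    ax.annotate(name, (cx, cy))
fig.savefig("encoding_demo.png")
```

Three variables (two positional, one size, one colour) are read from a single mark per city, which is exactly the point of combining visual channels.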

Effective visual encoding depends on understanding the characteristics of the data and choosing the
appropriate visual variables that best convey the intended message without introducing ambiguity. It's also
important to consider principles of visual perception, such as pre-attentive attributes, which are features that
the human visual system can quickly perceive, aiding in the rapid identification of patterns and anomalies.

Overall, visual encoding is a powerful tool for transforming data into visual forms that facilitate better
understanding, decision-making, and communication of complex information.

 Visual encoding is the way in which data is mapped into visual structures, from which we build the images on a screen.

 There are two types of visual encoding variables:
o Planar
o Retinal.
 Planar variables are familiar to everybody: if you have studied maths, you have drawn graphs along the X- and Y-axes. Planar variables work for any data type and are especially effective for presenting quantitative data.
 It’s a pity that we have to deal with flat screens and therefore just two planar variables. We can try to use a Z-axis, but in the vast majority of cases 3D charts are hard to read on screen.
 Humans are sensitive to the retinal variables. They easily differentiate between various colours,
shapes, sizes and other properties.

Retinal variables
 Retinal variables were introduced by Bertin in 1967, and the concept has become quite popular. While there is some critique about the effectiveness of retinal variables, most specialists find them useful.
 Visual implantations are encoded using retinal variables, and each retinal variable takes specific visual parameters.
 For example, a point visual implantation can be encoded using the shape of a hollow circle and the
colour blue. A line can be encoded using a solid pattern of thick size and green colour. An area can
be encoded using a 20% transparent red colour and thin line borders.
 Bertin (1967) describes visual implantations as dimensionless elements with underlying coordinates
that necessitate the encoding of retinal variables in order to become visually informative. He
identifies six fundamental retinal variables:
1. Colour hue (e.g. blue, green, magenta).
2. Colour value (lightness vs. darkness).
3. Size (e.g. large, small, thick, thin).
4. Shape (e.g. circle, rectangle, diamond).
5. Orientation (e.g. angle, degrees).
6. Texture (e.g. dashed lines, polka dots).

 The eye is independently sensitive to these retinal variables, which means that more than one retinal
variable can be deployed at the same time in order to encode different variation in the data.
 The following is a figure from Bertin describing the implementation of retinal variables in
conjunction with visual implantations:

 Retinal variables encode visual implantations (points, lines, areas) and can be used to represent differences (≠), similarities (≡), quantities (Q), or order (O). For example,
representing differences between the categories of Yes and No when responding to a survey is best
done with the retinal variables of colour, orientation and then shape (notice the varying size of the
unequal symbol ≠ indicating difference in perceptual understanding). Representing differences
between Yes and No using the retinal variable of size would be wrong, e.g. large circles for Yes and
small circles for No, because then our perceptual system would be led to believe that Yes is more
important than No.
 There is a seventh retinal variable, not listed by Jacques Bertin: motion (animation). We can detect motion and perceive differences in speed as an independent piece of information, by comparison to the other six retinal variables listed above.
 This set of seven retinal variables are today referred to as ‘visual channels’, and the visual implantations are referred to as ‘marks’, for example in Tamara Munzner’s (2014) excellent Visualization Analysis and Design.
 Since the Graph Workflow has been inspired directly from the work of Jacques Bertin, I prefer to use
his original terminology and classification, as translated by Howard Wainer.
 Recall the incomplete case study considered in the discussion of visual implantations. Here is that incomplete graph again:

 This case study demonstrates how to move from the point, to the line and finally to the area implantation, each time encoding an additional piece of information. But using points, lines and areas alone can only get us this far: three pieces of information.
 To encode additional information we need to consider encoding variations in points, lines and areas, and this is where retinal variables come into play. We are unable to distinguish between prices falling or rising within a day because all the points, both lines and both areas are encoded with identical retinal variables.
 Specifically, the left-hand side graph above uses point implantations to encode the position of each
price, but is very much uninformative because it does not distinguish between close, open, high and
low prices. The retinal variable of shape can help distinguish between these categories, e.g. by
showing the open price as an open circle, the close price as a filled circle, the low price as an open
square and the high price as a filled square. This solution is shown in the left hand-side graph below.
It is an improvement but still lacks considerable decoding efficiency because of the excessive mental
effort required to comprehend and remember the role of every symbol.
 The middle graph above uses the point implantation as well as the line implantation to encode the
connection between open-close and low-high, but is also uninformative because the direction of the
two lines is the same and it is impossible to distinguish the length of each line. The retinal variable of
size could help by placing a larger weight (width) in one of the lines, so that it appears thicker than
the other. The point implantations now become redundant as the thicker lines can also help decode
position. This solution is shown in the middle graph below.
 The right-hand side graph above uses the area implantation to encode the range of open-close prices
which is an important piece of information in this context, as the difference between the closing price
and opening price gives the return earned (if positive) or suffered (if negative). In fact, the thicker
line solution in the middle graph already resembles a rectangle area. So, we could make this
design step more explicit and replace the thicker line that encodes the connection between open-close
with a more obvious rectangle area. This shift to area gives the opportunity to employ the colour retinal variable, colouring areas that indicate an increase navy and areas that indicate a decrease red.

 Indeed, the right hand-side graph is a well-known solution in financial technical analysis, and is used
by high frequency traders and robotic algorithms to make quick decisions about selling or buying
securities. This is known as the candlestick chart.
 The point of this case study is to demonstrate how one can develop any data graph by starting by
encoding the point visual implantation on a coordinate system, then move to lines and areas, and then
condition the visual implantations using retinal variables in order to encode additional information.
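The candlestick construction described above can be sketched in matplotlib, building it up exactly as the case study does: a line implantation for the low-high range, an area (rectangle) for the open-close body, and the colour retinal variable for rise versus fall. The prices here are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen
import matplotlib.pyplot as plt

# Hypothetical daily prices as (open, high, low, close)
days = [1, 2, 3]
ohlc = [(100, 108, 98, 105), (105, 106, 99, 101), (101, 110, 100, 109)]

# Colour retinal variable: navy for a rise (close >= open), red for a fall
colours = ["navy" if c >= o else "red" for (o, h, l, c) in ohlc]

fig, ax = plt.subplots()
for day, (o, h, l, c), colour in zip(days, ohlc, colours):
    ax.vlines(day, l, h, colors=colour)                            # line: low-high range
    ax.bar(day, height=c - o, bottom=o, width=0.6, color=colour)   # area: open-close body
fig.savefig("candlestick.png")
```

Note that `bar` happily draws a negative height, so falling days produce a body hanging below the open price without any special casing.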

Mapping variables to encodings


Mapping variables to encodings involves converting categorical or symbolic variables into numerical
representations, known as encodings or embeddings. In various machine learning and deep learning tasks,
algorithms usually require numerical input data. However, real-world data often contains categorical
features (variables) that are not directly usable in their raw form. Mapping these categorical variables to
numerical encodings is essential for feeding them into machine learning models.

The process of mapping variables to encodings involves assigning unique numerical values to different
categories or labels present in the categorical variables. This transformation enables the model to capture
relationships and patterns within the data, as numerical representations can be processed mathematically.

Here's an example to illustrate the concept:

Suppose you have a dataset of animals with a categorical variable "Animal Type" that includes labels like
"Dog," "Cat," and "Bird." To use this variable in a machine learning model, you need to map these labels to
numerical encodings:

- Dog: 1
- Cat: 2
- Bird: 3

This way, the categorical variable "Animal Type" is transformed into numerical encodings that the model
can work with.
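In pandas, this mapping can be written directly (the column name and values follow the toy example above):

```python
import pandas as pd

animals = pd.DataFrame({"Animal Type": ["Dog", "Cat", "Bird", "Dog"]})
mapping = {"Dog": 1, "Cat": 2, "Bird": 3}
animals["Animal Code"] = animals["Animal Type"].map(mapping)
print(animals["Animal Code"].tolist())  # [1, 2, 3, 1]
```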

Different encoding techniques, as mentioned in the previous response, can be used for this purpose. These
techniques provide strategies for converting categorical variables into numerical values, and the choice of
technique depends on factors such as the nature of the data, the type of model you're using, and the specific
goals of your analysis.

In summary, mapping variables to encodings is a crucial pre-processing step in preparing categorical data
for machine learning algorithms. It allows algorithms to work with a wider range of data and extract
meaningful insights and patterns from the information present in categorical variables.

Here are some common mapping techniques for data encoding, especially for handling categorical variables
in machine learning and data analysis:

1. Label Encoding:

Assign a unique integer to each category in the categorical variable. This is suitable when there's an
inherent ordinal relationship between the categories.
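For example, with scikit-learn's LabelEncoder (note the caveat that it orders classes alphabetically, so the resulting integers do not automatically reflect a meaningful order):

```python
from sklearn.preprocessing import LabelEncoder

sizes = ["small", "medium", "large", "medium"]
encoder = LabelEncoder()
codes = encoder.fit_transform(sizes)

# LabelEncoder sorts classes alphabetically, not by meaning,
# so it does not respect the small < medium < large ordering by itself.
print(list(encoder.classes_))  # ['large', 'medium', 'small']
print(list(codes))             # [2, 1, 0, 1]
```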

2. One-Hot Encoding:

Create binary columns (also known as dummy variables) for each category. Each column represents a
category, and a value of 1 or 0 indicates if the original data point belongs to that category. This is useful
when categories have no inherent order.
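A minimal sketch with pandas (the column name is invented for the example):

```python
import pandas as pd

df = pd.DataFrame({"animal": ["Dog", "Cat", "Bird"]})
# One binary column per category; cast to int for 0/1 values across pandas versions
dummies = pd.get_dummies(df["animal"], prefix="animal").astype(int)
print(dummies.columns.tolist())        # ['animal_Bird', 'animal_Cat', 'animal_Dog']
print(dummies["animal_Dog"].tolist())  # [1, 0, 0]
```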

3. Ordinal Encoding:

Assign integer values to categories based on their order. This is suitable when categories have a clear and
meaningful ordinal relationship.
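Because the order is domain knowledge rather than something the data can infer, it is usually spelled out explicitly, for example:

```python
import pandas as pd

df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})
order = {"small": 0, "medium": 1, "large": 2}  # the meaningful order, stated explicitly
df["size_code"] = df["size"].map(order)
print(df["size_code"].tolist())  # [0, 2, 1, 0]
```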

4. Binary Encoding:

Convert each integer value to its binary representation and create binary columns from those bits. This is useful for variables with large numbers of categories, since it needs far fewer columns than one-hot encoding.
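A minimal pure-Python sketch of the idea (label-encode first, then split each integer into its bits; the category names are made up):

```python
# Five categories need only 3 binary columns instead of 5 one-hot columns
categories = ["red", "green", "blue", "yellow", "purple"]
labels = {c: i for i, c in enumerate(categories)}   # red=0 ... purple=4

n_bits = max(labels.values()).bit_length()          # 3 bits cover 5 categories

def to_bits(category):
    """Return the label's bits, most significant first."""
    code = labels[category]
    return [(code >> b) & 1 for b in reversed(range(n_bits))]

print(to_bits("purple"))  # [1, 0, 0]
print(to_bits("blue"))    # [0, 1, 0]
```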

5. Feature Hashing:

Hash the categorical values into a fixed number of dimensions. While it can be memory-efficient, hash
collisions could occur where different categories are hashed to the same value.
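A stdlib-only sketch of the idea; real implementations such as scikit-learn's FeatureHasher also handle signs and sparse output, but the core is just a stable hash modulo the number of columns:

```python
import hashlib

def hash_bucket(value, n_features=8):
    """Map a category to one of n_features columns via a stable hash (sketch)."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_features

# Unrelated categories can land in the same bucket - that is the trade-off
buckets = {v: hash_bucket(v) for v in ["Dog", "Cat", "Bird"]}
print(buckets)
```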

6. Target Encoding (Mean Encoding):

Replace categorical values with the mean of the target variable for that category. This can be useful in
classification tasks but may lead to overfitting if not handled carefully.
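A sketch with pandas on toy data; in practice the means should be computed on training folds only, precisely to limit the overfitting mentioned above:

```python
import pandas as pd

df = pd.DataFrame({
    "city":   ["A", "A", "B", "B", "B"],
    "target": [1, 0, 1, 1, 0],
})
# Replace each city with the mean target for that city: A -> 0.5, B -> 2/3
df["city_encoded"] = df.groupby("city")["target"].transform("mean")
print(df["city_encoded"].round(3).tolist())  # [0.5, 0.5, 0.667, 0.667, 0.667]
```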

7. Frequency Encoding (Count Encoding):

Replace categorical values with their frequency or count in the dataset. This can provide information about
the popularity of categories.
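A pandas sketch (toy values):

```python
import pandas as pd

s = pd.Series(["Dog", "Cat", "Dog", "Bird", "Dog"])
counts = s.value_counts()  # Dog: 3, Cat: 1, Bird: 1
encoded = s.map(counts)    # replace each value with its count
print(encoded.tolist())    # [3, 1, 3, 1, 3]
```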

8. Entity Embeddings:

Used in deep learning, this technique learns continuous vector representations (embeddings) for categorical values. These embeddings capture relationships between categories.
9. Leave-One-Out Encoding:

Similar to target encoding, but instead of the mean, you use the mean of the target variable excluding the
current data point.
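This can be computed with a groupby trick: subtract the current row's target from the category total before averaging (toy data for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "city":   ["A", "A", "A", "B", "B"],
    "target": [1, 0, 1, 1, 0],
})
grp = df.groupby("city")["target"]
total = grp.transform("sum")
count = grp.transform("count")
# Mean of the target over the other rows in the same category
df["city_loo"] = (total - df["target"]) / (count - 1)
print(df["city_loo"].tolist())  # [0.5, 1.0, 0.5, 0.0, 1.0]
```

Excluding the current row keeps each row's encoding from leaking its own target value, which is the advantage over plain target encoding.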

10. Helmert Coding:

Contrast coding where each level of a categorical variable is compared to the mean of the subsequent
levels.

11. Sum Encoding (Deviation Encoding):

Compare each level to the overall mean of the categorical variable.

12. Backward Difference Encoding:

Each level of the categorical variable is compared with the mean of the prior adjacent level.

The choice of encoding technique depends on the type of categorical variable, the nature of the data, and the
algorithm you intend to use. It's also important to consider potential issues such as handling new categories,
avoiding multicollinearity, and the interpretability of the encoded data.
