Module 2e - Data Visualization - Nv
UP Open University
Introduction
In this module, you will learn the different types of visualization, and how to
visualize numerical and non-numerical data.
Visualization is the process of converting raw data into graphical form to make its
characteristics easier to understand and the results of an analysis easier to interpret. With
visualization, we can see trends and patterns, check whether the data is normally distributed,
spot outlier values, identify clusters in the data set, and more. When done correctly,
visualization helps decision makers in data-driven organizations make fast and accurate
decisions.
Visualization can be done at any stage of the data analytics process, such as during
exploratory data analysis, model building, and the presentation of results.
Visualizations serve as a communication tool between data experts and the
public. Graphical presentations of data are easier to comprehend than
rows and columns of data. Visualization also allows multiple presentations of a single
data set: some features can be presented as a bar chart, while other attributes
can be presented as a pie chart.
Visualization can be applied to both numerical and non-numerical data. For our
demonstration, we will use the sample data provided in the Orange software.
Run the Orange Data Mining software, click the Examples widget, and
choose Visualization of Data Sets.
To learn about the data set, double-click the File Widget to see the file
associated with it. As shown in the info sheet, the file is
the Iris Flower Dataset (iris.tab) with five attributes (4 numerical, 1 text) and 150
instances or records.
Double-click the Data Table widget to view the data set, or the Scatter Plot widget to
view the relationship between two of the variables in the data set.
a. Distributions Widget. This tool displays the distribution of the values of a
single variable. Add the Distributions Widget and connect it
from the File Widget. Then, double-click the Distributions Widget to view the frequency
distribution.
The chart below shows the distribution of sepal length for all Iris flowers.
On the horizontal axis, the values of this variable range
from 4 to 8. At the upper right of the chart are the mean sepal length of 5.843
and the standard deviation of 0.825. The fitted distribution is set to Normal with a
bin width of 0.5. You can adjust the bin width according to your preference.
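The binning behind a frequency distribution can be sketched in a few lines of
Python. This is only an illustration and not how Orange itself is implemented: the
function name bin_counts, the starting edge of 4.0, and the short sample list are
assumptions made for the example, and the sample is not the Iris data set.

```python
import math
from collections import Counter


def bin_counts(values, bin_width, start):
    """Count the values falling in each bin [edge, edge + bin_width).

    Returns a dict mapping each bin's left edge to its count.
    Empty bins are omitted.
    """
    counts = Counter()
    for v in values:
        k = math.floor((v - start) / bin_width)  # index of the bin holding v
        counts[k] += 1
    return {start + k * bin_width: counts[k] for k in sorted(counts)}


# Hypothetical sepal lengths, for illustration only
sample = [4.3, 4.9, 5.0, 5.1, 5.4, 5.8, 5.8, 6.1, 6.4, 6.9, 7.7]

histogram = bin_counts(sample, bin_width=0.5, start=4.0)

# Text version of the chart: one '#' per observation in the bin
for left_edge, count in histogram.items():
    print(f"{left_edge:.1f} to {left_edge + 0.5:.1f}: {'#' * count}")
```

Changing bin_width to a smaller value gives more, narrower bars, which is the
same effect as adjusting the bin width in the widget.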
To view the distribution of Sepal width, just select the variable Sepal width to
display its distribution.
To display the distribution of sepal length for each type of flower, click the
Split by drop-down list and select Iris. The result of this action is shown
below:
b. Box plot. This chart shows the summary values of a numeric variable. These summary
values include the first quartile, median, third quartile, mean, and standard deviation. It
can also show box plots of the variable grouped by an associated
categorical variable. The box plot is a graphical approach to univariate exploratory data
analysis.
Using the same data set, click and drag Box Plot to the Canvas and connect it
from the File Widget.
Double click the Box Plot widget to view the chart of the selected variable.
The yellow vertical line at the center, with value 5.8, marks the median, while
5.843 and 0.83 are the mean and standard deviation, respectively. Since
the mean (5.843) is greater than the median (5.8), this suggests that
the variable has a right-skewed distribution (a long right tail). The values 5.1
and 6.4 are the 1st quartile and 3rd quartile of the data set.
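The summary values read off a box plot can also be computed directly. The sketch
below uses Python's standard statistics module on a small hypothetical sample (not
the Iris data set); the function name summarize is an assumption for the example,
and the "inclusive" quartile method is one of several conventions, so other tools
may report slightly different quartiles for the same data.

```python
import statistics


def summarize(values):
    """Return the summary values usually shown on a box plot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample standard deviation
    }


# Hypothetical measurements, for illustration only
sample = [4.6, 5.0, 5.1, 5.4, 5.8, 6.1, 6.4, 6.9, 7.9]

summary = summarize(sample)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")

# Rule of thumb: a mean above the median hints at a longer right tail
if summary["mean"] > summary["median"]:
    print("mean > median: the sample may be right-skewed")
```

Comparing the mean with the median is only a quick check; looking at the chart
itself remains the more reliable way to judge skewness.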
To view the box-plot of the variable sepal length per type of Iris flower, click
Iris in the subgroups.
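Grouping a numeric variable by a categorical one, as the subgroup option does, can
be sketched as follows. The records below are a small illustrative sample rather
than the full data set, and the variable names are assumptions for the example.

```python
import statistics
from collections import defaultdict

# Illustrative (sepal length, iris type) records
records = [
    (5.1, "Iris-setosa"), (4.9, "Iris-setosa"), (5.0, "Iris-setosa"),
    (7.0, "Iris-versicolor"), (6.4, "Iris-versicolor"), (5.5, "Iris-versicolor"),
    (6.3, "Iris-virginica"), (5.8, "Iris-virginica"), (7.1, "Iris-virginica"),
]

# Collect the numeric values under each category
groups = defaultdict(list)
for sepal_length, iris_type in records:
    groups[iris_type].append(sepal_length)

# One summary value per group, as in one box per subgroup
medians = {iris_type: statistics.median(vals) for iris_type, vals in groups.items()}

for iris_type, median in medians.items():
    print(f"{iris_type}: median sepal length = {median}")
```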
c. Line graph. This is used to find trends or patterns in data over a period of
time. It is one of the easiest charts to create and can be done readily in MS
Excel.
Data type: Both x and y axes are quantitative
Example: humidity over time, income over months
d. Pie Chart. This chart is typically used to show the proportion of each value relative to
the total. This is best done using the MS Excel application. For example,
assuming that we have 30 records in our data set, with 13 male respondents
and 17 female respondents, their proportions or percentages can be viewed graphically
using a pie chart as follows:
[Pie chart: M = 13 (43%), F = 17 (57%)]
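The percentages on the pie chart follow from dividing each count by the total. A
minimal sketch of that arithmetic, assuming the labels M and F from the chart legend
stand for the 13 and 17 respondents respectively:

```python
# Counts taken from the example: 30 records in total
counts = {"M": 13, "F": 17}

total = sum(counts.values())

# Share of each category, rounded to the nearest whole percent
percentages = {label: round(100 * n / total) for label, n in counts.items()}

for label, pct in percentages.items():
    print(f"{label}: {counts[label]} of {total} = {pct}%")
```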
e. Scatter plot. This is used to plot the points of two variables on a Cartesian plane and
to see whether there is a pattern or relationship between the values of the two variables.
This relationship can be positive, negative, or absent. The variables on both the x-
and y-axes should be quantitative. The scatter plot is a graphical method for
bivariate or multivariate exploratory data analysis.
In this example, let us visualize the relationship between the petal width and
petal length of iris flower.
Since there are three types of Iris flower (Iris-Setosa, Iris-Versicolor, and Iris-
Virginica), we can check the Show Legend to help us understand where
these points belong. Let us also check the Show Regression Line to view
the linear relationship of the variables.
In our scatter plot, the red, green, and blue dots plot the petal width and
petal length values of Iris-versicolor, Iris-virginica, and Iris-setosa,
respectively.
The different colored lines are the regression lines, shown with the correlation
coefficients (r values), of petal length and petal width for each type of iris flower.
The r values of Iris-virginica, Iris-versicolor, and Iris-setosa are 0.32, 0.79, and 0.31,
respectively. The black line and r = 0.96, on the other hand, represent the overall
regression line and correlation coefficient of petal length and petal width. The value
of r always lies between -1 and +1.
An r value close to 1 indicates a strong positive linear relationship between the
two variables, while an r value close to -1 indicates a strong negative linear
relationship. An r value close to zero means that the linear relationship between
the two variables is weak.
The r value of 0.96 implies that, overall, petal length and petal width have a
strong positive correlation. In other words, the longer the petal, the wider it
tends to be.
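To make the r value and the regression line more concrete, here is a minimal
sketch of how both can be computed from paired values. It uses the standard Pearson
correlation and least-squares formulas; the function names and the five sample
points are assumptions for illustration and are not taken from the Iris data set.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def regression_line(xs, ys):
    """Slope and intercept of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical petal measurements, for illustration only
petal_length = [1.0, 2.0, 3.0, 4.0, 5.0]
petal_width = [0.2, 0.5, 1.0, 1.4, 1.9]

r = pearson_r(petal_length, petal_width)
slope, intercept = regression_line(petal_length, petal_width)
print(f"r = {r:.2f}, line: width = {slope:.2f} * length + {intercept:.2f}")
```

A positive slope together with an r value near 1 corresponds to the upward,
tightly clustered pattern described above.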
Visualization Tools
There are many visualization tools that are available either offline or online. These
include the following:
1. MS Excel. Aside from handling columnar data, this application can also
visualize data and perform statistical analysis.
2. Tableau. This is one of the popular visualization applications these days.
Its public edition can be downloaded and used for free.
3. Orange Data Mining. Orange is machine learning and data mining software
with various features for visualizing data. It also includes various statistical and machine
learning algorithms. It is open source and exposes its functions through interactive
graphical tools.
4. RapidMiner. This is machine learning software that can perform various kinds of data
analysis and visualization. It also has a free version, which allows analysis of up to
10,000 records.
5. Google Charts. This is one of the applications offered by Google for free.
6. Datawrapper. This is an online charting tool for creating charts and maps.
7. Infogram. This is an online visualization tool that allows you to create infographics
and reports.
8. Online Data Visualization Websites
Course code: COMP ED 20
Calag, VB (2021). Introduction to Analytics. Los Baños: University of the Philippines Open University
Assignment 4. 35 pts.