@vtucode - in 21CS644 Module 4 2021 Scheme
DATA SCIENCE AND VISUALIZATION (21CS644)
Questions
Handouts for Session 2: Data Wrangling, Tools and Libraries for Visualization
Data wrangling is the process of transforming raw data into a suitable representation
for various tasks. It is the discipline of augmenting, cleaning, filtering, standardizing,
and enriching data in a way that allows it to be used in a downstream task, which in our
case is data visualization.
Examine the following flow diagram of the data wrangling process to understand how
precise and actionable data is prepared for business analysts to utilize.
For example, employee engagement can be measured based on raw data gathered from
feedback surveys, employee tenure, exit interviews, one-on-one meetings, and so on. This
data is cleaned and made into graphs based on parameters such as referrals, faith in
leadership, and scope of promotions. The percentages, that is, information derived from
the graphs, help us reach our result, which is to determine the measure of employee
engagement.
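The cleaning, filtering, and standardizing steps described above can be sketched with the Python standard library alone. This is a minimal illustration, not a prescribed method: the survey records, the field names, and the yes/no answer format are all hypothetical.

```python
# Hypothetical raw survey records: inconsistent casing, stray spaces, a missing answer.
raw_responses = [
    {"employee": " Asha ", "department": "sales", "trusts_leadership": "Yes"},
    {"employee": "Ravi", "department": "Sales ", "trusts_leadership": "no"},
    {"employee": "Meena", "department": "HR", "trusts_leadership": ""},
    {"employee": "John", "department": "hr", "trusts_leadership": "YES"},
]


def wrangle(records):
    """Clean, filter, and standardize raw survey records."""
    cleaned = []
    for record in records:
        answer = record["trusts_leadership"].strip().lower()
        if answer not in ("yes", "no"):
            continue  # filter: drop records with a missing or invalid answer
        cleaned.append({
            "employee": record["employee"].strip(),              # clean whitespace
            "department": record["department"].strip().upper(),  # standardize labels
            "trusts_leadership": answer == "yes",                # standardize to bool
        })
    return cleaned


def percent_trusting(records):
    """Derive the percentage of respondents who trust leadership."""
    if not records:
        return 0.0
    return 100 * sum(r["trusts_leadership"] for r in records) / len(records)


clean_responses = wrangle(raw_responses)
trust_percentage = percent_trusting(clean_responses)
print(clean_responses)
print(f"{trust_percentage:.1f}% trust leadership")
```

The derived percentage is the kind of value that would then be passed on to a chart.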
• Several tools are available for creating data visualizations to suit different needs.
• Non-coding tools like Tableau provide an intuitive interface for exploring and
understanding data.
• Alongside Python, MATLAB and R are also commonly used in data analytics.
• Python stands out as the industry's preferred language due to its user-friendly
nature and efficiency in data manipulation and visualization.
• Its extensive library ecosystem further enhances Python's appeal, making it the
optimal choice for robust data visualization tasks.
Questions:
3. With a neat diagram explain the steps involved in the Data Wrangling process.
Handouts for Session 3: Comparison Plots: Line Chart, Bar Chart and Radar Chart
• Comparison plots include charts that are ideal for comparing multiple variables or
variables over time.
• Line charts are great for visualizing variables over time.
• For comparison among items, bar charts (also called column charts) are the best way to
go. For a short time series (say, fewer than 10 time points), vertical bar charts can be
used as well.
• Radar charts or spider plots are great for visualizing multiple variables for multiple
groups.
1. Line Chart
• Line charts are used to display quantitative values over a continuous time period
and show information as a series.
• A line chart is ideal for a time series that is connected by straight-line segments.
• The value being measured is placed on the y-axis, while the x-axis is the timescale.
Uses
✓ Line charts are great for comparing multiple variables and visualizing trends for
both single as well as multiple variables, especially if your dataset has many time
periods (more than 10).
✓ For smaller time periods, vertical bar charts might be the better choice.
Example 1: The following diagram shows a trend of real estate prices (in millions of
US dollars) across two decades. Line charts are ideal for showing data trends:
2. Bar Charts
• In a bar chart, the bar length encodes the value. There are two variants of bar charts:
vertical bar charts and horizontal bar charts.
Uses
• While they are both used to compare numerical values across categories, vertical
bar charts are sometimes used to show a single variable over time.
Example 1:
The following diagram shows a vertical bar chart. The bars show the marks out of
100 that five students obtained in a test:
The following diagram shows a horizontal bar chart. The bars show the marks out
of 100 that five students obtained in a test:
Example 2:
The following graph compares movie ratings using two scores: the Tomatometer,
representing the percentage of approved critics who gave the movie a positive review,
and the Audience Score, representing the percentage of users who rated it 3.5 or
higher out of 5. Notably, The
Martian has high scores on both metrics. The Hobbit: An Unexpected Journey has a
high Audience Score despite a lower Tomatometer score, likely due to its large fan
base.
Design Practices
1. When creating bar charts, ensure the numerical axis starts at zero to avoid
misleading representations.
2. Use horizontal labels if the chart isn't too cluttered.
3. If space is limited, rotate the labels at different angles, as seen on the x-axis of the
preceding diagram.
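The effect behind design practice 1 can be shown with a little arithmetic. The sketch below compares how much taller one bar looks than another when the axis starts at zero and when it is truncated. The marks and the truncated starting value of 85 are hypothetical.

```python
def apparent_ratio(value_a, value_b, axis_start=0.0):
    """Return how many times taller bar B looks than bar A for a given axis start."""
    height_a = value_a - axis_start
    height_b = value_b - axis_start
    if height_a <= 0 or height_b <= 0:
        raise ValueError("axis_start must be below both values")
    return height_b / height_a


# Hypothetical marks of two students.
marks_a, marks_b = 90, 95

honest = apparent_ratio(marks_a, marks_b, axis_start=0)      # axis starts at zero
truncated = apparent_ratio(marks_a, marks_b, axis_start=85)  # axis starts at 85

print(f"ratio with zero-based axis: {honest:.2f}")
print(f"ratio with truncated axis:  {truncated:.2f}")
```

With the truncated axis, a difference of about 6% is drawn as one bar being twice as tall as the other, which is why the numerical axis should start at zero.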
3. Radar Charts
• Radar charts (also known as spider or web charts) visualize multiple variables with each
variable plotted on its own axis, resulting in a polygon.
• All axes are arranged radially, starting at the center with equal distances between one
another, and have the same scale.
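The geometry described above (equally spaced axes that start at the center and share one scale) can be sketched as follows. This is only an illustration of where the polygon corners fall. The subjects, the marks, and the choice to place the first axis at angle zero are assumptions.

```python
import math


def radar_vertices(values, max_value):
    """Return the (x, y) polygon vertices of a radar chart.

    Axes are spaced at equal angles, start at the center, and share one
    scale, so every value is divided by the same max_value.
    """
    n = len(values)
    vertices = []
    for i, value in enumerate(values):
        angle = 2 * math.pi * i / n   # equal angular distance between axes
        radius = value / max_value    # same scale on every axis
        vertices.append((radius * math.cos(angle), radius * math.sin(angle)))
    return vertices


# Hypothetical marks (out of 100) of one student in four subjects.
marks = {"Math": 80, "Physics": 60, "English": 100, "History": 40}
polygon = radar_vertices(list(marks.values()), max_value=100)
for subject, (x, y) in zip(marks, polygon):
    print(f"{subject:8s} x={x:+.2f} y={y:+.2f}")
```

Joining the vertices in order, and closing the shape back to the first one, gives the polygon of the radar chart.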
Uses
✓ Radar charts are great for comparing multiple quantitative variables for a single
group or multiple groups.
✓ They are also useful for showing which variables score high or low within a dataset,
making them ideal for visualizing performance.
Example 1:
The following diagram shows a radar chart for a single variable. This chart displays
data about a student scoring marks in different subjects:
Example 3:
The following diagram shows a radar chart for multiple variables/groups. Each chart
displays data about a student's performance in different subjects:
Design Practices
1. Try to display 10 factors or fewer on a single radar chart to make it easier to read.
2. Use faceting (displaying each variable in a separate plot) for multiple variables/
groups, as shown in the preceding diagram, in order to maintain clarity.
Questions:
2. Explain what Line, Bar and Radar charts are. Also explain their uses and design practices
with examples.
Handouts for Session 4: Relation Plots: Scatter Plot, Bubble Plot, Correlogram and
Heatmap
A bubble plot, which is a variation of the scatter plot, is an excellent tool for visualizing
the correlation of a third variable.
1. Scatter Plot
Examples
The following diagram shows a scatter plot of the height and weight of persons belonging
to a single group:
Scatter plot with a single group
The following diagram shows the same data as in the previous plot but differentiates between
groups. In this case, we have different groups: A, B, and C:
Design Practices
In addition to the scatter plot, which visualizes the correlation between two numerical
variables, you can plot the marginal distribution for each variable in the form of
histograms to give better insight into how each variable is distributed.
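The correlation that a scatter plot shows visually can also be expressed as a number. The handout does not name a specific measure. The sketch below assumes the Pearson correlation coefficient, which is one common choice, and uses hypothetical height and weight values.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical heights (cm) and weights (kg) of five people.
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 69, 77, 90]

r = pearson_r(heights, weights)
print(f"Pearson r = {r:.3f}")
```

A value close to +1 corresponds to points lying near a rising straight line, a value close to -1 to a falling line, and a value near 0 to no linear relationship.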
Example
The following diagram shows the correlation between body mass and the maximum
longevity for animals in the Aves class. The marginal histograms are also shown, which
help give better insight into both variables:
Correlation between body mass and maximum longevity of the Aves class with
marginal histograms
2. Bubble Plot
• A bubble plot extends a scatter plot by introducing a third numerical variable.
• The value of the variable is represented by the size of the dots.
• The area of the dots is proportional to the value.
• A legend is used to link the size of the dot to an actual numerical value.
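Because it is the area of a dot, not its radius, that is proportional to the value, the radius grows with the square root of the value. The following sketch shows the conversion. The weights and the `area_per_unit` scale factor are hypothetical.

```python
import math


def bubble_radius(value, area_per_unit=1.0):
    """Radius of a bubble whose AREA (not radius) is proportional to value."""
    if value < 0:
        raise ValueError("value must be non-negative")
    area = value * area_per_unit
    return math.sqrt(area / math.pi)


# Hypothetical weights (kg) used as the third variable.
weights = [40, 80, 160]
radii = [bubble_radius(w) for w in weights]

for w, r in zip(weights, radii):
    print(f"weight={w:3d}  radius={r:.2f}  area={math.pi * r ** 2:.1f}")
```

Doubling the value doubles the area but multiplies the radius only by the square root of 2. In Matplotlib, the `s` argument of `scatter` is already given as an area (in points squared), so scaled values can be passed to it directly.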
Uses
Example
The following diagram shows a bubble plot of the height and age of humans, with the
weight of each person represented by the size of the bubble:
Bubble plot showing the relation between height and age of humans
Design Practices
1. The design practices for the scatter plot are also applicable to the bubble plot.
2. Don't use bubble plots for very large amounts of data, since too many bubbles make
the chart difficult to read.
3. Correlogram
• A correlogram is a combination of scatter plots and histograms.
• A correlogram or correlation matrix visualizes the relationship between each
pair of numerical variables using a scatter plot.
• The diagonals of the correlation matrix represent the distribution of each variable
in the form of a histogram.
• Different colors can also be used to plot the relationship between multiple groups
or categories.
• A correlogram is a great chart for exploratory data analysis to get a feel for your
data, especially the correlation between variable pairs.
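The panel structure described above can be sketched as a small grid: histograms on the diagonal and scatter plots everywhere else. The code only describes the layout and draws nothing. The function name and the label format are illustrative assumptions.

```python
def correlogram_layout(variables):
    """Describe each panel of a correlogram for the given variable names.

    Diagonal panels hold a histogram of one variable; every other panel
    holds a scatter plot of the column variable against the row variable.
    """
    layout = []
    for row_var in variables:
        row = []
        for col_var in variables:
            if row_var == col_var:
                row.append(f"hist({row_var})")
            else:
                row.append(f"scatter({col_var}, {row_var})")
        layout.append(row)
    return layout


grid = correlogram_layout(["height", "weight", "age"])
for row in grid:
    print(" | ".join(row))
```

Three variables give a 3 x 3 grid with three histograms and six scatter plots. Plotting libraries can build such a grid directly, for example `seaborn.pairplot`.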
Examples
The following diagram shows a correlogram for the height, weight, and age of humans. The
diagonal plots show a histogram for each variable. The off-diagonal elements show scatter
plots between variable pairs:
Design Practices
The following diagram shows the correlogram with data samples separated by color into
different groups:
4. Heatmap
• A heatmap is a visualization where values contained in a matrix are represented
as colors or color saturation.
• Heatmaps are great for visualizing multivariate data (data in which analysis is
based on more than two variables per observation).
• In heatmaps, categorical variables are placed in the rows and columns and a
numerical or categorical variable is represented as colors or color saturation.
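One way to turn the values of a matrix into color saturation is to rescale them to the range 0 to 1. The sketch below assumes simple min-max scaling, which is only one possible mapping. The products, websites, and unit counts are hypothetical.

```python
def to_saturation(matrix):
    """Min-max scale a matrix of numbers to saturation levels in [0, 1]."""
    flat = [value for row in matrix for value in row]
    low, high = min(flat), max(flat)
    if high == low:
        return [[0.0 for _ in row] for row in matrix]
    return [[(value - low) / (high - low) for value in row] for row in matrix]


# Hypothetical units sold: rows are products, columns are websites.
products = ["Phone", "Laptop", "Camera"]
websites = ["Site A", "Site B"]
units_sold = [
    [120, 300],
    [60, 180],
    [0, 240],
]

saturation = to_saturation(units_sold)
for product, row in zip(products, saturation):
    print(product, [round(level, 2) for level in row])
```

The cell with the most units sold receives full saturation (1.0) and the cell with the fewest receives none (0.0).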
Use
• The visualization of multivariate data can be done using heatmaps as they are
great for finding patterns in your data.
Example:
The following diagram shows a heatmap for the most popular products on the electronics
category page across various e-commerce websites, where the color shows the number of units
sold. As the key shows, darker colors represent more units sold:
Let's see the same example we saw previously in an annotated heatmap, where the color
shows the number of units sold:
Questions:
2. Explain what Scatter Plot, Bubble Plot, Correlogram and Heatmap are. Also explain their
uses and design practices with examples.
Handouts for Session 5: Composition Plots: Pie Chart, Stacked Bar Chart, Stacked Area
Chart, Venn Diagram
• For static data, you can use pie charts, stacked bar charts, or Venn diagrams.
• Pie charts or donut charts help show proportions and percentages for groups.
• Venn diagrams are the best way to visualize overlapping groups, where each group
is represented by a circle.
• For data that changes over time, you can use either stacked bar charts or stacked
area charts.
1. Pie Chart
• Pie charts illustrate numerical proportions by dividing a circle into slices.
• Each arc length represents a proportion of a category.
• The full circle equates to 100%.
• For humans, it is easier to compare bars than arc lengths; therefore, it is
recommended to use bar charts or stacked bar charts the majority of the time.
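Since the full circle equals 100%, each slice angle is the category's share of the total multiplied by 360 degrees. The following sketch works this out for hypothetical water usage figures. Sorting the slices from largest to smallest is a layout choice made here for readability.

```python
def pie_slices(amounts):
    """Return (label, percent, angle_in_degrees) per slice, largest first."""
    total = sum(amounts.values())
    ordered = sorted(amounts.items(), key=lambda item: item[1], reverse=True)
    return [
        (label, 100 * value / total, 360 * value / total)
        for label, value in ordered
    ]


# Hypothetical household water usage in litres per day.
water_usage = {"Toilet": 60, "Shower": 80, "Laundry": 40, "Other": 20}

slices = pie_slices(water_usage)
for label, percent, angle in slices:
    print(f"{label:8s} {percent:5.1f}%  {angle:6.1f} degrees")
```

The percentages always add up to 100 and the angles to 360, whatever the input amounts are.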
Use
• To compare items that are part of a whole.
Examples
The following diagram shows household water usage around the world:
Design Practices
1. Arrange the slices according to their size in increasing/decreasing order, either in a
clockwise or counterclockwise manner.
2. Make sure that every slice has a different color.
Design Practice
1. Use the same color that's used for the category for the subcategories.
2. Stacked Bar Chart
Use
• To compare variables that can be divided into sub-variables.
Example 1:
The following diagram shows a generic stacked bar chart with five groups:
Stacked bar chart to show sales of laptops and mobiles
The following diagram shows a 100% stacked bar chart with the same data that was
used in the preceding diagram:
100% stacked bar chart to show sales of laptops, PCs, and mobiles
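A 100% stacked bar chart uses the same data as a regular stacked bar chart, but every bar is rescaled so that its segments add up to 100. A minimal sketch of that rescaling follows. The quarters and sales figures are hypothetical.

```python
def to_percent_stack(groups):
    """Convert each group's absolute values into shares that sum to 100."""
    result = {}
    for group, parts in groups.items():
        total = sum(parts.values())
        result[group] = {
            name: (100 * value / total if total else 0.0)
            for name, value in parts.items()
        }
    return result


# Hypothetical unit sales per quarter.
sales = {
    "Q1": {"laptops": 30, "PCs": 10, "mobiles": 60},
    "Q2": {"laptops": 50, "PCs": 25, "mobiles": 125},
}

shares = to_percent_stack(sales)
for quarter, parts in shares.items():
    print(quarter, {name: round(share, 1) for name, share in parts.items()})
```

After rescaling, the bars all have the same height, so the chart compares relative shares rather than absolute totals.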
Example 2:
The following diagram illustrates the daily total sales of a restaurant over several days.
The daily total sales of non-smokers are stacked on top of the daily total sales of
smokers:
Design Practices
• Use contrasting colors for stacked bars.
• Ensure that the bars are adequately spaced to eliminate visual clutter.
• As a guideline, the ideal space between bars is half the width of a bar.
• Categorize data alphabetically, sequentially, or by value, to uniformly order it and
make things easier for your audience.
3. Stacked Area Chart
• Stacked area charts show trends for part-of-a-whole relations.
• The values of several groups are illustrated by stacking individual area charts on top
of one another.
• It helps to analyze both individual and overall trend information.
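Stacking means that each area is drawn on top of the running total of the areas below it. The sketch below computes those upper boundaries for hypothetical profit figures. The company names and values are illustrative only.

```python
from itertools import accumulate


def stack_series(series):
    """Return, for each series, the upper boundary of its area once stacked.

    series maps a name to a list of values, one per time step. Each
    boundary is the running sum of all series up to and including it.
    """
    names = list(series)
    steps = zip(*(series[name] for name in names))           # values per time step
    cumulative = [list(accumulate(step)) for step in steps]  # running sums per step
    return {name: [row[i] for row in cumulative] for i, name in enumerate(names)}


# Hypothetical yearly profits of three companies.
profits = {
    "A": [1, 2, 3],
    "B": [2, 2, 2],
    "C": [1, 3, 5],
}

boundaries = stack_series(profits)
print(boundaries)
```

The boundary of the last series is the overall total, so the top edge of the chart shows the overall trend, while the thickness of each band shows the individual trend.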
Use
• To show trends for time series that are part of a whole.
Examples
The following diagram shows a stacked area chart with the net profits of Google,
Facebook, Twitter, and Snapchat over a decade:
Design Practice
2. This helps in analyzing overlapping data and makes the grid lines visible.
4. Venn Diagram
• Venn diagrams, also known as set diagrams, show all possible logical relations
between a finite collection of different sets.
• Each set is represented by a circle.
• The circle size illustrates the importance of a group.
• The size of overlap represents the intersection between multiple groups.
Use
✓ To show overlaps for different sets.
Example
The following diagram shows a Venn diagram that visualizes the intersection of
students in two groups taking the same class in a semester:
From the preceding diagram, we can note that there are eight students in just group A, four
students in just group B, and one student in both groups.
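The three regions of this Venn diagram correspond directly to set operations. The following sketch reproduces the counts from the example with Python sets. The student IDs are hypothetical and were chosen only to match those counts.

```python
# Hypothetical student IDs chosen to match the counts in the example above.
group_a = {f"S{i}" for i in range(1, 10)}    # S1..S9: nine students in group A
group_b = {f"S{i}" for i in range(9, 14)}    # S9..S13: five students in group B

only_a = group_a - group_b        # part of circle A outside the overlap
only_b = group_b - group_a        # part of circle B outside the overlap
both = group_a & group_b          # the overlap (intersection)

print(len(only_a), len(only_b), len(both))
```

The union of the two groups contains 8 + 4 + 1 = 13 students in total.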
Design Practice
1. It is not recommended to use Venn diagrams if you have more than three groups. It
would become difficult to understand.
Moving on from composition plots, we will cover distribution plots in the following
section.
Questions
2. Explain pie charts in detail. How can a donut chart be more convenient than a pie chart?
3. Explain stacked bar charts with an example. Also explain their uses and the design
practices to be followed.
Handouts for Session 6: Distribution Plots: Histogram, Density Plot, Box Plot, Violin Plot
• Distribution plots give a deep insight into how your data is distributed.
• For a single variable, a histogram is effective.
• For multiple variables, you can either use a box plot or a violin plot.
• The violin plot visualizes the densities of your variables, whereas the box plot just
visualizes the median, the interquartile range, and the range for each variable.
1. Histogram
• A histogram visualizes the distribution of a single numerical variable.
• Each bar represents the frequency for a certain interval.
• Histograms provide an estimate of statistical measures, revealing where values are
concentrated and making it easy to detect outliers.
• A histogram can be plotted using absolute frequency values or, alternatively, by
normalizing the values.
• Different colors for the bars can be used to compare distributions of multiple
variables.
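The counting behind a histogram can be sketched in a few lines. The example below computes both absolute and normalized frequencies. The IQ scores, the bin edges, and the rule that the last bin includes its right edge are assumptions made for illustration.

```python
def histogram(values, bin_edges):
    """Count values per bin; bins are [left, right), the last also includes its right edge."""
    counts = [0] * (len(bin_edges) - 1)
    for value in values:
        for i in range(len(counts)):
            left, right = bin_edges[i], bin_edges[i + 1]
            is_last = i == len(counts) - 1
            if left <= value < right or (is_last and value == right):
                counts[i] += 1
                break
    return counts


# Hypothetical IQ scores of a small test group.
iq_scores = [85, 92, 98, 100, 101, 104, 107, 110, 118, 130]
edges = [80, 90, 100, 110, 120, 130]

absolute = histogram(iq_scores, edges)
relative = [count / len(iq_scores) for count in absolute]  # normalized frequencies

print(absolute)
print(relative)
```

Each entry of `absolute` is the height of one bar. Dividing by the number of observations gives the normalized version, whose bars sum to 1.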
Use
• Get insights into the underlying distribution for a dataset.
Example
The following diagram shows the distribution of the Intelligence Quotient (IQ) for a
test group. The dashed lines mark the standard deviation on each side of the mean (the
solid line):
2. Density Plot
• One advantage density plots have over histograms is that they are better at
showing the distribution shape, since the shape of a histogram depends heavily
on the number of bins (data intervals).
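A smooth density curve needs no bins. The sketch below assumes a Gaussian kernel density estimate, which is one common way to draw such a curve but is not named in this handout. The scores and the bandwidth of 5 are hypothetical.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density at x with Gaussian kernels."""
    n = len(samples)
    norm = 1 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Hypothetical test scores; the bandwidth of 5 is an arbitrary smoothing choice.
scores = [42, 45, 50, 52, 55, 58, 60, 75]
density = gaussian_kde(scores, bandwidth=5.0)

# Evaluate the smooth curve on a grid; no bins are involved.
grid = [x * 0.5 for x in range(0, 241)]   # 0.0, 0.5, ..., 120.0
curve = [density(x) for x in grid]
area = sum(curve) * 0.5                   # Riemann sum, should be close to 1

print(f"area under the curve = {area:.3f}")
```

The bandwidth plays a role similar to the bin width of a histogram: a larger value gives a smoother curve, a smaller value a bumpier one.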
Use
• To compare the distribution of several variables by plotting the density on the
same axis and using different colors.
Example
The following diagram shows a basic density plot:
Design Practices
1. Use contrasting colors to plot the density of multiple variables.
3. Box Plot
• The box extends from the lower to the upper quartile values of the data, thus
allowing us to visualize the interquartile range (IQR).
• The lines extending from the box are called whiskers; they indicate the
variability outside the lower and upper quartiles.
• There is also an option to show data outliers, usually as circles or diamonds, past the
end of the whiskers.
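The quantities a box plot draws can be computed with the standard `statistics` module. The sketch below assumes the widely used convention that whiskers reach at most 1.5 times the IQR beyond the box and that anything further out is an outlier. The handout does not state a rule, so this is an assumption, as are the height values.

```python
import statistics


def box_plot_stats(values, whisker_factor=1.5):
    """Return the quantities a box plot draws for one variable."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker_factor * iqr
    high_fence = q3 + whisker_factor * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inside),    # whiskers end at the furthest non-outlier
        "whisker_high": max(inside),
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


# Hypothetical heights (cm) of nine people; 215 is an unusually large value.
heights = [150, 155, 160, 165, 170, 175, 180, 185, 215]
stats = box_plot_stats(heights)
print(stats)
```

Here the box spans 160 to 180 (an IQR of 20), the whiskers end at 150 and 185, and 215 is drawn as a separate outlier because it lies above 180 + 1.5 x 20 = 210.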
Use
✓ Compare statistical measures for multiple variables or groups.
Example
The following diagram shows a basic box plot that shows the height of a group of people:
The following diagram shows a basic box plot for multiple variables. In this case, it shows
heights for two different groups – adults and non-adults:
4. Violin Plot
• The thick black bar in the center represents the interquartile range, while the thin
black line corresponds to the whiskers in a box plot.
Use
✓ Compare statistical measures and density for multiple variables or groups.
Example
The following diagram shows a violin plot for a single variable and shows how students have
performed in Math:
From the diagram, we can see that most of the students scored around 40-60 in the Math test.
The following diagram shows a violin plot for two variables and shows the performance of
students in English and Math:
The following diagram shows a violin plot for a single variable divided into three groups, and
shows the performance of three divisions of students in English based on their score:
Questions
Handouts for Session 7: Geo Plots: Dot Map, Choropleth Map, Connection Map
1. Dot Map
Example
The following diagram shows a dot map where each dot represents a certain number of bus
stops throughout the world:
Design Practices
1. Avoid displaying too many locations to ensure the map remains clear and the actual
locations are discernible.
2. Select an appropriate dot size and value to ensure that in dense areas, the dots blend
together, providing a clear impression of the underlying spatial distribution.
2. Choropleth Map
• For example, a tile represents a geographic region for counties and countries.
• Choropleth maps provide a good way to show how a variable varies across a
geographic area.
• One thing to keep in mind for choropleth maps is that the human eye naturally gives
more attention to larger areas, so you might want to normalize your data by dividing
each region's value by its area.
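One reading of the normalization advice above is to divide each region's value by its area, so that a large region does not dominate only because of its size. A minimal sketch under that assumption follows. The region names, populations, and areas are hypothetical.

```python
def per_area(values, areas):
    """Normalize each region's value by its area (value per square km)."""
    return {region: values[region] / areas[region] for region in values}


# Hypothetical regions: population counts and areas in square km.
population = {"North": 900_000, "South": 300_000, "Island": 120_000}
area_km2 = {"North": 9_000, "South": 1_000, "Island": 200}

density = per_area(population, area_km2)
ranked = sorted(density, key=density.get, reverse=True)

print(density)
print("densest first:", ranked)
```

The largest region has the highest raw count but the lowest value per square km, so the normalized map would color it lightest rather than darkest.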
Use
✓ To visualize geospatial data grouped into geographical regions, for example,
states or countries.
Example
The following diagram shows a choropleth map of a weather forecast in the USA:
Design Practices
1. Use darker colors for higher values, as they are perceived as being higher in
magnitude.
2. Limit the color gradation, since the human eye is limited in how many colors it can
easily distinguish between. Seven color gradations should be enough.
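Limiting the color gradation means grouping the values into a small number of classes, each drawn in one shade. The sketch below assumes equal-width classes, which is only one of several possible classification schemes, and uses seven of them as suggested above. The temperatures are hypothetical.

```python
def color_class(value, low, high, classes=7):
    """Map a value to one of `classes` equal-width bins, 0 (lightest) to classes - 1 (darkest)."""
    if high <= low:
        raise ValueError("high must be greater than low")
    index = int((value - low) * classes / (high - low))
    return min(max(index, 0), classes - 1)   # clamp so that `high` falls in the top class


# Hypothetical temperatures (degrees Celsius) per state.
temperatures = {"A": 0, "B": 10, "C": 20, "D": 34, "E": 35}
low, high = min(temperatures.values()), max(temperatures.values())

state_classes = {state: color_class(t, low, high) for state, t in temperatures.items()}
print(state_classes)
```

Higher class numbers would be assigned darker colors, in line with design practice 1.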
3. Connection Map
• The link between the locations can be drawn with a straight or rounded line,
representing the shortest distance between them.
• Each line has the same thickness and value (the number of connections each line
represents).
• The lines are not meant to be counted; they are only intended to give an impression of
magnitude.
• The size and value of a connection line are important factors for the effectiveness and
impression of the visualization.
Use
✓ To visualize connections.
Example
The following diagram shows a connection map of flight connections around the world:
Design Practices
1. Avoid displaying too many connections, as it can make data analysis challenging.
Ensure the map remains clear enough to identify the actual locations of the start and
end points.
2. Choose a line thickness and value so that the lines start to blend in dense areas. The
connection map should give a good impression of the underlying spatial
distribution.
Questions:
3. Explain the uses of the choropleth map and the design practices to be followed.
✓ Use colors to differentiate variables/subjects rather than symbols, as colors are more
perceptible.
✓ To show additional variables on a 2D plot, use color, shape, and size.
✓ Keep it simple and don't overload the visualization with too much information.