
Visualization

Uploaded by

23020418

Intro
Structure

Basic Visualization
1. Why visualize
2. How to visualize
3. Visualization tools
4. Case study: US presidential election

Next lecture
5. Interactive visualization with Plotly
6. Pitfalls of visualization
Topic 1: Visualization
Why visualize

• Numbers are "boring"; visuals are more appealing.
• "Without data visualization, data analysis is like finding a needle in a haystack... without knowing what a needle looks like!"
• Humans quantify things visually. → You can easily be fooled by a misleading visualization.

Did we just double the profit each month?
Topic 1: Visualization
Why visualize

• Statistical measures, such as mean and variance, do not always tell you the truth about the data.
• All of the following datasets have the same mean and variance. Are they the same data?

→ Without proper visualization, you can never tell what the data is about.
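The point about identical summary statistics can be checked with a small sketch. The two samples below are made up for illustration (they are not the datasets shown on the slide), and `rescale` is a hypothetical helper that forces any sample to a chosen mean and variance.

```python
from statistics import mean, pvariance


def rescale(values, target_mean, target_var):
    """Shift and stretch values so they have the given mean and population variance."""
    m = mean(values)
    scale = (target_var / pvariance(values)) ** 0.5
    return [target_mean + (v - m) * scale for v in values]


# Two made-up samples with very different shapes.
evenly_spread = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
two_clusters = [1, 1, 1, 1, 1, 10, 10, 10, 10, 10]

a = rescale(evenly_spread, target_mean=5.0, target_var=4.0)
b = rescale(two_clusters, target_mean=5.0, target_var=4.0)

for name, data in [("evenly spread", a), ("two clusters", b)]:
    distinct = len({round(v, 6) for v in data})
    print(f"{name}: mean={mean(data):.2f}, "
          f"variance={pvariance(data):.2f}, distinct values={distinct}")
```

Both samples report the same mean and variance by construction, yet one is spread evenly and the other sits in two clusters. Only a plot (or a look at the raw values) reveals the difference.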
Topic 1: Visualization
How do we visualize

• There are hundreds of different kinds of visualizations.
• The right one often depends on your specific data and purpose.
• A visualization can also be static or interactive (next lecture).
→ You don't need to remember all of them, but you have to know they exist.
Topic 1: Visualization
Distribution plot

To show how data is distributed and to identify patterns, central tendency, spread, and outliers.
• Histogram: Shows the distribution of a variable by dividing it into bins.
• Kernel Density Estimate (KDE) Plot: A smoothed curve that represents the distribution of data.
• Box Plot: Displays a summary of a dataset's distribution: minimum, first quartile, median, third quartile, and maximum.
• Violin Plot: Combines a box plot and a KDE plot.
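As a rough illustration of these plot types, the sketch below draws a histogram, a box plot, and a violin plot of the same sample with Matplotlib. The data is randomly generated and the output file name is an arbitrary choice. A standalone KDE plot is left out because it usually needs an extra library (for example Seaborn's `kdeplot`).

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; remove to use plt.show() instead
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=0)
values = rng.normal(loc=50, scale=10, size=500)  # made-up sample data

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# Histogram: count how many values fall into each of 20 bins
counts, bin_edges, _ = axes[0].hist(values, bins=20)
axes[0].set_title("Histogram")

# Box plot: median, quartiles, whiskers, and outliers
axes[1].boxplot(values)
axes[1].set_title("Box plot")

# Violin plot: a smoothed density shape with the median marked
axes[2].violinplot(values, showmedians=True)
axes[2].set_title("Violin plot")

fig.tight_layout()
fig.savefig("distribution_plots.png")
```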
Topic 1: Visualization
Comparison Plots

To compare data across different groups or over time.
• Bar Plot (Bar Chart): Compares categorical data using rectangular bars where the height/length represents the value.
• Grouped Bar Plot: Shows comparisons across multiple categories within a main category.
• Line Plot (Line Chart): Used for tracking changes and trends over time or ordered categories.
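A minimal sketch of a grouped bar plot and a line plot in Matplotlib follows. The quarterly figures and product names are invented for the example.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; remove to use plt.show() instead
import matplotlib.pyplot as plt
import numpy as np

quarters = ["Q1", "Q2", "Q3", "Q4"]
product_a = [20, 35, 30, 35]  # made-up values
product_b = [25, 32, 34, 20]

x = np.arange(len(quarters))  # one slot per quarter
width = 0.4                   # width of each bar within a slot

fig, (ax_bar, ax_line) = plt.subplots(1, 2, figsize=(10, 4))

# Grouped bar plot: shift the two series left and right of each slot
bars_a = ax_bar.bar(x - width / 2, product_a, width, label="Product A")
bars_b = ax_bar.bar(x + width / 2, product_b, width, label="Product B")
ax_bar.set_xticks(x)
ax_bar.set_xticklabels(quarters)
ax_bar.set_title("Grouped bar plot")
ax_bar.legend()

# Line plot: the same data shown as a trend over ordered categories
ax_line.plot(quarters, product_a, marker="o", label="Product A")
ax_line.plot(quarters, product_b, marker="o", label="Product B")
ax_line.set_title("Line plot")
ax_line.legend()

fig.tight_layout()
fig.savefig("comparison_plots.png")
```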
Topic 1: Visualization
Relationship Plots

To show relationships or correlations between two or more variables.
• Scatter Plot: Displays values for two continuous variables using points on a 2D space to see correlations or patterns.
• Bubble Plot: An extension of the scatter plot that also shows a third variable through the size of the bubbles.
• Heatmap: Uses a color gradient to represent the relationship and intensity between two dimensions or categories.
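The three relationship plots can be sketched with Matplotlib as below. The variables `x`, `y`, and `z` are randomly generated stand-ins, and using a correlation matrix for the heatmap is one common choice rather than the only one.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; remove to use plt.show() instead
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=1)
x = rng.uniform(0, 10, size=60)
y = 2 * x + rng.normal(0, 2, size=60)  # y roughly follows x
z = rng.uniform(0, 10, size=60)        # unrelated third variable

fig, (ax_scatter, ax_bubble, ax_heat) = plt.subplots(1, 3, figsize=(14, 4))

# Scatter plot: one point per (x, y) pair
ax_scatter.scatter(x, y)
ax_scatter.set_title("Scatter plot")

# Bubble plot: same points, marker area encodes z
ax_bubble.scatter(x, y, s=z * 30, alpha=0.5)
ax_bubble.set_title("Bubble plot")

# Heatmap: color encodes the pairwise correlation of the three variables
corr = np.corrcoef([x, y, z])
image = ax_heat.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax_heat.set_xticks(range(3))
ax_heat.set_xticklabels(["x", "y", "z"])
ax_heat.set_yticks(range(3))
ax_heat.set_yticklabels(["x", "y", "z"])
ax_heat.set_title("Heatmap (correlation)")
fig.colorbar(image, ax=ax_heat)

fig.tight_layout()
fig.savefig("relationship_plots.png")
```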
Topic 1: Visualization
Composition Plots

To show the proportions of a whole and how they change over time.
• Pie Chart: Shows parts of a whole as slices of a pie.
• Stacked Bar Chart: Displays the composition of multiple values within a bar.
• Area Chart: Similar to a line chart, but the area beneath the line is filled to show the volume.
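The sketch below draws the three composition plots with Matplotlib. The revenue numbers and the "Online" and "Retail" labels are made up, and the area chart uses `stackplot`, which stacks the filled areas on top of each other.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; remove to use plt.show() instead
import matplotlib.pyplot as plt

years = [2020, 2021, 2022, 2023]
online = [10, 14, 18, 25]  # made-up revenue per product line
retail = [20, 18, 17, 15]

fig, (ax_pie, ax_bar, ax_area) = plt.subplots(1, 3, figsize=(14, 4))

# Pie chart: share of each product line in the last year only
wedges, texts, autotexts = ax_pie.pie(
    [online[-1], retail[-1]], labels=["Online", "Retail"], autopct="%1.0f%%"
)
ax_pie.set_title("Pie chart (2023)")

# Stacked bar chart: retail bars start where the online bars end
ax_bar.bar(years, online, label="Online")
stacked = ax_bar.bar(years, retail, bottom=online, label="Retail")
ax_bar.set_xticks(years)
ax_bar.set_title("Stacked bar chart")
ax_bar.legend()

# Area chart: filled, stacked areas over time
ax_area.stackplot(years, online, retail, labels=["Online", "Retail"])
ax_area.set_xticks(years)
ax_area.set_title("Area chart")
ax_area.legend(loc="upper left")

fig.tight_layout()
fig.savefig("composition_plots.png")
```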
Topic 1: Visualization
Ranking Plots

To display data in order of importance or rank.
• Bar Chart (Ordered/Sorted): Displays categories sorted by their value.
• Dot Plot: An alternative to bar charts that plots points to indicate rank or order.
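A ranking plot mostly comes down to sorting before plotting. The following sketch sorts a made-up score table and draws it as a horizontal bar chart and as a dot plot. The category names and scores are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; remove to use plt.show() instead
import matplotlib.pyplot as plt

scores = {"D": 12, "A": 30, "C": 18, "B": 25}  # made-up values

# Sort ascending so the largest value ends up at the top of a horizontal chart
ranked = sorted(scores.items(), key=lambda item: item[1])
names = [name for name, _ in ranked]
values = [value for _, value in ranked]

fig, (ax_bar, ax_dot) = plt.subplots(1, 2, figsize=(10, 4))

# Sorted bar chart
ax_bar.barh(names, values)
ax_bar.set_title("Sorted bar chart")

# Dot plot: one marker per category instead of a bar
ax_dot.plot(values, names, "o")
ax_dot.set_title("Dot plot")
ax_dot.grid(True, axis="x")

fig.tight_layout()
fig.savefig("ranking_plots.png")
```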
Topic 1: Visualization
Part-to-Whole Plots

To represent a part-to-whole relationship in data.
• Donut Chart: A variation of a pie chart with a central cut-out.
• Treemap: Uses nested rectangles to display data hierarchies.
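One way to get a donut chart in Matplotlib is to draw a pie chart whose wedges are narrower than the full radius. The sketch below does that with made-up shares. A treemap is not shown because plain Matplotlib has no built-in treemap function, so it would need a third-party package.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; remove to use plt.show() instead
import matplotlib.pyplot as plt

labels = ["Rent", "Food", "Other"]  # made-up budget categories
sizes = [50, 30, 20]

fig, ax = plt.subplots(figsize=(5, 5))

# width=0.4 keeps only the outer 40% of each wedge, leaving a hole in the middle
wedges, texts = ax.pie(
    sizes, labels=labels, startangle=90, wedgeprops={"width": 0.4}
)
ax.set_title("Donut chart")

fig.savefig("donut_chart.png")
```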
Topic 1: Visualization
Visualization tool

Aspect                    Seaborn                                               Matplotlib
Level of Abstraction      High-level                                            Low-level
Ease of Use               Easier to learn and use                               More complex, requires more code
Default Aesthetics        Built-in themes and color palettes                    More customization required
Statistical Graphics      Specialized functions for statistical visualizations  Requires more manual setup
Integration with Pandas   Seamless integration                                  Requires more data manipulation
Customization             Less flexible customization                           Highly customizable
Topic 1: Visualization
Visualization tool

Matplotlib Seaborn
Topic 1: Visualization
Visualization tool
You can choose between Matplotlib (plt) and Seaborn (sns).

Matplotlib:

import matplotlib.pyplot as plt
import numpy as np

# Sample data
x = np.linspace(0, 10, 100)
y = np.sin(x)

# Create the plot
plt.figure(figsize=(8, 6))
plt.plot(x, y, label='Sine Wave')

# Find peaks: indices where the slope changes from positive to negative
peaks = np.where(np.diff(np.sign(np.diff(y))) < 0)[0] + 1

# Plot points at peaks
plt.plot(x[peaks], y[peaks], 'ro')

# Add labels, title, and annotation
plt.xlabel('x-axis')
plt.ylabel('y-axis')
plt.title('Simple Sine Wave Plot')
plt.text(5, 0.5, 'Peak Value', fontsize=12)

# Add source reference outside the plot
plt.figtext(0.05, 0.05, 'Source: Generated Data', fontsize=10)

plt.grid(True)
plt.legend()
plt.show()

Seaborn:

import matplotlib.pyplot as plt  # needed for the plt.* calls below
import seaborn as sns
import pandas as pd
import numpy as np

sns.set_palette('Set2')

# Sample data in DataFrame format
x = np.linspace(0, 10, 100)
df = pd.DataFrame({'x': x, 'y': np.sin(x)})

# Create the plot
sns.lineplot(x='x', y='y', data=df, label='Sine Wave')

# Find peaks: indices where the slope changes from positive to negative
peaks = np.where(np.diff(np.sign(np.diff(df['y']))) < 0)[0] + 1

# Plot points at peaks (select rows by position)
sns.scatterplot(x=df['x'].iloc[peaks], y=df['y'].iloc[peaks], color='red')

# Add labels, title, and annotation
plt.xlabel('x-axis')
plt.ylabel('y-axis')
plt.title('Simple Sine Wave Plot')
plt.text(5, 0.5, 'Peak Value', fontsize=12)

# Add source reference outside the plot
plt.figtext(0.05, 0.02, 'Source: Generated Data', fontsize=10)

plt.show()
Topic 1: Visualization
Visualization tool
Same code as above. Callout: Prepare data.
Topic 1: Visualization
Visualization tool
Same code as above. Callout: Create a plot.
Topic 1: Visualization
Visualization tool
Same code as above. Callouts:
• Seaborn has high-level plot functions that are easier to use.
• In Matplotlib, you have to customize the visualization type yourself.
Topic 1: Visualization
Visualization tool
Same code as above. Callout: Add labels for the axes.
Topic 1: Visualization
Visualization tool
Same code as above. Callout: Seaborn also supports Matplotlib functions.
Topic 1: Visualization
Structure of a plot

A nice and cohesive plot often contains:
• Clear Title and Labels
• Axis Labels
• Legends
• Gridlines
• Appropriate Color
• Data source
Topic 1: Visualization
Case study: US presidential election
Summary of the U.S. election process:
• Two types of votes: the popular vote and the electoral vote.
• The popular vote in each state determines which candidate receives that state's electors.
• In most states, the winner takes all of the state's electoral votes.
• States have different numbers of electoral votes.
• The candidate with at least 270 out of 538 electoral votes wins.
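The winner-takes-all rule can be sketched in a few lines of Python. The states, candidates, and vote counts below are entirely made up, and `tally_electoral_votes` is a hypothetical helper that assumes every state gives all of its electors to its popular-vote winner.

```python
def tally_electoral_votes(popular_votes, electors):
    """Give each state's electors to the candidate with the most popular votes there."""
    totals = {}
    for state, votes in popular_votes.items():
        winner = max(votes, key=votes.get)
        totals[winner] = totals.get(winner, 0) + electors[state]
    return totals


# Made-up example: three states, two candidates
popular_votes = {
    "State A": {"X": 600_000, "Y": 400_000},
    "State B": {"X": 1_000_000, "Y": 1_100_000},
    "State C": {"X": 2_000_000, "Y": 2_050_000},
}
electors = {"State A": 3, "State B": 10, "State C": 20}

electoral_totals = tally_electoral_votes(popular_votes, electors)

# Nationwide popular vote per candidate, for comparison
popular_totals = {}
for votes in popular_votes.values():
    for candidate, count in votes.items():
        popular_totals[candidate] = popular_totals.get(candidate, 0) + count

print("Electoral votes:", electoral_totals)
print("Popular votes:  ", popular_totals)
```

In this toy data, candidate X has more popular votes nationwide while candidate Y collects more electoral votes. A single map has to deal with that gap between the two counts.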
Topic 1: Visualization
Case study: US presidential election
Challenges to visualize:
• A state's area does not match its number of electoral votes.
• The vote in each county does not represent the whole state (winner takes all).
• Population differs significantly across counties and states.
Can we use a single visualization to address all of these problems?
Topic 1: Visualization
Case study: US presidential election

This could be misleading: the vast majority of the US land area is empty.
Topic 1: Visualization
Case study: US presidential election

We can scale each state by its electoral votes; each hexagon is one electoral vote.

→ The map is distorted; we can hardly recognize the US.
→ What about the majority vote?
Topic 1: Visualization
Case study: US presidential election

We keep the original map, but show the number of electoral votes.

→ Still doesn't show the majority vote.
→ We don't know which area
Topic 1: Visualization
Case study: US presidential election

The arrows show the orientation of voters (toward Democratic or Republican).

→ Doesn't show the electoral votes.
→ Isn't scaled by population.
Topic 1: Visualization
Case study: US presidential election

Circle size is proportional to the amount by which each county's leading candidate is ahead.

→ Loses the state boundaries.
→ In dense areas like the East Coast, the visualization is really hard to read.
There is no "one size fits all" solution; we choose the visualization that conveys what we want to say.
