Unit 5
Data Visualization
Introduction
Bar Charts, Pie Charts, Box Plots, Scatter Plots, Bubble Plots
Creating Maps & Visualizing Geospatial Data – Folium, Maps with Markers, Choropleth Maps.
An Introduction
Visualizing data is a crucial part of data analysis as it helps people uncover insights that
may be hidden in raw tables.
Data visualization tools help in creating visual representations of the data that can
highlight trends, patterns, correlations, and outliers that would be harder to see in a
dataset presented in raw form.
• Visual variables such as color, size, shape, and position are used to encode data
attributes. For example, a scatter plot might use the x-axis and y-axis to represent two
variables and color to represent a third variable.
Types of Visualizations:
• Charts and Graphs: Line charts, bar charts, pie charts, histograms, scatter plots,
etc.
• Effectively using visualizations to tell a compelling story that leads the viewer
through key insights.
Color Theory:
• Ensuring that visualizations accurately represent the underlying data and avoid
misleading interpretations.
There are different types of data visualization tools, each serving a unique purpose.
Python is one of the most popular languages for data analysis, and it offers several
libraries for data visualization.
Matplotlib:
o One of the oldest and most widely used Python libraries for static,
animated, and interactive visualizations.
o Provides a variety of plot types, including line plots, bar charts, histograms,
scatter plots, and more.
o It allows you to control all aspects of the plot, such as axes, titles, labels, and
legends.
Matplotlib Basics:
Matplotlib is typically used in conjunction with NumPy for handling numerical data. The
basic steps to create a simple plot involve:
1. Importing Matplotlib:
Creating Data:
Creating a Plot:
• Use Matplotlib functions to create a figure and one or more axes (subplots)
Examples:
# Line plot
plt.plot(x, y)
# Scatter plot
plt.scatter(x, y)
# Bar plot
plt.bar(x, height)
# Histogram
plt.hist(data, bins=10)
2. Adding Labels and Titles:
• Add labels to the axes and a title to the plot for clarity.
plt.xlabel('X-axis Label')
plt.ylabel('Y-axis Label')
plt.title('Plot Title')
3. Displaying the Plot:
• Use plt.show() to display the plot in a standalone script or non-interactive mode.
• In Jupyter Notebooks, %matplotlib inline may be used to display plots inline.
Example:
Here's a simple example of creating a line plot using Matplotlib:
import matplotlib.pyplot as plt
import numpy as np
Output:
This code creates a plot of a sine wave using the matplotlib library.
1. x = np.linspace(0, 2 * np.pi, 100)
2 * np.pi: The stopping value of the range (2π, which is approximately 6.2832).
This corresponds to one full period of the sine wave.
This line generates 100 values between 0 and 2π, which will serve as the
x-coordinates for plotting the sine wave.
2. y = np.sin(x)
This line calculates the sine values corresponding to each x value. The
resulting y array will contain the sine of all 100 x values.
Since the sine function is periodic, this array will represent the y-coordinates for
the sine wave.
3. plt.plot(x, y)
This line creates the sine wave plot, where the x-values are plotted on the
horizontal axis (X-axis), and the sine values (y-values) are plotted on the vertical
axis (Y-axis).
4. plt.xlabel('X-axis')
6. plt.title('Sine Wave')
7. plt.show()
It renders the plot and shows the sine wave on the screen with the specified
labels and title.
Visual Representation:
The resulting plot will display a sine wave, showing the relationship between x (the
input angle) and y (the sine of that angle). The x-values range from 0 to 2π, and the sine
function oscillates between -1 and 1, creating a smooth periodic curve.
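The steps walked through above can be put together as a minimal sketch. The y-axis label is an assumption, because that step is missing from the walkthrough; everything else follows the numbered explanation.

```python
import matplotlib.pyplot as plt
import numpy as np

# 100 evenly spaced x values covering one full period, from 0 to 2*pi
x = np.linspace(0, 2 * np.pi, 100)
# Sine of each x value
y = np.sin(x)

plt.plot(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')  # assumed label; this step is not shown in the walkthrough
plt.title('Sine Wave')
plt.show()
```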
Final Plot:
Installation:
If you haven't installed Matplotlib yet, you can install it using:
pip install matplotlib
Basic Line Plot:
Line Plot
A line plot (also called a line chart) is used to display data points in a sequence,
typically over time. It connects the individual data points with a line, making it ideal for
showing trends or patterns in data.
Trends over Time: When you want to visualize how a variable changes over
time
Example: You might use a line plot to show monthly sales figures over the past year
or the change in temperature over several weeks.
Continuous Data: Line plots are best for continuous data where each data point is
connected to its predecessor.
Example: Showing the relationship between distance and time for an object
moving at a constant speed.
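As a rough illustration of the distance-versus-time case, the sketch below plots an object moving at a constant speed. The speed and time range are hypothetical values chosen only for the example.

```python
import matplotlib.pyplot as plt
import numpy as np

speed = 5.0              # hypothetical constant speed in metres per second
time = np.arange(0, 11)  # time steps from 0 to 10 seconds
distance = speed * time  # distance covered at each time step

plt.plot(time, distance, marker='o')
plt.xlabel('Time (s)')
plt.ylabel('Distance (m)')
plt.title('Distance vs Time at Constant Speed')
plt.show()
```

Because the speed is constant, the points fall on a straight line, which is what makes a line plot a natural fit here.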
Output:
Scatter Plot:
Example: You can use a scatter plot to check the relationship between hours
studied and exam scores.
Outliers and Clusters: Scatter plots help identify outliers (data points that don't
fit the pattern) or clusters (groups of data points).
Non-linear Relationships: Scatter plots are particularly useful when you don’t
know if the relationship between variables is linear, and you just want to see the
spread of data points.
o Example: You might use a scatter plot to see if there’s any correlation
between age and income.
import numpy as np
import matplotlib.pyplot as plt
# Generate some example data
np.random.seed(0)
x = np.random.rand(50) * 10 # Random x values
y = 2 * x + 5 + np.random.randn(50) * 2 # Linear relationship with some noise
# Scatter plot
plt.scatter(x, y)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Scatter Plot: X vs Y')
plt.show()
Output:
Bar Plot:
A bar plot (also called a bar chart) is used to compare discrete categories of data
by using rectangular bars with lengths proportional to the values they represent. Bar
plots are especially useful for comparing quantities across different groups or
categories.
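A minimal bar plot might look like the sketch below. The category names and values are hypothetical and stand in for whatever groups are being compared.

```python
import matplotlib.pyplot as plt

# Hypothetical totals for four categories
categories = ['A', 'B', 'C', 'D']
values = [23, 45, 12, 36]

fig, ax = plt.subplots()
bars = ax.bar(categories, values, color='steelblue')  # one bar per category
ax.set_xlabel('Category')
ax.set_ylabel('Value')
ax.set_title('Bar Plot: Value by Category')
plt.show()
```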
Output:
Area Plot:
An area plot is a type of plot that displays the data along with the area between
the curve and the x-axis, often used to show the cumulative total of the data. It is
particularly useful when you want to visualize the relationship between quantities over
time and highlight the magnitude of the change.
An area plot that shows the cumulative values of two datasets (Sales A and Sales B) over
time. The area under each curve is filled with color, showing how each series
contributes to the total at each point.
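One way to draw the stacked area plot described above is Matplotlib's stackplot function. This is a sketch with hypothetical monthly figures for Sales A and Sales B, not the original data.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical monthly sales for two series
months = np.arange(1, 7)
sales_a = np.array([10, 12, 15, 18, 20, 24])
sales_b = np.array([8, 9, 11, 14, 13, 17])
total = sales_a + sales_b  # height of the top edge of the stack

fig, ax = plt.subplots()
# Each series is stacked on top of the previous one and filled with color
polys = ax.stackplot(months, sales_a, sales_b,
                     labels=['Sales A', 'Sales B'], alpha=0.7)
ax.set_xlabel('Month')
ax.set_ylabel('Sales')
ax.set_title('Area Plot: Sales A and Sales B')
ax.legend(loc='upper left')
plt.show()
```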
Pie Chart
Proportional Data: When you want to visualize the relative sizes of parts that
make up a whole.
Category Comparison: When you have a small number of categories (ideally 2
to 6) and want to see how each category contributes to the total.
Parts of a Whole: When the sum of the data points is important, and you want to
emphasize how each category contributes to the total.
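The use cases above can be sketched with a small pie chart. The four regions and their shares are hypothetical and were chosen so that they sum to 100.

```python
import matplotlib.pyplot as plt

# Hypothetical share of total sales per region (sums to 100)
labels = ['North', 'South', 'East', 'West']
sizes = [40, 25, 20, 15]

fig, ax = plt.subplots()
# autopct prints each wedge's percentage of the whole
ax.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90)
ax.axis('equal')  # keep the pie circular
ax.set_title('Share of Sales by Region')
plt.show()
```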
Box Plots
When to Use a Box Plot:
Data Distribution: When you want to show the spread of data and identify the
central tendency and variability.
Outliers Detection: Useful for detecting outliers or extreme values in a dataset.
Comparing Multiple Groups: When comparing distributions of multiple groups
(e.g., comparing test scores across different groups).
Skewness and Symmetry: When you want to check if the data is symmetric or
skewed (left-skewed or right-skewed).
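To illustrate comparing several groups, the sketch below draws one box per group of test scores. The scores are randomly generated, hypothetical data with a fixed seed.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical test scores for three groups (mean, spread, 40 students each)
rng = np.random.default_rng(0)
group_names = ['Group A', 'Group B', 'Group C']
scores = [rng.normal(70, 8, 40),
          rng.normal(75, 12, 40),
          rng.normal(65, 5, 40)]

fig, ax = plt.subplots()
result = ax.boxplot(scores)      # one box per group
ax.set_xticklabels(group_names)  # label the boxes in order
ax.set_ylabel('Test score')
ax.set_title('Box Plot: Test Scores by Group')
plt.show()
```

Each box spans the interquartile range, the line inside it marks the median, and points beyond the whiskers are drawn as potential outliers.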
Histograms
A histogram is used to show the frequency distribution of a dataset. It groups data into
bins (intervals) and displays the number of data points that fall into each bin.
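import numpy as np
import matplotlib.pyplot as plt
# Hypothetical sample data: 500 ages drawn from a normal distribution
data = np.random.normal(35, 10, 500)
# Create histogram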
plt.hist(data, bins=20, color='skyblue', edgecolor='black')
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
Output:
When to Use a Histogram:
Bubble Plots
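import numpy as np
import matplotlib.pyplot as plt
# Sample data
x = np.random.rand(20) * 10 # Random x values
y = np.random.rand(20) * 10 # Random y values
sizes = np.random.rand(20) * 1000 # Bubble sizes
plt.scatter(x, y, s=sizes, alpha=0.5) # Marker area (s) encodes the third variable
plt.show()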
Seaborn
Plotly
import plotly.express as px
data = px.data.iris()
fig = px.scatter(data, x="sepal_width", y="sepal_length", color="species")
fig.show()
Bokeh
Altair
ggplot (plotnine)
Pyplot
Let’s look at a practical example of how data visualization tools can help in data analysis,
using Python's Seaborn and Matplotlib libraries.
Imagine you are working as a data analyst at a retail company, and you have a dataset
containing sales information for the last year. The dataset includes the following columns:
Date, Product, Region, Sales Amount, and Units Sold.
import pandas as pd
#Sample dataset
data = {
"Date": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05"],
"Product": ["Product A", "Product B", "Product A", "Product C", "Product B"],
"Region": ["North", "South", "East", "West", "North"],
"Sales Amount": [100, 150, 200, 250, 300],
"Units Sold": [10, 15, 20, 25, 30]
}
df = pd.DataFrame(data)
df["Date"] = pd.to_datetime(df["Date"])
You want to understand how the sales amount has changed over time. A line plot can
help you visualize trends.
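A minimal sketch of such a line plot follows. It rebuilds the two relevant columns of the sample dataset so that it runs on its own, and it uses Matplotlib directly; the figure size and marker style are illustrative choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Same sample values as the dataset above, reduced to the columns this plot needs
df = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03",
                            "2024-01-04", "2024-01-05"]),
    "Sales Amount": [100, 150, 200, 250, 300],
})

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(df["Date"], df["Sales Amount"], marker="o")
ax.set_xlabel("Date")
ax.set_ylabel("Sales Amount")
ax.set_title("Sales Amount Over Time")
fig.autofmt_xdate()  # tilt the date labels so they do not overlap
plt.show()
```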
Output:-
Insight from the Line Plot:
This line plot helps you quickly identify whether there are certain periods where sales
were higher or lower. For example, you may see a peak in sales on certain days or periods
with lower sales, which could be linked to external factors such as holidays, marketing
campaigns, or promotions.
Next, you want to know which products are the best sellers based on sales amount. A
bar plot can provide a clear comparison between different products.
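One possible way to build this comparison is to total the sales amount per product and plot the result. The sketch rebuilds the relevant columns of the sample dataset; the grouping and sorting steps are assumptions about how the comparison might be prepared.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Same sample values as the dataset above, reduced to the columns this plot needs
df = pd.DataFrame({
    "Product": ["Product A", "Product B", "Product A", "Product C", "Product B"],
    "Sales Amount": [100, 150, 200, 250, 300],
})

# Total sales per product, largest first
sales_by_product = (df.groupby("Product")["Sales Amount"]
                      .sum()
                      .sort_values(ascending=False))

fig, ax = plt.subplots(figsize=(8, 5))
ax.bar(sales_by_product.index, sales_by_product.values, color="steelblue")
ax.set_xlabel("Product")
ax.set_ylabel("Total Sales Amount")
ax.set_title("Total Sales by Product")
plt.show()
```

With the five sample rows, Product B totals 450, Product A 300, and Product C 250.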
Output:-
This bar plot allows you to compare the sales performance of each product. From the plot,
you can quickly determine which product generates the most revenue. For example, if
"Product A" shows the highest sales, you might prioritize marketing efforts or stock
management around it.
You want to understand how sales vary across different regions. A box plot is useful to
visualize the spread of sales values and identify any outliers in the data.
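The sketch below draws one box per region using Matplotlib directly rather than Seaborn. It rebuilds the relevant columns of the sample dataset; with only five rows most regions hold a single value, so the boxes collapse to a line, and a real dataset would show a proper spread.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Same sample values as the dataset above, reduced to the columns this plot needs
df = pd.DataFrame({
    "Region": ["North", "South", "East", "West", "North"],
    "Sales Amount": [100, 150, 200, 250, 300],
})

# Collect the sales values for each region, in order of first appearance
regions = list(df["Region"].unique())
groups = [df.loc[df["Region"] == r, "Sales Amount"].to_numpy() for r in regions]

fig, ax = plt.subplots(figsize=(8, 5))
result = ax.boxplot(groups)  # one box per region
ax.set_xticklabels(regions)
ax.set_xlabel("Region")
ax.set_ylabel("Sales Amount")
ax.set_title("Sales Distribution by Region")
plt.show()
```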
Output:-
This box plot helps you identify the distribution of sales across different regions. For
example, if the "North" region has a higher median sales value and fewer outliers, you
might conclude that it’s a more stable and profitable region. On the other hand, if another
region shows more variability (larger spread), you could investigate whether certain
factors like promotions or seasonal changes are impacting sales in that region.
Summary of How Data Visualization Helps
Trends and Patterns: The line plot helps you identify trends over time. You can
see whether sales are increasing, decreasing, or fluctuating based on specific
periods.
Comparative Analysis: The bar plot provides a clear comparison between
products. You can identify the best-selling products and allocate resources
accordingly.
Identifying Outliers and Variability: The box plot shows the distribution of sales
data across regions, highlighting outliers and the spread of sales figures. This helps
in understanding regional differences in performance and focusing on regions that
need improvement.
Additional visualizations, including Area Plots, Histograms, Pie Charts, Box Plots,
Scatter Plots, and Bubble Plots
1. Area Plot
Area plots are useful for visualizing trends over time, similar to line plots but with the
area under the line filled to highlight the magnitude.
Output:-
Visualization: An area plot of sales trends will fill the area below the line, showing the
volume of sales more clearly.
2. Histogram
Histograms are great for understanding the distribution of a single variable, such as
sales amount or units sold.
# Histogram for Units Sold
plt.figure(figsize=(10, 6))
plt.hist(df['Units Sold'], bins=20, color='lightcoral', edgecolor='black')
plt.title('Distribution of Units Sold')
plt.xlabel('Units Sold')
plt.ylabel('Frequency')
plt.show()
Output:
Visualization: Histograms will show the frequency of different ranges of sales amounts
and units sold. You can see whether most sales fall into small or large ranges.
3. Pie Chart
Pie charts are great for showing the percentage distribution of a whole, such as regional
sales contributions.
Output:
Visualization: A pie chart will clearly show the proportion of sales from each region as
segments of the pie.
4. Box Plot
Box plots (also called box-and-whisker plots) are useful for visualizing the spread of a
dataset, including the median, quartiles, and outliers.
Output:
Visualization: The box plot will display the median, interquartile range (IQR), and any
potential outliers in the sales amount and units sold.
5. Scatter Plot
Scatter plots are useful for identifying relationships or correlations between two
numeric variables, such as Units Sold vs Sales Amount.
Output:
Visualization: A scatter plot will allow you to visually assess if there's a relationship
between the number of units sold and the sales amount.
6. Bubble Plot
Bubble plots are an extension of scatter plots where the size of the marker represents a
third variable. This can help visualize the relationship between Units Sold, Sales
Amount, and an additional variable, such as Region.
Output:
Visualization: The bubble plot will show the correlation between units sold and sales
amount, with the bubble size representing the total sales amount, and color coding
based on regions.
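A bubble plot along these lines can be sketched with plt.scatter. The sketch rebuilds the relevant columns of the sample dataset; the size scaling factor and the colormap are illustrative assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Same sample values as the dataset above, reduced to the columns this plot needs
df = pd.DataFrame({
    "Region": ["North", "South", "East", "West", "North"],
    "Sales Amount": [100, 150, 200, 250, 300],
    "Units Sold": [10, 15, 20, 25, 30],
})

# Turn region names into integer codes so they can drive the marker color
region_codes, region_names = pd.factorize(df["Region"])

fig, ax = plt.subplots(figsize=(8, 6))
sc = ax.scatter(df["Units Sold"], df["Sales Amount"],
                s=df["Sales Amount"] * 2,  # bubble area scales with sales amount
                c=region_codes, cmap="tab10", alpha=0.6)
handles, _ = sc.legend_elements()          # one legend handle per region color
ax.legend(handles, list(region_names), title="Region")
ax.set_xlabel("Units Sold")
ax.set_ylabel("Sales Amount")
ax.set_title("Bubble Plot: Units Sold vs Sales Amount")
plt.show()
```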
Subplot
The argument (111) is a shorthand way of specifying the grid layout and the
location of the subplot within the grid.
subplot(111):
The first digit (1): This specifies the number of rows in the grid.
The second digit (1): This specifies the number of columns in the grid.
The third digit (1): This specifies the index of the subplot you want to create in
that grid.
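subplot(111) means: a grid with 1 row and 1 column, with the subplot placed in the
first (and only) position of that grid.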
If you wanted to create multiple subplots (say, a 2x2 grid), you would use the following
code:
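import matplotlib.pyplot as plt
# Create the figure that will hold the 2x2 grid of subplots
fig = plt.figure()
# Subplot 1 (top-left)
ax1 = fig.add_subplot(221)
# Subplot 2 (top-right)
ax2 = fig.add_subplot(222)
# Subplot 3 (bottom-left)
ax3 = fig.add_subplot(223)
# Subplot 4 (bottom-right)
ax4 = fig.add_subplot(224)
plt.show()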
Explanation:
For multiple subplots, you can use subplot(nrows, ncols, index) to specify the grid
dimensions and the position of each subplot.
Plotly:
Seaborn:
Built on top of Matplotlib, Seaborn simplifies the creation of visually appealing statistical
graphics.
Provides high-level abstractions for creating complex plots with minimal code.
HoloViews:
Other Libraries:
Waffle charts are an excellent way to display proportions and show how a category
contributes to a total.
Waffle charts are a popular alternative to pie charts, representing data proportions as
colored cells in a grid, offering a more visually accessible and easier-to-read
representation, especially when dealing with multiple categories.
# Number of cells in the waffle chart (e.g., 10x10 grid = 100 cells)
total_cells = 100
Example 2
import pandas as pd
import matplotlib.pyplot as plt
from pywaffle import Waffle
# creation of a dataframe
data ={'Fruits': ['Apples', 'Banana', 'Mango', 'Strawberry', 'Orange'], 'stock': [20, 11, 18,
25, 8] }
df = pd.DataFrame(data)
#To plot the waffle Chart
fig = plt.figure(FigureClass = Waffle, columns=10, rows = 5, values = df.stock,
icons='face-smile',labels=list(df.Fruits))
plt.show()
Output:
Example 3
import pandas as pd
import matplotlib.pyplot as plt
from pywaffle import Waffle
# creation of a dataframe
data ={'Fruits': ['Apples', 'Banana', 'Mango', 'Strawberry', 'Orange'], 'stock': [20, 11, 18,
25, 8] }
df = pd.DataFrame(data)
#To plot the waffle Chart
fig = plt.figure(FigureClass = Waffle, columns=10, rows = 5, values = df.stock,
icons='cat',labels=list(df.Fruits))
plt.show()
Output:
Word Clouds
Word Clouds are a popular tool to visualize text data. They can be used to
display the most frequent terms or categories in a visually appealing way. Here, you can
visualize the most frequently sold products or regions, for example.
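Example:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# Sample text
text = ("Python is a great programming language for data analysis and machine learning. "
        "Python is also used for web development.")
# Width, height, and background color below are illustrative values
wc = WordCloud(width=800, height=400, background_color='white')
wc.generate(text)
plt.figure(figsize=(10, 6))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()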
Explanation:
1. Text Data: The variable text is a simple string that contains a few words. You can
replace this with any text data or dataset you have.
2. WordCloud Object: The WordCloud object is created with specified width,
height, and background color.
3. generate() Method: This method generates the word cloud from the input text.
4. plt.imshow(): This function from matplotlib is used to display the generated
word cloud.
Result:
Running this code will display a word cloud where words like "Python" appear larger
since they occur more frequently in the text.
Seaborn makes it easy to plot regression lines and assess relationships between
variables.
For example, you can use regression plots to visualize the relationship between Units
Sold and Sales Amount.
Output:
Explanation:
sns.regplot() fits a regression line and plots a scatter plot at the same time.
The scatter_kws argument allows customization of the scatter plot (color, size,
etc.).
The line_kws argument customizes the regression line style.
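The points above can be sketched as follows. The data reuses the Units Sold and Sales Amount values from the sample dataset, while the colors, marker size, and line width passed through scatter_kws and line_kws are illustrative choices.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Units Sold and Sales Amount values from the sample dataset
df = pd.DataFrame({
    "Units Sold": [10, 15, 20, 25, 30],
    "Sales Amount": [100, 150, 200, 250, 300],
})

# regplot draws the scatter points and fits a regression line in one call
ax = sns.regplot(data=df, x="Units Sold", y="Sales Amount",
                 scatter_kws={"color": "teal", "s": 60},     # scatter styling
                 line_kws={"color": "red", "linewidth": 2})  # line styling
ax.set_title("Units Sold vs Sales Amount")
plt.show()
```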
These advanced visualizations can help reveal insights from your data, especially
when exploring trends, distributions, and relationships.
Creating Maps:
To create maps and visualize geospatial data in Python, you can use various libraries
such as folium, geopandas, plotly, and cartopy.
Each has its own strengths, such as interactivity, customization, and ease of use for
geospatial operations.
1. Create Maps
2. Add Markers to Maps
3. Visualize Geospatial Data (such as points, polygons, etc.)
4. Example with Folium (Interactive Maps)
Using Folium to Create Interactive Maps and Add Markers
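import folium
# Create a map centered around a specific location (latitude and longitude)
# For example, New York City (latitude: 40.7128, longitude: -74.0060)
map_center = [40.7128, -74.0060] # New York City coordinates
mymap = folium.Map(location=map_center, zoom_start=12)
# Add a marker with a popup at the map center
folium.Marker(location=map_center, popup="New York City").add_to(mymap)
mymap # Displays the map in Jupyter or Colab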
1. Create a Map: The folium.Map() function is used to create a map. You specify the
center of the map using latitude and longitude (location=[40.7128, -74.0060] for
New York City).
2. Zoom Level: The zoom_start=12 argument sets the zoom level of the map when
it first loads.
3. Marker: A marker is added at the location of New York City, with a pop-up text
of "New York City".
4. Display the Map: In Jupyter or Google Colab, the mymap object will display the
interactive map.
Output:
The code will generate an interactive map centered on New York City, and when
you click on the marker, a popup will appear with the text "New York City."
Note: In Colab, the map will render directly in the notebook without needing plt.show().
If you're using Jupyter Notebook, just displaying the mymap object will show the map as
well.
Example: Creating an Interactive Map with Markers
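import folium
# Latitude and longitude for New York City, Los Angeles, Chicago, and London
nyc = [40.7128, -74.0060] # New York City
la = [34.0522, -118.2437] # Los Angeles
chicago = [41.8781, -87.6298] # Chicago
london = [51.5074, -0.1278] # London
# Center the map between the cities (average of latitudes and longitudes)
map_center = [ (nyc[0] + la[0] + chicago[0] + london[0]) / 4,
(nyc[1] + la[1] + chicago[1] + london[1]) / 4 ]
# Create the map with a zoom level that covers all locations
# (adjust zoom_start to fit all cities)
mymap = folium.Map(location=map_center, zoom_start=2)
# Add default markers with popups for the three US cities
for name, coords in [("New York City", nyc), ("Los Angeles", la), ("Chicago", chicago)]:
    folium.Marker(location=coords, popup=name).add_to(mymap)
# Add London with a custom green icon
folium.Marker(location=london, popup="London", icon=folium.Icon(color="green")).add_to(mymap)
# Save the map to an HTML file
mymap.save("city_map.html")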
After running the code, the map is saved as city_map.html. You can open this
HTML file in a browser to see the map with interactive features like zooming and
panning.
In this example:
Map center: We set the map's initial view to the average of the four cities'
latitudes and longitudes.
Markers: We added markers for New York, Los Angeles, and Chicago with
popups displaying the city names.
Custom Icons: A marker for London is added with a custom green icon.
You can also customize the map with different icons for the markers, which can
represent different types of locations (e.g., parks, restaurants, landmarks, etc.).
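import folium
# Create a map centered at New York City
mymap = folium.Map(location=[40.7128, -74.0060], zoom_start=12)
# Add markers with custom icons; the colors and icon names are illustrative choices
folium.Marker([40.7128, -74.0060], popup="New York City", icon=folium.Icon(color="blue", icon="info-sign")).add_to(mymap)
folium.Marker([34.0522, -118.2437], popup="Los Angeles", icon=folium.Icon(color="red", icon="cloud")).add_to(mymap)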
This will create a map with two markers—one for New York City and another for Los
Angeles—with custom icons and different colors.
Visualizing Geospatial Data using Folium in Python
One of the most important tasks for someone working on datasets with
countries, cities, etc. is to understand the relationships between their data’s
physical location and their geographical context. And one such way to
visualize the data is using Folium.
Using folium.Map(), we will create a base map and store it in an object. This
function takes location coordinates and zoom values as arguments.
Parameters:
location: list of location coordinates
tiles: default is OpenStreetMap. Other options: Stamen Terrain, Stamen
Toner, Mapbox Bright etc.
zoom_start: int
output:
In this example:
Heatmap: Points are used to create a heatmap visualization, where areas with a
higher concentration of points are highlighted.
Folium Plugins: HeatMap is used from folium.plugins to create the heatmap.
3. Using Geopandas for Geospatial Data Analysis
GeoPandas is a library for geospatial data analysis that builds on Pandas and allows
you to read, manipulate, and plot geospatial data formats like shapefiles, GeoJSON, and
more.
Install geopandas:
pip install geopandas
In this example:
Plotly can be used to create interactive visualizations for geospatial data. Plotly
provides different types of maps such as scatter_geo and choropleth maps.
Install plotly:
pip install plotly
Example: Creating an Interactive Map with Plotly
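import pandas as pd
import plotly.express as px
# Illustrative sample data; the population figures are rounded approximations
data = {"City": ["New York City", "Los Angeles", "Chicago"], "Lat": [40.7128, 34.0522, 41.8781], "Lon": [-74.0060, -118.2437, -87.6298], "Population": [8_800_000, 3_900_000, 2_700_000]}
# Convert to DataFrame
df = pd.DataFrame(data)
fig = px.scatter_geo(df, lat="Lat", lon="Lon", size="Population", hover_name="City", scope="usa")
fig.show()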
Output:
Plotly Express: This code creates an interactive map that visualizes cities in the
U.S. by their population.
Interactive Features: You can zoom in, zoom out, and hover over points to see
more information.
Cartopy is another library used to create static maps, especially for scientific
applications, and it can be paired with matplotlib for customization.
Install Cartopy:
pip install cartopy
Example: Creating a Simple Map with Cartopy
import cartopy.crs as ccrs
import cartopy.feature as cfeature
import matplotlib.pyplot as plt
# Create a plot with a specific projection
fig = plt.figure(figsize=(10, 6))
ax = fig.add_subplot(111, projection=ccrs.PlateCarree()) # PlateCarree projection
# Add coastlines and country borders
ax.coastlines()
ax.add_feature(cfeature.BORDERS)
# Add a title
plt.title('World Map using Cartopy')
# Display the map
plt.show()
Cartopy allows for advanced map projections and the addition of various map
features (e.g., coastlines, borders).
Key notes:
Choropleth maps are useful for visualizing the intensity of a variable across different
geographic regions. These maps use different colors to represent different data values,
and they are commonly used to visualize things like population density, election results,
or sales data across regions.
This example shows how to create a choropleth map for US states based on population
using Plotly.
# Convert to DataFrame
df = pd.DataFrame(data)
Explanation:
Output:
An interactive choropleth map will be displayed where each state is colored based on its
population. Hovering over each state shows the population of that state.
Folium can also be used to create choropleth maps using GeoJSON data. This method
allows for more complex boundaries, such as counties, countries, or custom regions.
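import folium
import pandas as pd
import requests
# Fetch GeoJSON data for US states (you can replace this with your own GeoJSON file)
url = 'https://raw.githubusercontent.com/codeforamerica/click_that_hood/master/public/data/us-states.geojson'
response = requests.get(url)
geojson_data = response.json()
# Sample data for state populations (You can replace this with real data)
state_data = {
'California': 39538223,
'Texas': 29145505,
'Florida': 21538187,
'New York': 20201249,
'Pennsylvania': 13002700,
'Illinois': 12671821,
'Ohio': 11689100,
'Georgia': 10519475,
'North Carolina': 10439388,
'Michigan': 9986857,
}
# Convert the dictionary to a DataFrame with State and Population columns
df = pd.DataFrame(list(state_data.items()), columns=['State', 'Population'])
# Create a base map centered on the continental US (center and zoom are illustrative)
m = folium.Map(location=[37.8, -96], zoom_start=4)
# Color each state by population; key_on matches the GeoJSON name field
folium.Choropleth(
    geo_data=geojson_data,
    data=df,
    columns=['State', 'Population'],
    key_on='feature.properties.name',
    fill_color='YlGn',
    fill_opacity=0.7,
    line_opacity=0.2,
    nan_fill_color='white',
    legend_name='Population',
).add_to(m)
m # Displays the map in Jupyter or Colab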
Explanation:
We use GeoJSON data for US states, and Folium creates a choropleth map using
the Choropleth class.
The state_data dictionary contains population data for each state. The
key_on='feature.properties.name' argument specifies that the GeoJSON file's
name field is used to match the population data.
The color is determined by the Population values, and you can customize the
color scale using the fill_color argument.
Output:
An interactive choropleth map will be displayed with US states colored based on their
population. You can zoom in, hover over each state, and see the population data.
You can also create choropleth maps for countries using GeoJSON files that include
country boundaries.
Example: Choropleth Map for World Countries (Using World Bank Data)
import folium
import requests
Output:
Plotly allows you to create highly interactive choropleth maps for global regions as well.
The data can be anything you want to visualize—population, GDP, etc.
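import pandas as pd
import plotly.express as px
# Example data: Population of countries (You can replace this with real data)
data = {
'Country': ['United States', 'China', 'India', 'Indonesia', 'Pakistan', 'Brazil', 'Nigeria',
'Bangladesh', 'Russia', 'Mexico'],
'Population': [331002651, 1439323776, 1380004385, 273523615, 220892340,
212559417, 206139589, 164689383, 145912025, 128933395],
}
# Convert to DataFrame
df = pd.DataFrame(data)
# Color each country by population; countries are matched by name
fig = px.choropleth(df, locations='Country', locationmode='country names',
color='Population', color_continuous_scale='Viridis', title='Population by Country')
# Show map
fig.show()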
Explanation:
Choropleth maps are an excellent way to visualize geographic data in Python, and you
can create them using different libraries such as Plotly and Folium.
This section covers Waffle Charts and Word Clouds with simple use cases, including
multiple variations for each. These are great for visualizing categorical data and
text data, respectively.
1. Waffle Chart
A Waffle Chart is typically drawn as a 10x10 grid (100 cells), where each cell represents
one percent of the total and is colored by category. It's a great alternative to pie charts
when you want a more visual, grid-based display of proportions.
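import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
# Hypothetical percentage of sales per region (must sum to 100)
percentages = [40, 30, 20, 10]
waffle_data = np.zeros((10, 10)) # 10x10 grid = 100 cells
# Fill the grid with the regions based on the percentage of sales
start = 0
colors = plt.cm.Paired.colors # Using color palette for regions
for i, percentage in enumerate(percentages):
    end = int(start + percentage)
    waffle_data.ravel()[start:end] = i + 1 # Fill cells for the region
    start = end
# Show the grid with one palette color per region
plt.imshow(waffle_data, cmap=ListedColormap(colors[:len(percentages)]))
plt.show()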
Explanation:
Example 2: Waffle Chart with Multiple Categories (Sales, Marketing, and Development)
import matplotlib.pyplot as plt
import numpy as np
for i in range(10):
for j in range(10):
ax.add_patch(plt.Rectangle((j, 9-i), 1, 1, color=colors[int(waffle_data[i, j]) - 1]))
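The fragment above omits its setup, so here is a self-contained sketch of the same rectangle-per-cell approach. The department shares are hypothetical values that sum to 100, and the legend placement is an illustrative choice.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical department shares in percent; they must sum to 100
categories = {'Sales': 50, 'Marketing': 30, 'Development': 20}

# Give each of the 100 cells a category number (1, 2, 3), then shape as 10x10
cells = np.repeat(np.arange(1, len(categories) + 1), list(categories.values()))
waffle_data = cells.reshape(10, 10)

colors = plt.cm.Paired.colors
fig, ax = plt.subplots(figsize=(6, 6))

# Draw one unit square per cell, colored by its category number
for i in range(10):
    for j in range(10):
        ax.add_patch(plt.Rectangle((j, 9 - i), 1, 1,
                                   facecolor=colors[int(waffle_data[i, j]) - 1],
                                   edgecolor='white'))

ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
ax.set_aspect('equal')
ax.axis('off')

# Legend entries built from standalone rectangles (not added to the axes)
handles = [plt.Rectangle((0, 0), 1, 1, color=colors[k]) for k in range(len(categories))]
ax.legend(handles, list(categories), loc='upper center',
          bbox_to_anchor=(0.5, -0.02), ncol=len(categories))
plt.show()
```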
Explanation:
2. Word Cloud
A Word Cloud is a visual representation of text data, where the size of each word
indicates its frequency or importance in the dataset.
Explanation:
In this example, a word cloud is generated from a simple text snippet, with the
size of each word indicating its frequency.
Example 2: Word Cloud with Custom Settings (Frequent Words from a Review Dataset)
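from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
# Sample text data from reviews
reviews = """
Great product! Fast delivery and excellent customer service. I will definitely buy again.
The quality of the product was better than expected. Fast shipping and great customer
support.
I am very happy with my purchase. Excellent quality and fast delivery. Highly
recommend.
"""
# Remove common stopwords; black background with the 'Blues' colormap
wc = WordCloud(width=800, height=400, background_color='black', colormap='Blues', stopwords=STOPWORDS).generate(reviews)
plt.figure(figsize=(10, 6))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()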
Explanation:
This word cloud is created from a set of product reviews, with common
stopwords removed to highlight more meaningful words. The background color
is set to black, and the colormap is set to 'Blues' for a cooler tone.
Example 3: Word Cloud with Image Mask
A word cloud can also be shaped according to a custom image (e.g., a logo or any other
image shape).
Explanation:
In this case, the word cloud is shaped according to a custom image (e.g., a star).
You would need an image file for the mask, and the text is visualized in the shape
of that image.