CH 6
Visual Analytics
● Visual analytics combines analytical reasoning with interactive visualizations
to enhance data comprehension.
● Allows humans to leverage computational power while interacting with visuals
to uncover insights.
● Integrates statistical and computational techniques with visualization tools.
Statistical methods are foundational for summarizing, analyzing, and
interpreting data. Visualization tools leverage these techniques to present
insights visually.
Cont.
Examples include:
● Descriptive Statistics:
○ Summarize data using measures like mean, median, variance, and standard deviation.
○ Example: Box plots visualize central tendency and variability.
● Inferential Statistics:
○ Make predictions or test hypotheses.
○ Example: Confidence intervals can be drawn as error bars on bar charts or as bands around a fitted line in scatter plots, with p-values added as annotations.
● Distribution Analysis:
○ Understand how data points are spread.
○ Example: Histograms or density plots show data distributions.
● Correlation and Regression:
○ Examine relationships between variables.
○ Example: Scatter plots with trend lines represent correlations, with regression equations displayed.
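As a minimal sketch of the descriptive and correlation/regression items above, the following uses NumPy on a small made-up dataset (the `x` and `y` values are illustrative assumptions, not data from this chapter). It computes the summary measures, the Pearson correlation, and the least-squares trend line that a scatter plot would display.

```python
import numpy as np

# Hypothetical paired observations (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Descriptive statistics for y
mean_y = np.mean(y)
median_y = np.median(y)
var_y = np.var(y, ddof=1)   # sample variance
std_y = np.std(y, ddof=1)   # sample standard deviation

# Pearson correlation coefficient between x and y
r = np.corrcoef(x, y)[0, 1]

# Least-squares regression line: y = slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)

print(f"mean={mean_y:.2f}, median={median_y:.2f}, var={var_y:.2f}, std={std_y:.2f}")
print(f"r={r:.3f}, trend line: y = {slope:.2f}x + {intercept:.2f}")
```

The slope and intercept are the values a visualization tool would typically show next to the trend line.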
Cont.
Here are some of the key methods in visual analytics.
1. Interactive Visualization
● Purpose: Allows users to manipulate and interact with data visuals in real
time.
● Techniques:
○ Zooming and Panning: Explore large datasets by focusing on specific parts.
○ Filtering: Narrow down data subsets (e.g., selecting a date range in a time-series graph).
○ Highlighting: Emphasize specific data points or trends.
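Interactive tools perform these operations in response to user input. The sketch below shows only the underlying data step for filtering and highlighting, using pandas on a hypothetical daily sales table (the column names and values are assumptions made for illustration).

```python
import pandas as pd

# Hypothetical daily sales series (illustrative values)
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=10, freq="D"),
    "sales": [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
})

# Filtering: keep only the rows inside the selected date range
start, end = pd.Timestamp("2024-01-03"), pd.Timestamp("2024-01-06")
subset = df[(df["date"] >= start) & (df["date"] <= end)]

# Highlighting: pick the point a chart would emphasize (here, the peak)
peak_date = subset.loc[subset["sales"].idxmax(), "date"]

print(subset)
print("Highlight:", peak_date.date())
```

In an interactive chart, `start` and `end` would come from a slider or a brushed selection rather than being fixed in code.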
Cont.
2. Dimensionality Reduction
● Simplifies high-dimensional data while retaining meaningful patterns.
● Techniques:
○ Principal Component Analysis (PCA): Reduces dimensions linearly by transforming features
into orthogonal components.
○ t-SNE (t-Distributed Stochastic Neighbor Embedding): Projects high-dimensional data nonlinearly into two
or three dimensions, preserving local neighborhoods, for cluster visualization.
○ Multidimensional Scaling (MDS): Visualizes similarities or dissimilarities in data.
● Applications: Visualizing clusters in customer segmentation or gene
expression data.
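A minimal sketch of PCA, assuming a synthetic dataset generated with NumPy (the data, seed, and variable names are illustrative). It centers the data and uses the singular value decomposition to obtain orthogonal components, then projects onto the first two so the result could be drawn as a 2D scatter plot.

```python
import numpy as np

# Synthetic data: 200 samples, 5 features, with feature 1 correlated to feature 0
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = 2 * X[:, 0] + 0.1 * rng.normal(size=200)

# Center each feature, as PCA requires
Xc = X - X.mean(axis=0)

# Rows of Vt are the principal directions, ordered by explained variance
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:2]

# Project the data onto the first two components
projected = Xc @ components.T

# Fraction of total variance captured by each component
explained = S**2 / np.sum(S**2)

print("Projected shape:", projected.shape)
print("Explained variance ratio (first two):", explained[:2])
```

Libraries such as scikit-learn wrap the same idea in a higher-level interface; the SVD version is shown here only to make the linear transformation explicit.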
Cont.
3. Anomaly Detection
seaborn:
pandas:
● Primarily a data manipulation tool, pandas integrates basic plotting capabilities for quick visual exploration.
● Example: Using line plots or bar charts to explore trends in data stored in DataFrames.
bokeh:
AI-Driven Insights:
● The platform leverages machine learning to identify hidden patterns, correlations, and
trends in data.
● Suggests relevant visualizations based on the nature of the data and the analysis goals.
Cont.
Interactive Dashboards:
Ease of Use:
● Python tools allow for in-depth control and customization, making them ideal for developers and data
scientists working on complex projects.
● IBM Watson Analytics streamlines the workflow by automating data preparation and insight
generation, catering to non-technical users.
● Python libraries like plotly and bokeh offer advanced interactivity and can handle large datasets
efficiently. Watson Analytics, while user-friendly, may have limitations based on subscription plans.
Construction of Common Data Visualizations
● Building common data visualizations involves understanding the underlying
data, choosing the appropriate visualization type, and employing tools or
frameworks to implement them effectively. Whether it's a simple bar chart to
compare categories or an intricate network diagram to map relationships, the
construction of these visualizations requires a blend of statistical knowledge,
computational techniques, and design principles.
● This section explores the step-by-step process of creating widely used
visualizations, such as bar charts, line graphs, pie charts, heatmaps, and
more. Each visualization type is dissected to understand its purpose,
construction process, and how it aids in simplifying complex data analysis.
Box Plots
● A box plot, also known as a box-and-whisker plot, is a statistical visualization
tool used to summarize the distribution of a dataset in a simple, compact
format.
● It is particularly effective in identifying the central tendency, variability, and
presence of outliers in the data. Box plots are widely used in exploratory data
analysis (EDA) due to their ability to convey key insights at a glance.
Cont.
● Median (Central Line in the Box)
○ The median represents the central value of the dataset when arranged in ascending order. It divides
the data into two equal halves.
○ If the data is symmetrical, the median will be near the center of the box. If the data is skewed, the
median sits off-center, closer to the quartile on the side opposite the longer tail.
○ Used to identify the central tendency of the dataset and to locate its "middle point" without being
affected by extreme values.
● Quartiles (Box Boundaries)
○ Quartiles divide the dataset into four equal parts.
○ Q1 (Lower Quartile): The median of the lower half of the data (25th percentile).
○ Q3 (Upper Quartile): The median of the upper half of the data (75th percentile).
○ Behavior:
○ The box spans from Q1 to Q3, encapsulating the interquartile range (IQR).
○ IQR = Q3 - Q1, representing the middle 50% of the data.
○ Used to highlight the spread of the central bulk of the data. A box that is unevenly split by the
median suggests skewness.
Cont.
● Whiskers (Lines Extending from the Box)
○ Whiskers extend from the quartiles to the smallest and largest data points within a specific
range.
○ Typically, each whisker ends at the most extreme data point that is no lower than Q1 - 1.5 × IQR and no higher than Q3 + 1.5 × IQR.
○ Whiskers show the range of the data excluding outliers.
○ Short whiskers indicate low variability, while long whiskers suggest high variability.
○ Whiskers provide a quick view of the range of the dataset.
○ They help show how spread out the data values are beyond the central 50%.
● Outliers (Points Beyond Whiskers)
○ Outliers are data points that fall outside the whiskers, typically below Q1 - 1.5 × IQR or above
Q3 + 1.5 × IQR.
○ Represented as individual points outside the whiskers.
○ They may indicate rare or extreme values in the dataset.
○ Important for identifying anomalies or unusual trends in the data.
○ Useful in fields like finance or manufacturing where outliers may signal errors or rare events.
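The components described above can be computed directly. The following is a minimal sketch using NumPy on a hypothetical dataset with one extreme value (the data is an assumption for illustration). Note that `np.percentile` interpolates linearly by default, so its quartiles can differ slightly from the "median of each half" definition.

```python
import numpy as np

# Hypothetical data with one extreme value
data = np.array([7, 8, 8, 10, 15, 18, 21, 25, 30, 30, 35, 40, 42, 45, 120])

# Median and quartiles (box center and box boundaries)
median = np.median(data)
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Limits at 1.5 * IQR beyond the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme points inside the limits
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the limits is drawn as an individual outlier point
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(f"median={median}, Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}, outliers: {outliers.tolist()}")
```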
Cont.
● Box plots are ideal for comparing the distributions of multiple datasets side by
side.
● They clearly highlight outliers, which may distort statistical analyses.
● Box plots provide a concise summary of variability in data.
● They allow analysts to assess symmetry, skewness, and spread at a glance
without delving into detailed numerical summaries.
Cont.
Python Implementation:
import matplotlib.pyplot as plt
import seaborn as sns
# Sample Data
data = [7, 8, 8, 10, 15, 18, 21, 25, 30, 30, 35, 40, 42, 45, 50]
# Matplotlib Box Plot
plt.boxplot(data, vert=False)
plt.title("Box Plot Example")
plt.xlabel("Values")
plt.show()
# Seaborn Box Plot
sns.boxplot(x=data)
plt.title("Box Plot Example with Seaborn")
plt.xlabel("Values")
plt.show()
Histograms
● A histogram is a graphical representation of the distribution of numerical data.
It organizes data into intervals, called bins, and counts how many data points
fall into each bin. Histograms are a foundational tool in data visualization for
understanding the shape, spread, and central tendency of a dataset.
● Bins (Intervals of Data)
○ Bins are continuous intervals into which the data range is divided.
○ The width of each bin determines the level of granularity in the histogram.
○ Narrow bins provide more detailed insights but may appear noisy.
○ Wide bins smooth out variations but may obscure finer details.
○ Purpose: Group data into meaningful intervals for easier interpretation.
■ Example: In a dataset of ages, bins could represent intervals like 0–10, 10–20, and so
on, with each boundary value assigned to exactly one bin.
Cont.
● Frequencies (Height of Bars)
○ The height of each bar corresponds to the number of data points (frequency) within that bin.
○ Bars with higher frequencies indicate intervals where data points are concentrated.
○ The sum of all frequencies equals the total number of data points.
○ Purpose: Visualize how often data points occur within each bin.
■ Example: In a histogram of exam scores, a taller bar at 80–90 indicates many students
scored in that range.
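Binning and counting can be sketched without drawing anything. The example below uses `np.histogram` on a hypothetical list of exam scores (both the scores and the bin edges are assumptions for illustration). Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.

```python
import numpy as np

# Hypothetical exam scores
scores = np.array([55, 62, 67, 71, 74, 78, 81, 83, 85, 88, 89, 92, 95])

# Bin edges define the intervals 50-60, 60-70, ..., 90-100
bin_edges = [50, 60, 70, 80, 90, 100]

# counts[i] is the number of scores that fall into bin i
counts, edges = np.histogram(scores, bins=bin_edges)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left}-{right}: {count}")

# The frequencies add up to the total number of data points
print("Total:", counts.sum())
```

The bar heights in the plotted histogram are exactly these counts.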
Cont.
● Histograms reveal the overall shape of data, such as symmetry, skewness,
and modality (unimodal, bimodal, etc.).
● Areas with tall bars indicate where data points are densely packed.
● Unusually short or isolated bars may indicate outliers or rare occurrences.
● Multiple histograms can be used to compare distributions across categories.
Cont.
Python Implementation:
import matplotlib.pyplot as plt
import numpy as np
# Sample Data
data = np.random.normal(loc=50, scale=10, size=500) # Normally distributed data
# Creating a Histogram
plt.hist(data, bins=20, color='blue', edgecolor='black', alpha=0.7)
# Adding Titles and Labels
plt.title("Histogram Example")
plt.xlabel("Value Range")
plt.ylabel("Frequency")
# Display the Plot
plt.show()
Heat Maps
● A heat map is a data visualization technique that uses a color gradient to
represent the intensity or magnitude of data values. Each color corresponds
to a specific range of values, allowing viewers to quickly identify patterns,
trends, and relationships.
● Definition: A grid-like visualization in which each cell is color-coded according to the value of
the data it represents.
● Purpose: Display relationships, trends, and densities in large datasets in an intuitive
and visually appealing way.
Cont.
● Data Matrix
○ Represents the underlying data in rows and columns.
○ Each cell in the matrix corresponds to a specific data point or relationship.
○ Example: In a correlation matrix, rows and columns represent variables, and cells show their
correlation values.
● Color Gradient
○ Uses a continuous range of colors to depict data intensity.
○ Higher values are often represented by warmer colors (e.g., red, orange).
○ Lower values are shown using cooler colors (e.g., blue, green).
○ Purpose: Make high, medium, and low values easy to tell apart visually.
● Annotations (Optional)
○ Text or numbers within each cell to indicate exact data values.
○ Purpose: Provide precise values where color alone is not enough, which improves interpretability.
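How a data matrix becomes a grid of colors can be sketched with Matplotlib's normalization and colormap objects. This is a minimal illustration, assuming a small made-up 2 × 2 matrix and the `coolwarm` colormap: values are first scaled to the range 0 to 1, then looked up in the gradient.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize

# Hypothetical data matrix (illustrative values)
values = np.array([[0.1, 0.5],
                   [0.9, 0.3]])

# Scale the data so the smallest value maps to 0 and the largest to 1
norm = Normalize(vmin=values.min(), vmax=values.max())

# Look up each scaled value in the color gradient; the result holds one RGBA color per cell
cmap = plt.get_cmap("coolwarm")
rgba = cmap(norm(values))

low_color = rgba[0, 0]    # cell holding 0.1, the lowest value
high_color = rgba[1, 0]   # cell holding 0.9, the highest value

print("Lowest value color (RGBA):", low_color)
print("Highest value color (RGBA):", high_color)
```

With this colormap the lowest value comes out blue-dominant and the highest red-dominant, matching the cool-to-warm convention described above.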
Cont.
Python Implementation
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
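# Sample Data: Correlation matrix of 5 random variables (100 observations each)
samples = np.random.rand(100, 5)
data = np.corrcoef(samples, rowvar=False) # 5 x 5 matrix with values in [-1, 1]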
labels = ['A', 'B', 'C', 'D', 'E']
# Create Heat Map
plt.figure(figsize=(8, 6))
sns.heatmap(data, annot=True, cmap='coolwarm', xticklabels=labels, yticklabels=labels)
# Add Titles and Labels
plt.title("Heat Map Example")
plt.xlabel("Variables")
plt.ylabel("Variables")
# Show Plot
plt.show()
Charts (Bar, Line, Pie)
● Charts are fundamental tools in data visualization, representing data
graphically to make insights, patterns, and relationships more
understandable. Each chart type serves specific purposes depending on the
nature of the data and the insights needed.
● Bar Charts: Represent categorical data through rectangular bars, where the
length of the bar corresponds to the value of the data.
● Line Charts: Depict trends over time or continuous data through connected
data points with lines.
● Pie Charts: Visualize proportions or percentages of a whole, displayed as
slices of a circular pie.
Cont.
Python Implementation (bar chart)
import matplotlib.pyplot as plt
# Sample Data
categories = ['A', 'B', 'C', 'D']
values = [10, 15, 7, 20]
# Create Bar Chart
plt.bar(categories, values, color='skyblue')
plt.title("Bar Chart Example")
plt.xlabel("Categories")
plt.ylabel("Values")
plt.show()
Cont.
Python Implementation (line chart)
import matplotlib.pyplot as plt
# Sample Data
time = [1, 2, 3, 4, 5]
values = [2, 4, 8, 16, 32]
# Create Line Chart
plt.plot(time, values, marker='o', color='green')
plt.title("Line Chart Example")
plt.xlabel("Time")
plt.ylabel("Values")
plt.show()
Cont.
Python Implementation (pie chart)
# Sample Data
plt.show()
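A pie chart can be built in the same way as the bar and line charts. The following is a minimal sketch, assuming the hypothetical categories and values from the bar chart example are reused; the percentage calculation is included to show what each slice represents.

```python
import matplotlib.pyplot as plt

# Sample Data (assumed: same hypothetical categories as the bar chart example)
categories = ['A', 'B', 'C', 'D']
values = [10, 15, 7, 20]

# Each slice's share of the whole, in percent
percentages = [100 * v / sum(values) for v in values]

# Create Pie Chart
fig, ax = plt.subplots()
wedges, texts, autotexts = ax.pie(values, labels=categories, autopct='%1.1f%%', startangle=90)
ax.set_title("Pie Chart Example")
ax.axis('equal')  # Keep the pie circular
plt.show()
```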
Cont.
● Bar Charts:
○ Highlight comparisons and distributions across categories.
○ Simple to interpret for both small and large datasets.
● Line Charts:
○ Well suited to showing trends over time and projecting them forward.
○ Reveal patterns like seasonality and rate of change.
● Pie Charts:
○ Intuitive for understanding proportions.
○ Useful for emphasizing a category’s dominance in a dataset.
Tree Maps
● A tree map is a visualization technique used to display hierarchical data using
nested rectangles. Each rectangle's size and color represent specific attributes of
the data, such as quantity, proportion, or category.
● Hierarchy Representation: Tree maps break down data into parent-child
relationships.
● Rectangles: The size of each rectangle is proportional to a value, providing a visual
summary of the data's structure and proportions.
● Compact Representation:
○ Tree maps make it easier to represent large, hierarchical datasets in a single view.
○ Ideal for comparing proportions across multiple categories and subcategories.
● Quick Insights:
○ Provides an at-a-glance understanding of dominant and minor components in the dataset.
● Space Efficiency:
○ Makes efficient use of limited screen space to display hierarchical data.
Cont.
● Hierarchical Display:
○ Represents data with multiple levels of hierarchy.
○ Parent categories are divided into child rectangles.
● Proportional Areas:
○ The area of each rectangle corresponds to the value it represents, helping to compare relative
sizes visually.
● Color Coding:
○ Different colors or gradients can represent additional dimensions, such as performance or
growth.
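The proportional-area idea can be sketched with a simple "slice" layout that cuts one rectangle into vertical strips. This is a deliberately simpler technique than the squarified layout that tree map libraries commonly use, and the function name `slice_layout` and the values are hypothetical, chosen for illustration.

```python
def slice_layout(values, x, y, width, height):
    """Split a rectangle into vertical strips whose areas are proportional to values."""
    total = sum(values)
    rects = []
    offset = x
    for value in values:
        strip_width = width * value / total
        rects.append({"x": offset, "y": y, "dx": strip_width, "dy": height})
        offset += strip_width
    return rects

# Hypothetical category totals (illustrative values)
values = [300, 200, 150, 100, 50]

# Lay the values out inside a 100 x 100 canvas
rects = slice_layout(values, 0, 0, 100, 100)
areas = [r["dx"] * r["dy"] for r in rects]

for value, area in zip(values, areas):
    print(f"value={value}, area={area:.0f}")
```

A hierarchical tree map applies the same step recursively: each parent rectangle becomes the canvas for its children.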
Cont.
Python Implementation
import squarify
import matplotlib.pyplot as plt
# Sample Data
categories = ['Electronics', 'Clothing', 'Home & Kitchen', 'Books', 'Sports']
values = [300, 200, 150, 100, 50]
# Create Tree Map (squarify.plot scales the values to the plot area itself)
colors = ['skyblue', 'lightgreen', 'gold', 'salmon', 'purple']
squarify.plot(sizes=values, label=categories, color=colors, alpha=0.8)
# Add Title
plt.title("Tree Map Example: Sales Performance")
plt.axis('off') # Hide axes
plt.show()
Word Cloud and Network Diagrams
Word Cloud
● A word cloud is a visual representation of text data where the size of each word
reflects its frequency or importance in the dataset. Larger words occur more
frequently or have greater significance, while smaller words occur less frequently.
● Used for:
○ Sentiment Analysis: Analyze customer feedback to identify common themes or emotions.
■ How It Works: Frequently mentioned words like "excellent" or "poor" can provide insights into
customer sentiment.
○ Keyword Extraction: Extract key terms from large documents, such as research papers or reports.
■ How It Works: Important keywords like "machine learning" or "visualization" stand out due to
their size.
○ Content Analysis: Summarize the main topics in speeches, articles, or social media posts.
■ How It Works: Highlights the most discussed topics in the text.
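The step behind every word cloud is a frequency count. The sketch below counts words with the standard library and maps each count to a font size by linear scaling. The scaling rule and the size limits are assumptions for illustration, not how any particular word cloud library sizes its words.

```python
from collections import Counter

# Sample text data
text = "data science visualization analytics machine learning AI data analytics"

# Count how often each word occurs (case-insensitive)
counts = Counter(text.lower().split())

# Map counts to font sizes: the most frequent word gets max_size (assumed linear rule)
min_size, max_size = 10, 40
top = max(counts.values())
font_sizes = {
    word: min_size + (max_size - min_size) * count / top
    for word, count in counts.items()
}

for word, count in counts.most_common():
    print(f"{word}: count={count}, font size={font_sizes[word]:.0f}")
```

Real text usually also needs stop-word removal (dropping words such as "the" and "and") before counting, so that common function words do not dominate the cloud.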
Cont.
Python Implementation
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# Sample text data
text = "data science visualization analytics machine learning AI data analytics"
# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)
# Display the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off') # Hides axes
plt.title("Word Cloud Example")
plt.show()
Cont.
Network Diagrams
● A network diagram visually represents relationships or connections between entities
(nodes) using lines (edges). Nodes represent entities (e.g., people, places), and
edges represent relationships (e.g., friendships, transactions).
● Used for:
○ Social Network Analysis: Analyze relationships within a social network, such as identifying influencers.
■ How It Works: Nodes represent individuals, and edges represent friendships or interactions.
○ Supply Chain Mapping: Visualize the flow of goods and relationships between suppliers,
manufacturers, and retailers.
■ How It Works: Nodes represent entities in the supply chain, and edges represent transactions
or dependencies.
○ Communication Networks: Map email exchanges or phone call networks.
■ How It Works: Nodes represent individuals or devices, and edges represent communications.
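One simple way to find an influential node is to count its connections (its degree). The following is a minimal sketch using only the standard library, with a hypothetical friendship list; degree is just one of several centrality measures that could be used.

```python
from collections import defaultdict

# Hypothetical friendships: each pair is an undirected edge
edges = [("Ana", "Ben"), ("Ana", "Cy"), ("Ana", "Di"), ("Ben", "Cy")]

# Build an adjacency structure: node -> set of neighboring nodes
neighbors = defaultdict(set)
for a, b in edges:
    neighbors[a].add(b)
    neighbors[b].add(a)

# Degree = number of direct connections per node
degree = {node: len(adjacent) for node, adjacent in neighbors.items()}

# The node with the most connections is a candidate "influencer"
most_connected = max(degree, key=degree.get)

print(degree)
print("Most connected:", most_connected)
```

In a drawn network, the degree could drive node size or color so that well-connected nodes stand out.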
Cont.
Python Implementation
import networkx as nx
import matplotlib.pyplot as plt
# Create a graph
G = nx.Graph()
# Add nodes
G.add_nodes_from(['A', 'B', 'C', 'D'])
# Add edges (relationships between nodes)
G.add_edges_from([('A', 'B'), ('B', 'C'), ('C', 'D'), ('D', 'A'), ('A', 'C')])
# Draw the network diagram
plt.figure(figsize=(8, 6))
nx.draw(G, with_labels=True, node_color='skyblue', node_size=2000, font_size=15, font_color='black', edge_color='gray')
plt.title("Network Diagram Example")
plt.show()