Data Exploration and Visualization Unit 3
Overview
Data visualization is a crucial process in data analysis, enabling us to present data in a way that is
accessible and interpretable. It allows analysts to identify trends, patterns, and outliers,
facilitating decision-making. This chapter covers the fundamental concepts of data visualization,
including its stages, methods for processing and mapping data, and various visualization
techniques.
Visualizing data involves a structured approach that can be broken down into seven essential
stages:
Description: This initial stage is crucial for determining what you aim to achieve with
the visualization. Clear objectives help to focus the analysis.
Key Questions:
o What question am I trying to answer?
o Who is the audience for the visualization?
o What decisions will this visualization support?
Example: A marketing team wants to visualize customer demographics to tailor their
advertising strategies. Their objective is to identify which age groups are purchasing
specific products.
Description: This stage is where you create visual representations of your data. The
choice of visualization depends on the data type and the insights you want to convey.
Common Visualizations:
o Bar Charts: Useful for comparing categories.
o Line Charts: Ideal for showing trends over time.
o Heatmaps: Useful for showing data density or intensity across geographical locations or matrix cells.
Example: Creating a bar chart to compare sales performance across different product
lines.
Description: After visualizing the data, the next step is to interpret the results and
prepare to communicate them effectively. This often involves creating reports or
presentations.
Key Aspects:
o Highlighting Key Findings: Focus on the most significant insights.
o Storytelling: Use narrative techniques to guide the audience through the data.
o Visual Design: Ensure that visualizations are clear, accessible, and appealing.
Example: Presenting a dashboard that includes key metrics and visualizations, explaining
trends in sales and marketing effectiveness.
2. Getting Started with Processing
Data processing is a critical preliminary step that involves organizing, cleaning, and preparing
data for visualization.
Python:
o Libraries: Pandas for data manipulation, NumPy for numerical data operations.
o Example:
import pandas as pd
df = pd.read_csv('sales_data.csv')
R:
o Libraries: dplyr for data manipulation, tidyr for tidying data.
o Example:
library(dplyr)
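As a minimal sketch of the organizing and cleaning steps described above, the following Python snippet uses a small in-memory table rather than reading sales_data.csv. The column names (region, units), the values, and the helper clean_sales are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical raw data containing a duplicate row and a missing value.
raw = pd.DataFrame({
    "region": ["North", "South", "South", "East"],
    "units": [10, 5, 5, None],
})

def clean_sales(df):
    """Drop exact duplicate rows, then rows with a missing unit count."""
    out = df.drop_duplicates().dropna(subset=["units"])
    # Units are counts, so store them as integers once the gaps are gone.
    return out.astype({"units": int}).reset_index(drop=True)

cleaned = clean_sales(raw)
# Aggregate to one value per category, ready for a bar chart.
totals = cleaned.groupby("region")["units"].sum()
```

Whether dropping incomplete rows is appropriate depends on the dataset; filling missing values is a common alternative.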
3. Mapping
Mapping techniques help visualize spatial relationships and distributions within the data.
Choropleth Maps:
o Description: These maps use colors to represent data values in specific
geographical regions.
o Example: A choropleth map showing unemployment rates across different states,
where darker shades indicate higher unemployment.
Heatmaps:
o Description: Heatmaps indicate density or intensity of data points over a
geographical area.
o Example: A heatmap showing areas of high customer engagement in a city based
on social media activity.
Dot Maps:
o Description: Represent individual data points as dots on a map, providing a
visual indication of data distribution.
o Example: A dot map displaying the location of all customers within a region.
3.2 Creating Maps
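import folium
# Create a map centered on example coordinates (latitude, longitude)
m = folium.Map(location=[40.7128, -74.0060], zoom_start=10)
m.save('map.html')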
4. Data Exploration and Visualization - Detailed Notes
Time series data consists of observations made sequentially over time. Examples include stock
prices, temperature readings, and daily sales data.
Key Concepts:
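Moving Average: $MA_t = \frac{1}{n} \sum_{i=0}^{n-1} x_{t-i}$
Where $MA_t$ is the moving average at time $t$, $n$ is the window size, and $x_{t-i}$ is the observation at lag $i$.
Exponential Smoothing: $S_t = \alpha x_t + (1 - \alpha) S_{t-1}$
Where $S_t$ is the smoothed value, $\alpha$ is the smoothing factor, and $x_t$ is the current
observation.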
Example: Suppose we have monthly sales data: January: 100, February: 120, March:
130. Using a 3-month moving average, the value for March is:
$MA_{\text{March}} = \frac{100 + 120 + 130}{3} \approx 116.67$
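The two smoothing methods can be sketched in plain Python as follows. This is an illustrative sketch, not the notes' own code: the function names are hypothetical, the smoothing factor 0.5 is an arbitrary choice, and the first smoothed value is assumed to equal the first observation.

```python
def moving_average(values, window):
    """Return the simple moving average for each full window."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

def exponential_smoothing(values, alpha):
    """Return S_t = alpha * x_t + (1 - alpha) * S_{t-1}, assuming S_0 = x_0."""
    if not values:
        return []
    smoothed = [values[0]]
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

sales = [100, 120, 130]  # January to March, from the example above
ma = moving_average(sales, 3)
es = exponential_smoothing(sales, 0.5)  # alpha = 0.5 is an assumption
```

With only three observations and a window of three, the moving average yields a single value, the mean of all three months.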
2. Connections and Correlations
Covariance:
Covariance measures how two variables move together. If the covariance is positive, both
variables tend to increase together; if it's negative, one increases while the other decreases.
Key Concepts:
Where:
X and Y = variables
𝑋̅ and 𝑌̅ = means of X and Y
n = number of data points
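The formula these symbols belong to does not appear in the notes as extracted. A standard form consistent with the definitions is the sample covariance, given here as an assumption (the population version divides by $n$ instead of $n - 1$):

```latex
\operatorname{Cov}(X, Y) = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})
```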
Example:
Where:
Example:
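From the earlier covariance example, if $\sigma_X = 14.1$ and $\sigma_Y = 12.5$, the correlation
would be the covariance divided by the product of the standard deviations:
$r = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{\operatorname{Cov}(X, Y)}{14.1 \times 12.5} = \frac{\operatorname{Cov}(X, Y)}{176.25}$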
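The covariance value from the earlier example is not reproduced in these notes, so the numeric result cannot be completed here. As a sketch of the calculation, assuming the Pearson correlation (covariance divided by the product of the two standard deviations) and the sample covariance with $n - 1$ in the denominator, the following Python code uses made-up height and weight values:

```python
from math import sqrt

def covariance(xs, ys):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: Cov(X, Y) / (sigma_X * sigma_Y)."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical paired measurements, for illustration only.
heights = [150, 160, 170, 180]
weights = [50, 56, 61, 70]
cov_hw = covariance(heights, weights)
r = correlation(heights, weights)
```

The correlation always lies between -1 and 1, which makes it easier to compare across datasets than the covariance, whose size depends on the units of the variables.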
$y = mx + c$
Where $m$ is the slope of the trendline and $c$ is the y-intercept.
Calculation of Slope:
Example:
Let’s use the height and weight data again. After plotting the points, we can calculate the
trendline:
The trendline has a slope of 0.36, indicating that for each unit increase in height, weight
increases by 0.36 units on average.
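The slope formula and the height and weight data are missing from the notes as extracted. A common way to obtain a trendline is an ordinary least-squares fit, where the slope is the sum of the products of deviations divided by the sum of squared deviations of x, and the intercept is the mean of y minus the slope times the mean of x. The sketch below assumes that method and uses hypothetical data, so its slope differs from the 0.36 quoted above.

```python
def fit_trendline(xs, ys):
    """Least-squares fit of y = m*x + c; returns the pair (m, c)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx               # slope
    c = mean_y - m * mean_x     # intercept
    return m, c

# Hypothetical measurements, for illustration only.
heights = [150, 160, 170, 180]
weights = [50, 56, 61, 70]
m, c = fit_trendline(heights, weights)
```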
Trees:
A tree is a hierarchical structure where each node has a parent (except the root) and may have
children. It’s used in many areas of computer science, including data structures (binary trees),
decision making (decision trees), and database indexing.
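One simple way to represent such a structure in Python is a node object that holds a list of children. The class name TreeNode, the helper functions, and the example hierarchy below are hypothetical and serve only as a sketch.

```python
class TreeNode:
    """A node with a label and a list of child nodes."""

    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def depth(node):
    """Number of levels in the tree rooted at node."""
    return 1 + max((depth(child) for child in node.children), default=0)

def labels_preorder(node):
    """Visit a node before its children, left to right."""
    result = [node.label]
    for child in node.children:
        result.extend(labels_preorder(child))
    return result

# Hypothetical hierarchy, for illustration only.
root = TreeNode("Company", [
    TreeNode("Sales", [TreeNode("Retail"), TreeNode("Online")]),
    TreeNode("Engineering"),
])
```

Both helpers work by calling themselves on each child, which is a natural fit for hierarchical data.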
Hierarchies:
Recursion:
Recursion is a technique where a function calls itself to break down a problem into smaller
subproblems.
Example:
Factorial calculation:
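The factorial code itself is not reproduced in the notes. A minimal Python sketch, using the definition that n! equals n times (n - 1)! with 0! = 1 as the base case, might look like this:

```python
def factorial(n):
    """Return n! for a non-negative integer n, computed recursively."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n <= 1:
        # Base case: 0! and 1! are both 1, so the recursion stops here.
        return 1
    # Recursive case: reduce the problem to a smaller subproblem.
    return n * factorial(n - 1)
```

Every recursive function needs a base case like the one above; without it the calls would never stop.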
A graph is a collection of nodes (vertices) and edges (connections) used to model pairwise
relations. Graphs can represent various structures like social networks, roads, and the internet.
Types of Graphs:
Degree Centrality:
Degree centrality measures the importance of a node based on how many connections it has.
$C_D(v) = \deg(v)$
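For an undirected graph stored as an adjacency mapping, the degree of a node is simply the number of its neighbors. The following Python sketch assumes that representation; the function name and the small friendship network are hypothetical.

```python
def degree_centrality(adjacency):
    """Map each node to its degree in an undirected graph."""
    return {node: len(neighbors) for node, neighbors in adjacency.items()}

# Hypothetical friendship network; each edge is listed in both directions.
graph = {
    "Ana": {"Ben", "Cy", "Dee"},
    "Ben": {"Ana", "Cy"},
    "Cy": {"Ana", "Ben"},
    "Dee": {"Ana"},
}
centrality = degree_centrality(graph)
most_connected = max(centrality, key=centrality.get)
```

Some references normalize the degree by dividing by the number of other nodes so that values from graphs of different sizes are comparable; the unnormalized count above matches the formula in these notes.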
6. Acquiring Data
Data acquisition is the process of gathering data from various sources like sensors, databases,
APIs, or web scraping. Proper data acquisition is essential for ensuring quality and relevance.
Example:
To acquire stock price data, you can use APIs like Alpha Vantage or Yahoo Finance, which
allow you to retrieve real-time stock prices programmatically.
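A request to such an API usually amounts to building a URL with query parameters and decoding a JSON response. The sketch below uses only the Python standard library. The endpoint and the parameter names (function, symbol, apikey) are assumptions based on Alpha Vantage's public documentation and should be checked against the current version; the API key is a placeholder, and the network call is left commented out.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint; verify against the provider's current documentation.
BASE_URL = "https://www.alphavantage.co/query"

def build_daily_url(symbol, api_key):
    """Build a request URL for the daily prices of one ticker symbol."""
    params = {"function": "TIME_SERIES_DAILY", "symbol": symbol, "apikey": api_key}
    return f"{BASE_URL}?{urlencode(params)}"

def fetch_json(url, timeout=10):
    """Download a URL and decode the response body as JSON."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)

url = build_daily_url("IBM", "YOUR_API_KEY")  # placeholder key
# data = fetch_json(url)  # requires network access and a valid key
```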
7. Parsing Data
Parsing involves processing raw data and converting it into a usable format. This can include
reading text, cleaning data, or extracting relevant parts of a dataset.
Example:
Web scraping using Python’s BeautifulSoup library can parse HTML pages and extract data
from specific tags.
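To keep the example free of third-party dependencies, the sketch below uses the standard library's html.parser module as a stand-in for BeautifulSoup. It shows the same idea of pulling text out of specific tags. The class name TagTextExtractor and the page fragment are hypothetical.

```python
from html.parser import HTMLParser

class TagTextExtractor(HTMLParser):
    """Collect the text found inside every occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0       # greater than 0 while inside the target tag
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            if self.depth == 1:
                self.results.append("")  # start a new text entry

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.results[-1] += data

# Hypothetical page fragment, for illustration only.
html = "<html><body><h2>Prices</h2><p>AAA: 10</p><p>BBB: <b>20</b></p></body></html>"
parser = TagTextExtractor("p")
parser.feed(html)
parser.close()
rows = [text.strip() for text in parser.results]
```

With BeautifulSoup installed, a similar result can typically be obtained by parsing the page and calling find_all("p") followed by get_text() on each element.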