All Unit DV Notes
Unit-1
Chapter-1
1. Data Extraction:
1. The process of gathering data from various sources such as databases, web
scraping, APIs, or flat files (e.g., CSV, Excel).
2. Tools like SQL, Python libraries (e.g., pandas, BeautifulSoup), and ETL
(Extract, Transform, Load) frameworks are commonly used.
3. Goal: Collect relevant, high-quality data required for analysis.
2. Data Cleaning:
1. Involves identifying and correcting errors or inconsistencies in the dataset to
improve quality.
2. Steps include handling missing data, removing duplicates, correcting data types,
and fixing inconsistencies.
3. Tools: Python (pandas, NumPy), R, and specialized tools like OpenRefine.
4. Example: Replacing null values with averages or medians, removing outliers,
or standardizing text formats.
3. Data Annotation:
1. Process of labeling or tagging data to make it usable for machine learning
models or analysis.
2. Examples include tagging images, annotating sentiment in text, or marking key
phrases.
3. Tools: Label Studio, AWS SageMaker Ground Truth.
4. Often used in supervised machine learning to create training datasets.
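The cleaning steps listed under Data Cleaning (removing duplicates, correcting data types, replacing nulls with a median) can be sketched in plain Python. The record layout and the clean_records helper are hypothetical and chosen only for illustration; in practice pandas is the usual tool for this.

```python
from statistics import median

def clean_records(records):
    """Deduplicate rows, cast ages to float, and fill missing ages with the median."""
    # Remove exact duplicates while keeping the first occurrence.
    seen, unique = set(), []
    for row in records:
        key = (row["name"], row["age"])
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))
    # Correct data types: ages may arrive as strings such as "31".
    for row in unique:
        if row["age"] is not None:
            row["age"] = float(row["age"])
    # Handle missing data: replace None with the median of the known ages.
    known = [row["age"] for row in unique if row["age"] is not None]
    fill = median(known)
    for row in unique:
        if row["age"] is None:
            row["age"] = fill
    return unique

raw = [
    {"name": "Ana", "age": "31"},
    {"name": "Ben", "age": None},
    {"name": "Ana", "age": "31"},   # duplicate row
    {"name": "Cy", "age": 45},
]
cleaned = clean_records(raw)
```

Here the duplicate "Ana" row is dropped and the missing age is filled with the median of the remaining known ages.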
1. Data Integration:
1. Combining data from multiple sources into a unified format or repository.
2. Ensures consistency and removes redundancy.
3. Examples: Merging sales and marketing data, integrating data from different
APIs.
4. Tools: ETL tools like Talend, Informatica, or Python-based solutions.
2. Data Reduction:
1. Reducing the volume of data while preserving its integrity and key features.
2. Techniques:
1. Feature selection: Choosing the most relevant attributes for analysis.
2. Dimensionality reduction: Techniques like Principal Component
Analysis (PCA) or t-SNE.
3. Sampling: Selecting a subset of the data.
3. Goal: Improve computational efficiency and focus on significant patterns.
3. Data Transformation:
1. Converting data into a suitable format or structure for analysis.
2. Includes normalization, standardization, encoding categorical variables, and
aggregating data.
3. Example: Converting timestamps into day-of-week features or one-hot
encoding categorical data.
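A minimal sketch of the transformations named under Data Transformation: min-max normalization, standardization (z-scores), one-hot encoding, and a day-of-week feature. The function names and sample values are assumptions made for illustration, and the "%Y-%m-%d" timestamp format is assumed as well.

```python
from datetime import datetime
from statistics import mean, pstdev

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def min_max_normalize(values):
    """Rescale values to the range [0, 1] (assumes they are not all equal)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Convert values to z-scores using the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def one_hot(labels):
    """Encode each label as a 0/1 vector, one column per sorted category."""
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]

def day_of_week(timestamp):
    """Turn a 'YYYY-MM-DD' timestamp into a weekday name."""
    return DAY_NAMES[datetime.strptime(timestamp, "%Y-%m-%d").weekday()]

scaled = min_max_normalize([10, 20, 40])
z_scores = standardize([2, 4, 6])
encoded = one_hot(["red", "green", "red"])   # columns: green, red
weekday = day_of_week("2024-01-01")
```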
1. Line Chart:
1. Used for displaying trends over time.
2. Example: Stock prices over a year.
2. Bar Chart:
1. Compares categorical data.
2. Example: Sales across regions.
3. Histogram:
1. Shows the distribution of a numerical variable.
2. Example: Age distribution of customers.
4. Pie Chart:
1. Displays proportions of a whole.
2. Example: Market share distribution.
5. Scatter Plot:
1. Depicts relationships or correlations between two variables.
2. Example: Age vs. income.
6. Box Plot:
1. Summarizes data distribution and identifies outliers.
2. Example: Exam scores distribution across classes.
7. Heatmap:
1. Visualizes data intensity or density using color gradients.
2. Example: Correlation matrix for variables.
Chapter-2
1. Scatterplot Matrix:
1. Displays scatterplots for all pairs of variables in a grid format.
2. Useful for identifying correlations or clusters.
2. Parallel Coordinates:
1. Represents each data point as a line passing through multiple parallel axes,
where each axis represents a variable.
2. Helps in detecting patterns or outliers across multiple dimensions.
3. Heatmaps:
1. Uses a matrix format where values are represented by varying colors.
2. Example: Correlation matrix for identifying relationships between variables.
4. 3D Scatter Plots:
1. Extends 2D scatter plots by adding a third dimension, often with interactive
rotation.
2. Tools: Matplotlib (Python), Tableau.
5. Bubble Charts:
1. Similar to scatter plots but adds a third variable through the size of the bubbles.
Pixel-oriented techniques map each data value to a pixel or small graphical element. They are
particularly useful for large datasets.
1. Principle:
1. Pixels are arranged in a way that maintains spatial relationships or highlights
patterns.
2. Examples:
1. Recursive Patterns: Display data hierarchically, where each pixel represents a
data point, and clusters are recursively divided.
2. Color-Coded Pixels: Each pixel’s color intensity represents the magnitude of a
value.
3. Advantages:
1. Handles large datasets efficiently.
2. Allows for dense visual representation.
4. Limitations:
1. May require zooming or interaction for detailed analysis.
1. Principle:
1. Project high-dimensional data onto a 2D or 3D plane.
2. Techniques:
1. Principal Component Analysis (PCA): Reduces dimensions by capturing the
directions of maximum variance.
2. Multidimensional Scaling (MDS): Preserves pairwise distances between data
points in the projection.
3. t-SNE (t-Distributed Stochastic Neighbor Embedding): Focuses on retaining
local neighborhood relationships, which makes it well suited to visualizing cluster structure.
3. Applications:
1. Visualizing clusters or patterns in high-dimensional datasets, such as gene
expression data or image embeddings.
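As an illustration of the PCA idea (find the direction of maximum variance, then project onto it), here is a minimal sketch for 2D points only. It uses the closed-form orientation of a 2x2 covariance matrix rather than a general eigen-solver, so it is not how a library such as scikit-learn implements PCA. The principal_axis and project helpers are hypothetical names.

```python
import math

def principal_axis(points):
    """Return the unit direction of maximum variance and the centered points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)
    # Angle of the leading eigenvector of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (math.cos(theta), math.sin(theta)), centered

def project(centered, direction):
    """Project centered 2D points onto a direction, giving 1D scores."""
    return [x * direction[0] + y * direction[1] for x, y in centered]

points = [(0, 0), (1, 2), (2, 4), (3, 6)]   # all points lie on y = 2x
direction, centered = principal_axis(points)
scores = project(centered, direction)
```

Because the sample points lie exactly on the line y = 2x, the recovered direction is parallel to (1, 2) and the 1D scores keep all of the variance.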
Icon-based techniques use icons (glyphs) to represent multidimensional data, with each attribute
mapped to a visual property of the icon.
1. Examples:
1. Chernoff Faces: Multivariate data is mapped to facial features like eyes, nose,
or mouth shapes.
2. Star Glyphs: Variables are represented as rays extending from a central point,
with the length of each ray indicating the value.
3. Stick Figures: Uses stick figures where limb positions or lengths represent
different attributes.
2. Advantages:
1. Makes multivariate data intuitive and memorable.
2. Useful for qualitative analysis and comparisons.
3. Limitations:
1. Hard to interpret when data is dense or highly dimensional.
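The star glyph description above (one ray per variable, ray length proportional to the value) can be turned into coordinates as follows. This is a sketch under the assumption that each value is scaled by a known per-variable maximum; the star_glyph helper and the sample numbers are illustrative only.

```python
import math

def star_glyph(values, maxima, radius=1.0):
    """Return (x, y) endpoints of the rays of a star glyph centered at the origin."""
    n = len(values)
    points = []
    for k, (value, maximum) in enumerate(zip(values, maxima)):
        angle = 2 * math.pi * k / n          # rays are evenly spaced around the center
        length = radius * value / maximum    # ray length encodes the scaled value
        points.append((length * math.cos(angle), length * math.sin(angle)))
    return points

# One observation with four attributes, each measured on a 0-10 scale.
rays = star_glyph(values=[5, 10, 2, 8], maxima=[10, 10, 10, 10])
```

Joining the endpoints in order (for example with a plotting library's polygon primitive) gives the familiar star outline for one observation.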
• Types of Representations:
o Tree Diagrams: Traditional nodes and edges format to show parent-child
relationships.
o Treemaps: Uses nested rectangles to represent hierarchical levels, where size
and color encode data attributes.
o Sunburst Charts: Circular treemaps, where layers of the hierarchy are
represented as concentric rings.
• Applications:
o Visualizing file systems, organizational structures, or biological taxonomies.
• Advantages:
o Clear representation of hierarchical relationships.
o Facilitates navigation and comparison of sub-levels.
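To make the treemap idea concrete, the sketch below computes a slice-and-dice layout: each level splits its rectangle among its children in proportion to their size, alternating between horizontal and vertical cuts. The tree structure (dicts with "name", "size", "children") and the function names are assumptions for illustration; real tools often use other layouts such as squarified treemaps.

```python
def weight(node):
    """Size of a leaf, or the total size of all leaves below an inner node."""
    children = node.get("children")
    return node["size"] if not children else sum(weight(c) for c in children)

def treemap(node, x, y, width, height, vertical=True, out=None):
    """Slice-and-dice layout: returns {leaf_name: (x, y, width, height)}."""
    if out is None:
        out = {}
    children = node.get("children")
    if not children:
        out[node["name"]] = (x, y, width, height)
        return out
    total = sum(weight(c) for c in children)
    offset = 0.0
    for child in children:
        share = weight(child) / total
        if vertical:    # cut the rectangle along the x axis
            treemap(child, x + offset, y, width * share, height, False, out)
            offset += width * share
        else:           # cut the rectangle along the y axis
            treemap(child, x, y + offset, width, height * share, True, out)
            offset += height * share
    return out

tree = {"name": "root", "children": [
    {"name": "docs", "size": 20},
    {"name": "src", "children": [{"name": "core", "size": 30},
                                 {"name": "utils", "size": 20}]},
    {"name": "tests", "size": 30},
]}
layout = treemap(tree, 0, 0, 100, 40)
```

Each leaf ends up with an area proportional to its size, and the "src" rectangle is subdivided between its two children.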
• Complex Data:
o Includes data with intricate relationships, temporal variations, or spatial
components.
o Examples: Social networks, financial markets, or sensor data.
• Techniques:
o Network Graphs: Use nodes and edges to represent entities and their
relationships. Examples include social networks or citation networks.
o Temporal Visualizations: Line graphs, Gantt charts, or time-series plots for
visualizing changes over time.
o Geospatial Visualizations: Maps with overlays of data points or heatmaps to
analyze location-based data.
• Tools for Handling Complexity:
o Interactive dashboards (e.g., Tableau, Power BI).
o Libraries like D3.js, Plotly, and Gephi for custom visualizations.
• Gestalt Principles:
o Explains how humans perceive patterns and groupings in visual elements.
o Key principles:
▪ Proximity: Elements close together are perceived as a group.
▪ Similarity: Similar shapes, colors, or sizes are grouped.
▪ Closure: Incomplete shapes are perceived as complete.
▪ Continuity: The eye follows continuous lines or paths.
▪ Figure-Ground: Differentiating an object (figure) from its background.
• Dual-Coding Theory:
o Suggests that humans process information through two systems: verbal and
non-verbal (visual).
o Combining visual and textual elements improves comprehension and memory
retention.
• Pre-Attentive Processing:
o Certain visual properties (e.g., color, size, orientation) are perceived quickly and
effortlessly.
o Useful for highlighting critical data points in a visualization.
• Cognitive Load Theory:
o The human brain has limited capacity for processing information.
o Effective visualizations minimize unnecessary elements to reduce cognitive
load.
• Color Perception Theory:
o Humans perceive colors differently based on context, brightness, and contrast.
o Using color effectively enhances clarity and prevents misinterpretation.
1. Data Types:
1. Nominal: Categories without a natural order (e.g., gender, regions).
2. Ordinal: Categories with a meaningful order (e.g., rankings, levels of
satisfaction).
3. Interval: Numeric data with equal spacing between values but no true zero (e.g., temperature in Celsius).
4. Ratio: Numeric data with a true zero (e.g., height, weight, income).
2. Visual Variables:
1. Position: Most effective for quantitative comparisons (e.g., bar positions on an
axis).
2. Size: Represents magnitude (e.g., bubble chart).
3. Shape: Differentiates categories (e.g., different markers in a scatterplot).
4. Color: Represents categories or gradients (e.g., heatmaps).
5. Orientation: Used for directional data (e.g., wind patterns).
6. Texture: Adds detail in qualitative data.
Chart Types
1. Bar Charts:
1. Compare categorical data.
2. Variants: Stacked bar chart, grouped bar chart.
2. Line Charts:
1. Track changes over time.
2. Variants: Multi-series line charts.
3. Pie Charts:
1. Show proportions of a whole. Best for simple datasets.
4. Scatter Plots:
1. Explore relationships between two variables.
5. Bubble Charts:
1. Add a third variable through bubble size.
6. Histogram:
1. Display frequency distributions for continuous data.
Statistical Graphs
1. Box Plots:
1. Summarize data distribution, including medians, quartiles, and outliers.
2. Histograms:
1. Visualize a data distribution by dividing its range into bins and counting the values in each.
3. Violin Plots:
1. Combine a box plot with a density plot to show the shape of the distribution.
4. Cumulative Distribution Function (CDF):
1. Represents cumulative probabilities.
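The numbers behind a box plot and an empirical CDF can be computed directly. The sketch below assumes the common 1.5 x IQR rule for flagging outliers and the "inclusive" quartile method of the standard library; other tools may use slightly different quartile definitions. The helper names and exam scores are made up for illustration.

```python
from statistics import quantiles

def box_plot_stats(values):
    """Quartiles plus the points a box plot would draw as outliers."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {"q1": q1, "median": q2, "q3": q3, "outliers": outliers}

def ecdf(values):
    """Empirical CDF: (value, fraction of observations <= value) pairs."""
    ordered = sorted(values)
    n = len(ordered)
    return [(v, (i + 1) / n) for i, v in enumerate(ordered)]

scores = [52, 55, 58, 60, 61, 63, 65, 95]
stats = box_plot_stats(scores)
curve = ecdf(scores)
```

With these scores, the single value 95 falls above the upper fence and would be drawn as an outlier point.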
Maps
1. Types of Maps:
1. Choropleth Maps: Use color gradients to represent data (e.g., population
density).
2. Heat Maps: Show data density or intensity over a geographic region.
3. Cartograms: Distort map areas to represent data magnitude.
4. Flow Maps: Visualize movement or connections (e.g., migration patterns).
2. Applications:
1. Geospatial data, demographic analysis, or logistics.
Trees and Networks
1. Trees:
1. Represent hierarchical data.
2. Visualizations:
1. Tree Diagrams: Nodes and edges show relationships.
2. Treemaps: Nested rectangles represent hierarchical proportions.
2. Networks:
1. Represent relationships between entities.
2. Visualizations:
1. Node-Link Diagrams: Nodes represent entities, and edges represent
relationships.
2. Adjacency Matrix: A grid format showing connections.
3. Force-Directed Graphs: Automatically arrange nodes based on their
relationships.
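An adjacency matrix is easy to build from an edge list. This sketch assumes an undirected, unweighted network; the node names and the adjacency_matrix helper are illustrative.

```python
def adjacency_matrix(nodes, edges):
    """Return a square 0/1 matrix; entry [i][j] is 1 if nodes i and j are linked."""
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for a, b in edges:
        matrix[index[a]][index[b]] = 1
        matrix[index[b]][index[a]] = 1   # undirected: mirror the entry
    return matrix

nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C"), ("C", "D")]
matrix = adjacency_matrix(nodes, edges)
```

Rendering this grid with filled and empty cells gives the adjacency-matrix view; the same edge list would feed a node-link or force-directed drawing.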
Chapter-3
1. Data Acquisition:
1. The process of collecting data from various sources for further analysis.
2. Methods:
1. Manual Data Entry: Human-recorded data (e.g., survey responses).
2. Automated Collection: Using sensors, web scraping, APIs, or data
logs.
3. Third-Party Sources: Acquiring datasets from vendors, government,
or research organizations.
2. Classification of Information Sources:
1. Primary Sources: Data collected directly from the source (e.g., experiments,
surveys).
2. Secondary Sources: Data gathered from existing works (e.g., research papers,
reports).
3. Tertiary Sources: Aggregations or summaries of primary and secondary
sources (e.g., encyclopedias).
4. By Format:
1. Structured (e.g., databases, spreadsheets).
2. Semi-structured (e.g., JSON, XML).
3. Unstructured (e.g., text, images).
Database Issues
• Continuous Variables:
o Variables that can take any value within a range (e.g., temperature, sales).
o Prediction Techniques:
▪ Linear Regression: Models the relationship between variables using a
straight line.
▪ Polynomial Regression: Fits a polynomial curve for non-linear
relationships.
▪ Neural Networks: Advanced models for complex, non-linear patterns.
• Discontinuous (Categorical) Variables:
o Variables that take discrete values (e.g., yes/no, classes).
o Prediction Techniques:
▪ Logistic Regression: For binary outcomes.
▪ Decision Trees: Splits data into categories based on conditions.
▪ Naive Bayes: Applies Bayes' theorem, assuming features are independent given the class.
▪ Support Vector Machines (SVMs): Finds decision boundaries for
classification tasks.
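For a continuous target, simple linear regression fits a straight line by ordinary least squares. The sketch below uses the textbook closed-form slope and intercept; the temperature and sales figures are invented for illustration and fit_line is a hypothetical helper.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

temps = [10, 15, 20, 25, 30]          # predictor (made-up values)
sales = [25, 35, 45, 55, 65]          # continuous target (made-up values)
slope, intercept = fit_line(temps, sales)
predicted = slope * 22 + intercept    # prediction for an unseen temperature
```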
Chapter-1
Scalar and point visualization techniques are used to represent scalar (single-value) data and
point-based data.
Vector data contains magnitude and direction, commonly used in fields like physics,
meteorology, and fluid dynamics.
Multi-Dimensional Techniques
These techniques help visualize data with more than three dimensions.
13. Glyphs:
1. Small graphical representations where multiple attributes are mapped to shape,
size, color, or orientation.
2. Examples:
1. Chernoff Faces: Uses facial features to represent multivariate data.
2. Star Glyphs: Uses radial spokes to represent multiple variables.
14. Graph-Theoretic Graphics:
1. Visual representations of relationships between entities (nodes and edges).
2. Common techniques:
1. Node-Link Diagrams: Nodes (entities) are connected by edges
(relationships).
2. Adjacency Matrix: Uses a grid to indicate relationships between
entities.
3. Force-Directed Layouts: Dynamically position nodes based on their
connections (e.g., social networks).
15. Linked views: a technique where multiple visualizations are linked, so interacting with one
view updates the others.
16. Common applications:
1. Selecting a data point in a scatter plot highlights related points in a parallel
coordinate plot.
2. Brushing in a histogram updates a corresponding time-series chart.
17. Tools supporting linked views:
1. Tableau, Power BI, D3.js.
18. Density visualization helps in understanding complex data distributions where standard plots
may be too cluttered.
19. Techniques:
1. Kernel Density Estimation (KDE): Smooths the distribution to reveal
density patterns.
2. Hexbin Plots: Aggregates point data into hexagonal bins to reduce clutter.
3. Contour Density Maps: Use isolines to represent density variations in 2D.
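Kernel density estimation places a smooth kernel on every sample and averages them. A minimal one-dimensional sketch with a Gaussian kernel follows; the bandwidth of 0.5 and the sample values are arbitrary choices, and gaussian_kde here is a local helper rather than a library function.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that evaluates the Gaussian KDE at a point x."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                          for s in samples)
    return density

samples = [1.0, 1.2, 1.9, 2.1, 5.0]
density = gaussian_kde(samples, bandwidth=0.5)

grid = [i * 0.1 for i in range(-30, 91)]   # evaluation points from -3.0 to 9.0
curve = [density(x) for x in grid]
area = sum(curve) * 0.1                    # Riemann sum, should be close to 1
```

Plotting curve against grid gives the smoothed density; a larger bandwidth produces a smoother curve, a smaller one a bumpier curve.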
Used for 3D scalar fields like medical imaging (CT scans) and scientific simulations.
Attribute Mapping
Example Applications:
Cluster analysis is the process of grouping similar data points together. Effective
visualization techniques help in understanding the structure and relationships within clusters.
• Dendrograms are used in agglomerative hierarchical clustering, where data points are merged step by step.
• The hierarchical tree structure shows how clusters form at different distance thresholds.
27. A heatmap with hierarchical clustering arranges similar data points together and uses color
intensity to represent relationships.
28. Frequently used in gene expression analysis and customer segmentation.
29. In a parallel coordinates plot, each feature is represented by a vertical axis, and cluster
groupings can be observed as patterns across multiple dimensions.
2. Mosaic Plots
3. Matrix Visualizations
36. Market Basket Analysis uses graph-based visualizations to show relationships between
frequently bought products.
1. Posterior Distributions
37. Bayesian analysis updates prior beliefs using observed data to produce posterior
distributions.
38. Density Plots or Histograms represent posterior distributions to show probability spread.
39. Ensure visualizations do not mislead (e.g., avoiding truncated y-axes in bar charts).
40. Use error bars in plots to indicate variability.
1. Correlation Metrics: Evaluates how well relationships are preserved (e.g., Spearman’s
correlation in scatterplots).
2. Silhouette Score for Cluster Visualization: Measures how well clusters are separated.
3. Distortion Measures: Quantifies information loss in dimensionality reduction methods.
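As one concrete correlation metric, Spearman's correlation can check whether the ordering of distances in the original data is preserved after projection. The sketch computes it as the Pearson correlation of average ranks; the two distance lists are invented, and the helper names are illustrative.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1          # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

original_distances = [1.0, 2.0, 3.0, 4.0, 5.0]
projected_distances = [0.9, 2.5, 2.4, 8.0, 30.0]
rho = spearman(original_distances, projected_distances)
```

A value near 1 means the projection keeps the rank order of distances even if the magnitudes are distorted.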
Chapter-2
Genetic networks model interactions between genes, proteins, and other biomolecules.
Visualization helps in understanding these complex relationships.
1. Graph-Based Representations
1. Heatmaps are used to visualize upregulated (red) and downregulated (blue) genes across conditions.
2. Clustering (Hierarchical Clustering + Heatmaps) identifies co-expressed genes.
3. Pathway Visualization
1. KEGG Pathway Maps: Show biochemical pathways where genes play a role.
2. Cytoscape: A tool to visualize molecular interactions.
3. Segmentation-Based Visualization
1. Region Growing & Thresholding: Used to highlight tumors or organs in medical images.
2. Deep Learning for Image Segmentation: AI-based detection of abnormalities.
4. 3D Medical Visualization Tools
Financial data visualization is crucial for trend analysis, risk assessment, and
decision-making.
1. Line Charts: Commonly used to track stock prices, exchange rates, and interest rates over
time.
2. Candlestick Charts: Represent open, high, low, and closing prices in stock markets.
3. Moving Averages & Bollinger Bands: Identify trends and volatility.
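Bollinger Bands are typically drawn as a moving average with an upper and lower band a fixed number of standard deviations away. The sketch below assumes a 5-period window (to suit the short example series), a width of 2 standard deviations, and the population standard deviation. The closing prices are invented and bollinger_bands is a hypothetical helper.

```python
from statistics import mean, pstdev

def bollinger_bands(prices, window=5, width=2.0):
    """Return (lower, middle, upper) for every full window of prices."""
    bands = []
    for end in range(window, len(prices) + 1):
        recent = prices[end - window:end]
        middle = mean(recent)                 # moving average
        spread = width * pstdev(recent)       # volatility of the window
        bands.append((middle - spread, middle, middle + spread))
    return bands

closes = [10, 11, 12, 11, 10, 12, 14]
bands = bollinger_bands(closes, window=5)
```

Plotting the three series over a line or candlestick chart shows the trend (middle band) and how volatility widens or narrows the envelope.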
Insurance risk visualization helps assess claim patterns and policy risks.
2. Community Detection
1. Centrality Measures:
1. Degree Centrality: Number of direct connections.
2. Betweenness Centrality: Identifies "bridge" nodes.
3. Eigenvector Centrality: Measures influence based on connected nodes.
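Degree centrality is the simplest of these measures to compute. The sketch below counts each node's direct connections in an undirected network and divides by the maximum possible (n - 1), a common normalization. The friendship edges and the degree_centrality helper are made up for illustration.

```python
from collections import defaultdict

def degree_centrality(edges):
    """Fraction of the other nodes that each node is directly connected to."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    n = len(neighbours)
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}

friendships = [("Ann", "Bob"), ("Ann", "Cat"), ("Ann", "Dan"), ("Bob", "Cat")]
centrality = degree_centrality(friendships)
```

In a network drawing, these scores are often mapped to node size so that well-connected nodes stand out.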
Chapter-3
a. Lists in HTML
b. Tables in HTML
Key Features:
XML (eXtensible Markup Language) is used for storing and transporting data in a structured,
readable format. It is especially useful in web-based environments for data interchange and
visualization.
Relevant Technologies:
Applications:
13. Browser-based interactive dashboards
14. SVG-based statistical maps and graphs
15. Converting raw XML data into visual elements
Google Maps API allows developers to embed Google Maps into web applications and
customize them with data layers for geographic data visualization.
Features:
16. Markers & Layers: Place markers, shapes (polygons, polylines), heatmaps, and
custom overlays on maps.
17. Geolocation: Visualize user location or data points with latitude and longitude.
18. Integration with Data Sources: Use data from external databases (like crime
statistics or weather data) and plot on maps.
19. Customization: Add interactivity (clickable info windows, custom icons, etc.)
Google Charts is a free tool for creating a variety of charts and graphs in web pages using
simple JavaScript and HTML.
Features:
23. Interactive: Charts are dynamic and allow zooming, hovering, and selection.
24. Wide Range of Chart Types: Line, bar, pie, combo, scatter, geo charts, treemaps,
timelines, etc.
25. Customizable: Control colors, labels, animations, and tooltips.
26. Data Integration: Easily integrates with data from Google Sheets, APIs, or manual
input.
Tableau is a leading commercial tool for advanced data visualization and business
intelligence (BI). It's widely used for turning raw data into interactive, shareable dashboards.
Features:
Summary Table:
Data rankings involve ordering or scoring data based on some criteria (e.g., ranking countries by
GDP, products by sales, students by grades). Tools for analyzing and visualizing rankings help to
uncover insights into performance, distribution, and comparisons.
Key Tools:
• Ranking Functions: Excel's RANK(), RANK.EQ(), and RANK.AVG() functions assign rank
numbers.
• Conditional Formatting: Highlights top/bottom performers using color scales.
• Charts:
o Bar charts for ordered comparisons
o Column charts showing ranked performance
o Sparklines to show ranking movements over time
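The two tie-handling styles behind the ranking functions can be sketched in Python. This is an approximation of the descending-order behaviour of Excel's RANK.EQ and RANK.AVG as commonly described, not a reimplementation of Excel. The rank_eq and rank_avg names and the sales figures are illustrative.

```python
def rank_eq(values):
    """Tied values share the best rank they span, e.g. 1, 2, 2, 4."""
    return [1 + sum(other > v for other in values) for v in values]

def rank_avg(values):
    """Tied values share the mean of the ranks they span, e.g. 1, 2.5, 2.5, 4."""
    result = []
    for v in values:
        higher = sum(other > v for other in values)
        ties = sum(other == v for other in values)
        result.append(higher + (ties + 1) / 2)
    return result

sales = [300, 250, 250, 100]
eq_ranks = rank_eq(sales)
avg_ranks = rank_avg(sales)
```

Sorting a bar chart by either rank column gives the ordered comparison described in the chart bullets.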
b) Tableau / Power BI
34. Dynamic Ranking: Automatically ranks data based on selected metrics (e.g., Top N
products).
35. Interactive Dashboards:
1. Users can filter to show "Top 10", "Bottom 5", etc.
2. Parameters and calculated fields to dynamically adjust rankings.
36. Visual Representations:
1. Horizontal bar charts sorted by rank
2. Bump charts showing changes in ranking over time
43. Flourish (online visualization platform): Supports easy creation of ranking bar charts and
"bar chart races" (animated ranking over time).
44. Datawrapper: No-code tool for ranking visualizations and tables with sorting and
highlighting.
Key Methods:
45. Moving Average: Smooths out short-term fluctuations to reveal longer-term trends.
46. Exponential Smoothing: Gives more weight to recent observations for trend detection.
47. Seasonal Decomposition: Breaks data into trend, seasonal, and residual components.
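The first two methods can be sketched in a few lines. The window size of 3 and the smoothing factor alpha = 0.5 are arbitrary choices for the example, and the demand series is invented.

```python
def moving_average(series, window):
    """Mean of each run of `window` consecutive values."""
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: recent observations get more weight."""
    smoothed = [series[0]]                 # start from the first observation
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

demand = [10, 12, 14, 13, 15, 18]
ma3 = moving_average(demand, window=3)
ses = exponential_smoothing(demand, alpha=0.5)
```

Overlaying either smoothed series on the raw line chart makes the longer-term trend easier to see. A larger alpha tracks recent changes more closely, while a larger window smooths more aggressively.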
Visualization:
b) Regression Analysis
50. Linear Regression: Models the relationship between independent and dependent variables
to identify trends.
51. Polynomial Regression: Fits non-linear trends.
52. Trendlines in Charts:
1. Add trendlines to scatter plots or time series to visually represent
upward/downward trends.
Tools: Excel, R (lm()), Python (sklearn.linear_model)
Visualization:
53. Grouping data to detect similar trend patterns (e.g., clustering customers by purchasing
behavior over time).
Visualization:
56. Smoothing algorithms (e.g., moving average, LOESS) to highlight overall trends.
57. Anomaly detection to spot sudden spikes or drops that may mask true trends.
Visualization:
Moving Average Line | Smoothing data | Line chart with moving average
Bar Chart Race | Dynamic ranking + trend over time | Animated race chart
Scatter Plot with Trendline | Correlation and directionality | Scatter with regression line
Key Techniques:
60. Definition: A grid of scatterplots, each showing a relationship between two variables.
61. Use Case: Quickly view pairwise relationships and spot correlations.
Tools:
65. Definition: Each variable has its own axis; lines represent observations passing through their
values on each axis.
66. Use Case: Detect clusters, outliers, and trends across many dimensions.
Tools:
• Definition: Extension of scatterplots where the size of the bubble represents a third variable.
• Use Case: Show relationships involving three continuous variables.
Tools:
• Google Charts
• Power BI
• Python’s plotly.express.scatter()
67. PCA (Principal Component Analysis): Reduces dimensions while preserving variance, then
visualized in 2D or 3D.
68. t-SNE / UMAP: Techniques for visualizing high-dimensional data in 2D/3D with attention to
clustering.
Tools:
Analyzing distributions reveals the spread, skewness, central tendency, and variability of the data.
a) Histograms
71. Definition: Bar charts representing the frequency of data points in bins.
72. Use Case: Understand shape, center, and spread.
Tools:
c) Violin Plots
81. Definition: Combine box plot and density plot for richer distribution information.
82. Use Case: Compare distributions more precisely than boxplots alone.
Tools:
• Python’s seaborn.violinplot()
• R: vioplot package
Tools:
• Seaborn (kdeplot)
• R (density function)
83. Definition: Grid where cells are colored based on correlation coefficients between variables.
84. Use Case: Quickly identify strong/weak relationships.
Tools:
b) Scatter Plots
Tools:
90. Excel
91. Python (matplotlib, seaborn)
92. R (ggplot2)
c) Regression Lines
Tools:
e) 3D Scatter Plots
Tools:
• Plotly (plotly.express.scatter_3d())
• Matplotlib 3D Toolkit
Spatial and geographic visualizations help map data linked to locations and reveal patterns across
regions.
a) Choropleth Maps
Tools:
1. Plotly (choropleth)
2. Tableau
3. Google Maps API
4. Leaflet.js (for web)
Tools:
1. ArcGIS
2. QGIS
3. R’s tmap
d) Flow Maps (Movement Visualization)
Tools:
Summary Table
Category | Techniques / Tools | Typical Visualizations
Spatial / Geographical Data | Choropleth maps, Heatmaps, Flow maps, Surface maps | Geo-maps, density maps, 3D terrain