Data Analysis and Visualization Exam Answers Summer 2022
Q.1(a) What is cluster? Explain types of cluster and cluster analysis with
proper example.
A cluster is a group of similar data points that are more related to each other
than to data points in other clusters.
Types of Clusters:
1. Centroid-based clusters: Groups organized around a central point (e.g.,
K-means)
2. Density-based clusters: Groups defined by dense areas separated by
sparse regions (e.g., DBSCAN)
3. Distribution-based clusters: Groups following certain statistical
distributions (e.g., Gaussian mixture models)
4. Hierarchical clusters: Nested groups arranged in a tree structure
Cluster Analysis is the process of grouping similar objects into clusters based
on their features.
Example: In customer segmentation, a retail company might cluster customers
based on purchasing behavior, identifying groups like "frequent high-value
shoppers," "occasional buyers," and "bargain hunters," allowing for targeted
marketing strategies.
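A centroid-based method such as K-means can be sketched in a few lines. The points, the choice of k = 2, and the function name kMeans are illustrative assumptions, not data from the exam. The first k points are used as initial centroids only to keep the sketch deterministic.

```javascript
// Minimal K-means sketch on hypothetical 2-D points (not exam data).
function kMeans(points, k, maxIter = 100) {
  // Assumption: the first k points serve as initial centroids (deterministic).
  let centroids = points.slice(0, k).map(p => [...p]);
  let labels = new Array(points.length).fill(-1);

  for (let iter = 0; iter < maxIter; iter++) {
    // Assignment step: attach each point to its nearest centroid.
    const newLabels = points.map(p => {
      let best = 0;
      let bestDist = Infinity;
      centroids.forEach((c, j) => {
        const d = Math.hypot(p[0] - c[0], p[1] - c[1]);
        if (d < bestDist) { bestDist = d; best = j; }
      });
      return best;
    });

    // Stop when no point changes cluster.
    if (newLabels.every((l, i) => l === labels[i])) break;
    labels = newLabels;

    // Update step: move each centroid to the mean of its members.
    centroids = centroids.map((c, j) => {
      const members = points.filter((_, i) => labels[i] === j);
      if (members.length === 0) return c; // keep the centroid of an empty cluster
      return [
        members.reduce((s, p) => s + p[0], 0) / members.length,
        members.reduce((s, p) => s + p[1], 0) / members.length,
      ];
    });
  }
  return { labels, centroids };
}

// Two visibly separated groups of hypothetical points.
const samplePoints = [[1, 1], [1.5, 2], [1, 0.6], [8, 8], [9, 9.5], [8.5, 9]];
const clusters = kMeans(samplePoints, 2);
console.log(clusters.labels, clusters.centroids);
```

In practice the initial centroids are usually chosen at random or with a seeding scheme, and the result can depend on that choice.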
Q.1(b) Figure out missing value from the given below table using KNN
algorithm and also visualize error and detection graph.
To find the missing Sub2_mark value for ID 3 using KNN:
1. Calculate distances between ID 3 and other data points:
o Distance to ID 1: √[(74-75)²] = 1
o Distance to ID 2: √[(74-62)²] = 12
o Distance to ID 4: √[(74-99)²] = 25
Country A B C D E
A 0 4 2 5 3
B 4 0 3 2 4
C 2 3 0 3 2
D 5 2 3 0 3
E 3 4 2 3 0
o D → B (distance: 2, capacity: 2)
2. Area Charts: Similar to line charts but with filled area below the line
o Example: Cumulative sales over quarters
Enrollment No. Mathematics Marks
2019001 85
2019002 72
2019003 90
2019004 68
2019005 95
2019006 78
2019007 82
2019008 88
2019009 75
2019010 92
Correlation Analysis
If we correlate enrollment number with marks:
Enrollment numbers: [2019001, 2019002, ..., 2019010]
Simplified as: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Marks: [85, 72, 90, 68, 95, 78, 82, 88, 75, 92]
Correlation Coefficient (r) ≈ 0.20 (weak positive correlation)
Scatter Plot Description
X-axis: Student Number (1-10)
Y-axis: Mathematics Marks (60-100)
Pattern: Slightly upward trend with scattered points
Interpretation: Weak positive relationship between student number and
marks
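The coefficient can be checked with a short Pearson correlation routine. The function name pearson is an illustrative choice; the two arrays are the simplified student numbers and the marks listed above. On these ten values the result is close to 0.2.

```javascript
// Pearson correlation coefficient between two equal-length arrays.
function pearson(x, y) {
  const n = x.length;
  const meanX = x.reduce((s, v) => s + v, 0) / n;
  const meanY = y.reduce((s, v) => s + v, 0) / n;
  let sxy = 0, sxx = 0, syy = 0;
  for (let i = 0; i < n; i++) {
    const dx = x[i] - meanX;
    const dy = y[i] - meanY;
    sxy += dx * dy; // co-variation of x and y
    sxx += dx * dx; // variation of x
    syy += dy * dy; // variation of y
  }
  return sxy / Math.sqrt(sxx * syy);
}

const studentNo = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
const marks = [85, 72, 90, 68, 95, 78, 82, 88, 75, 92];
const r = pearson(studentNo, marks);
console.log(r.toFixed(2)); // "0.20"
```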
OR Questions
Q.3(a) List out PCA algorithm steps with proper example. Why we use
PCA in data visualization? (03)
PCA Algorithm Steps
Step 1: Data Standardization
Center data by subtracting mean
Scale to unit variance
Formula: z = (x - μ) / σ
Step 2: Compute Covariance Matrix
Calculate covariance between all pairs of features
Matrix C = (1/(n-1)) × X^T × X, where X is the standardized data matrix
Step 3: Eigenvalue Decomposition
Find eigenvalues and eigenvectors of covariance matrix
Eigenvalues represent variance explained by each component
Step 4: Select Principal Components
Sort eigenvalues in descending order
Choose top k components that explain desired variance (e.g., 95%)
Step 5: Transform Data
Project original data onto selected principal components
New_data = Standardized_data × Selected_eigenvectors
Example
Original Data: Students with features [Math, Science, English, History]
Student 1: [85, 90, 78, 82]
Student 2: [72, 75, 85, 80]
After PCA: Reduce to 2 components explaining 90% variance
PC1 might represent "STEM ability"
PC2 might represent "Language ability"
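The five steps can be sketched without a linear-algebra library. Two things here are assumptions: the marks are hypothetical [Math, Science] values for four students, and power iteration stands in for a full eigenvalue decomposition, so only the first principal component is recovered. The function names are illustrative.

```javascript
// Step 1: standardize each column to zero mean and unit (sample) variance.
function standardize(X) {
  const n = X.length, d = X[0].length;
  const mean = Array.from({ length: d }, (_, j) =>
    X.reduce((s, row) => s + row[j], 0) / n);
  const std = Array.from({ length: d }, (_, j) =>
    Math.sqrt(X.reduce((s, row) => s + (row[j] - mean[j]) ** 2, 0) / (n - 1)));
  return X.map(row => row.map((v, j) => (v - mean[j]) / std[j]));
}

// Step 2: covariance matrix C = (1 / (n - 1)) * Z^T * Z for standardized Z.
function covariance(Z) {
  const n = Z.length, d = Z[0].length;
  return Array.from({ length: d }, (_, a) =>
    Array.from({ length: d }, (_, b) =>
      Z.reduce((s, row) => s + row[a] * row[b], 0) / (n - 1)));
}

// Steps 3-4: power iteration converges to the eigenvector with the largest
// eigenvalue (assumes the start vector is not orthogonal to that eigenvector).
function firstComponent(C, iterations = 200) {
  let v = C.map((_, i) => i + 1);
  let eigenvalue = 0;
  for (let it = 0; it < iterations; it++) {
    const w = C.map(row => row.reduce((s, c, j) => s + c * v[j], 0));
    eigenvalue = Math.hypot(...w); // norm of C*v approaches the top eigenvalue
    v = w.map(x => x / eigenvalue);
  }
  return { eigenvalue, vector: v };
}

// Hypothetical marks [Math, Science] for four students.
const scores = [[85, 90], [72, 75], [60, 66], [95, 94]];
const Z = standardize(scores);
const C = covariance(Z);
const pc1 = firstComponent(C);

// Step 5: project each standardized row onto the first component.
const projected = Z.map(row =>
  row.reduce((s, z, j) => s + z * pc1.vector[j], 0));

// Share of total variance explained by PC1 (total variance = number of features).
console.log(pc1.eigenvalue / C.length, projected);
```

Because the two hypothetical subjects are strongly correlated, nearly all of the variance falls on the first component, which is the situation in which reducing to fewer dimensions loses little information.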
Why Use PCA in Data Visualization?
1. Dimensionality Reduction: Reduces high-dimensional data to 2D/3D for
plotting
2. Noise Removal: Eliminates less important variations
3. Pattern Recognition: Reveals hidden patterns in data
4. Computational Efficiency: Faster processing with fewer dimensions
House Size (sq ft) Price
1200 180
1500 220
1800 280
2100 320
2400 380
2700 420
3000 480
Regression Analysis
Linear Model: Price = 0.15 × Size + 30
Visualization Components
1. Scatter Plot:
o X-axis: House Size (1000-3000 sq ft)
2. Regression Line:
o Best-fit line through data points
3. Confidence Intervals:
o Shaded area around regression line
4. Residual Analysis:
o Additional plot showing prediction errors
Model Evaluation
R-squared (R²): 0.92 (92% variance explained)
RMSE: $25,000 (average prediction error)
Interpretation: Strong positive relationship between size and price
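A minimal least-squares sketch, applied to the seven rows listed above, shows how the slope, intercept, R² and RMSE are obtained. The column meanings (size in sq ft, then price) are assumed because the table header is missing, and the function name fitLine is illustrative. On these seven rows alone the fit comes out near slope 0.167 and intercept -24.3, so the stated model and metrics may rest on rows or rounding not reproduced here.

```javascript
// Ordinary least-squares fit of y = slope * x + intercept.
function fitLine(x, y) {
  const n = x.length;
  const meanX = x.reduce((s, v) => s + v, 0) / n;
  const meanY = y.reduce((s, v) => s + v, 0) / n;
  let sxy = 0, sxx = 0;
  for (let i = 0; i < n; i++) {
    sxy += (x[i] - meanX) * (y[i] - meanY);
    sxx += (x[i] - meanX) ** 2;
  }
  const slope = sxy / sxx;
  const intercept = meanY - slope * meanX;

  // R^2 = 1 - SSE / SST and RMSE = sqrt(SSE / n)
  let sse = 0, sst = 0;
  for (let i = 0; i < n; i++) {
    sse += (y[i] - (slope * x[i] + intercept)) ** 2;
    sst += (y[i] - meanY) ** 2;
  }
  return { slope, intercept, r2: 1 - sse / sst, rmse: Math.sqrt(sse / n) };
}

// The seven rows from the table; column meanings are assumed.
const size = [1200, 1500, 1800, 2100, 2400, 2700, 3000];
const price = [180, 220, 280, 320, 380, 420, 480];
const fit = fitLine(size, price);
console.log(fit);
```

The residuals y[i] - (slope * x[i] + intercept) computed inside the loop are the values a residual plot would display.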
Applications of Regression Visualization
1. Business Forecasting: Sales prediction models
2. Quality Control: Process optimization
3. Risk Assessment: Insurance premium calculation
4. Scientific Research: Hypothesis testing and validation
The visualization helps stakeholders understand the strength of relationships,
make predictions, and identify outliers or unusual patterns in the data.
Q.4 Data Classification and Neural Network Algorithms
Q.4(a) Define Classification. Explain difference between cluster and
classification with visualization. (03)
Classification Definition
Classification is a supervised machine learning technique that assigns
predefined class labels to data instances based on their features. It learns from
labeled training data to predict categories for new, unseen data.
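Differences Between Clustering and Classification
Clustering is unsupervised: it discovers groups that are not known in advance.
Classification is supervised: it assigns data to predefined class labels learned
from labeled training data.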
Visualization Example
Clustering Visualization:
Data points grouped by similarity
○○ ●● △△
○ ● △
Cluster 1 Cluster 2 Cluster 3
(Unknown groups discovered)
Classification Visualization:
Data points with known labels
Memory Usage: ANN low (stores weights only); KNN high (stores entire dataset)
A 8 90 Pass
B 3 60 Fail
C 7 85 Pass
D 2 50 Fail
E 9 95 Pass
F 4 70 Fail
Algorithm Execution
1. Query Point: (6, 80)
2. Distances Calculated: All training points
3. K=3 Selected: C(5.1), A(10.2), F(10.2); E(15.3) is farther away than F
4. Voting: 2 Pass (C, A), 1 Fail (F)
5. Prediction: PASS
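The execution can be reproduced with a short K-nearest-neighbours sketch. The field names x1 and x2 are placeholders for the two numeric columns, whose headers are missing from the table, and knnClassify is an illustrative name. Euclidean distance is assumed because it matches the distances quoted for C, A and E. Note that A and F are equidistant from the query point.

```javascript
// K-nearest-neighbours classification by majority vote.
function knnClassify(training, query, k) {
  const neighbours = training
    .map(row => ({
      ...row,
      distance: Math.hypot(row.x1 - query[0], row.x2 - query[1]),
    }))
    .sort((a, b) => a.distance - b.distance) // stable sort: ties keep table order
    .slice(0, k);

  // Count the class labels among the k nearest rows.
  const votes = {};
  for (const n of neighbours) votes[n.label] = (votes[n.label] || 0) + 1;
  const prediction = Object.keys(votes)
    .reduce((a, b) => (votes[a] >= votes[b] ? a : b));
  return { neighbours, votes, prediction };
}

// Rows A-F from the table above.
const training = [
  { id: "A", x1: 8, x2: 90, label: "Pass" },
  { id: "B", x1: 3, x2: 60, label: "Fail" },
  { id: "C", x1: 7, x2: 85, label: "Pass" },
  { id: "D", x1: 2, x2: 50, label: "Fail" },
  { id: "E", x1: 9, x2: 95, label: "Pass" },
  { id: "F", x1: 4, x2: 70, label: "Fail" },
];

const result = knnClassify(training, [6, 80], 3);
console.log(result.neighbours.map(n => `${n.id}(${n.distance.toFixed(1)})`).join(", "));
console.log(result.votes, result.prediction);
```

Because the second column spans a much larger range than the first, it dominates the distance; scaling the features before computing distances is a common remedy.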
OR Questions
Q.4(a) Define why we used most of time classification process in data
visualization. (03)
Why Classification in Data Visualization?
1. Pattern Recognition
Classification helps identify distinct groups in data
Makes complex datasets more understandable
Example: Customer segmentation visualization showing different buyer
personas
2. Decision Support
Provides clear categorical outcomes for business decisions
Simplifies complex relationships into actionable insights
Example: Medical diagnosis visualization showing disease vs. healthy
classifications
3. Data Exploration
Reveals hidden structures in data
Helps identify outliers and anomalies
Example: Fraud detection visualization highlighting suspicious
transactions
Benefits:
Clarity: Reduces complexity to understandable categories
Actionability: Provides specific recommendations
Communication: Easy to explain to stakeholders
Q.4(b) Describe KNN & ANN algorithm. Reason for using NN in data
analysis. (04)
KNN Algorithm
K-Nearest Neighbors is a lazy learning algorithm that classifies data points
based on the majority class of their K nearest neighbors.
Characteristics:
Instance-based learning
No explicit training phase
Distance-based predictions
Simple but effective
ANN Algorithm
Artificial Neural Network mimics brain neurons to learn complex patterns
through interconnected layers.
Architecture:
Input Layer → Hidden Layer(s) → Output Layer
Weights and biases adjusted during training
Uses backpropagation for learning
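The weight and bias adjustment can be illustrated with the smallest possible case: a single sigmoid neuron trained by gradient descent. The AND-gate data, the learning rate, and the epoch count are toy assumptions, not exam data. With no hidden layer this shows only the update rule, not backpropagation through several layers.

```javascript
// One sigmoid neuron trained by gradient descent (toy AND-gate data).
const sigmoid = z => 1 / (1 + Math.exp(-z));

function trainNeuron(inputs, targets, epochs = 5000, learningRate = 0.5) {
  let weights = [0, 0];
  let bias = 0;
  for (let epoch = 0; epoch < epochs; epoch++) {
    inputs.forEach((x, i) => {
      // Forward pass: weighted sum followed by the activation function.
      const output = sigmoid(weights[0] * x[0] + weights[1] * x[1] + bias);
      // Gradient of the cross-entropy loss with respect to the weighted sum.
      const error = output - targets[i];
      // Update step: move weights and bias against the gradient.
      weights = weights.map((w, j) => w - learningRate * error * x[j]);
      bias -= learningRate * error;
    });
  }
  return { weights, bias };
}

const andInputs = [[0, 0], [0, 1], [1, 0], [1, 1]];
const andTargets = [0, 0, 0, 1];
const model = trainNeuron(andInputs, andTargets);

// Threshold the neuron's output at 0.5 to get a class label.
const predict = x =>
  sigmoid(model.weights[0] * x[0] + model.weights[1] * x[1] + model.bias) >= 0.5 ? 1 : 0;
console.log(andInputs.map(predict));
```

A network with hidden layers repeats the same idea per layer, passing the error backwards through the chain rule.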
Reasons for Using NN in Data Analysis
1. Complex Pattern Recognition
Captures non-linear relationships
Handles high-dimensional data effectively
Example: Image recognition, speech processing
2. Adaptability
Learns from data without explicit programming
Improves performance with more training data
Handles noisy and incomplete data
3. Versatility
Works for classification, regression, clustering
Applicable across diverse domains
Applications: Finance, healthcare, marketing
4. Scalability
Handles large datasets efficiently
Parallel processing capabilities
Cloud computing integration
85 90 88
70 75 72
95 95 96
60 80 68
Training Process
1. Epoch 1: Error = 0.01, adjust weights
2. Epoch 2: Error = 0.008, continue training
3. Epoch N: Error < threshold, stop training
Final Result: Network learns to predict grades with high accuracy based on
midterm scores and attendance patterns.
OR Questions
Q.5(a) Write with example how we visualize data using D3.js. (03)
D3.js Data Visualization Example
Basic Bar Chart Implementation
HTML Structure:
<!DOCTYPE html>
<html>
<head>
<script src="https://d3js.org/d3.v7.min.js"></script>
</head>
<body>
<div id="chart"></div>
</body>
</html>
JavaScript Code:
// Sample sales data
const salesData = [
{month: "Jan", sales: 120},
{month: "Feb", sales: 150},
{month: "Mar", sales: 180},
{month: "Apr", sales: 140},
{month: "May", sales: 200}
];
// Set dimensions and margins
const margin = {top: 20, right: 30, bottom: 40, left: 50};
const width = 500 - margin.left - margin.right;
const height = 300 - margin.top - margin.bottom;
// Create SVG container
const svg = d3.select("#chart")
.append("svg")
.attr("width", width + margin.left + margin.right)
.attr("height", height + margin.top + margin.bottom)
.append("g")
.attr("transform", `translate(${margin.left},${margin.top})`);
// Create scales
const xScale = d3.scaleBand()
.domain(salesData.map(d => d.month))
.range([0, width])
.padding(0.1);
const yScale = d3.scaleLinear()
.domain([0, d3.max(salesData, d => d.sales)])
.range([height, 0]);
// Create bars
svg.selectAll(".bar")
.data(salesData)
.enter().append("rect")
.attr("class", "bar")
.attr("x", d => xScale(d.month))
.attr("width", xScale.bandwidth())
.attr("y", d => yScale(d.sales))
.attr("height", d => height - yScale(d.sales))
.attr("fill", "steelblue");
// Add axes
svg.append("g")
.attr("transform", `translate(0,${height})`)
.call(d3.axisBottom(xScale));
svg.append("g")
.call(d3.axisLeft(yScale));
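Result: A static bar chart of monthly sales, with months on the x-axis and a
linear sales scale on the y-axis. The code above does not add hover effects or
animations; those would require extra event handlers and transitions.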