21bcp420 ML Lab Report
SCHOOL OF TECHNOLOGY
G-12
LAB ASSIGNMENT-1
Objective: The objective of this lab assignment is to explore and analyze various data visualization tools
used for representing and understanding complex datasets. Through this assignment, you will gain
insights into the strengths, weaknesses, and practical applications of different visualization tools.
Tasks:
1. Key Concepts
• Data Types
In essence, data visualization transforms raw data into a visual context, enabling better analysis
and understanding.
2. Selecting Data Visualization Tools: Research and select at least four different data visualization
tools. Examples include:
a. Matplotlib (Python)
b. Seaborn (Python)
c. ggplot2 (R)
d. Plotly (Python, R)
3. Provide a brief overview of each tool's capabilities and features.
a. Matplotlib (Python)
Capabilities and Features:
• 2D Plotting: Supports a wide range of static, animated, and interactive 2D plots, including
line plots, scatter plots, bar charts, histograms, and pie charts.
• Customization: Highly customizable plots, allowing control over every aspect of the plot.
• Integration: Integrates well with other Python libraries like NumPy, Pandas, and SciPy.
• Publication Quality: Can produce publication-quality figures in various formats and
interactive environments across platforms.
• Extensibility: Extensive range of third-party packages built on Matplotlib to extend its
functionality.
b. Seaborn (Python)
These tools each have their unique strengths and are suited to different aspects of data visualization,
from basic charting to complex, interactive dashboards.
4. Provide at least one practical use case scenario where each tool would be particularly useful.
a. Matplotlib (Python)
• Scenario: A biologist is studying the growth patterns of bacteria under different conditions.
• Application: Using Matplotlib, the biologist can create detailed line plots to compare
growth curves across different experimental conditions. The ability to customize plots
extensively allows for clear presentation of results in research papers.
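A minimal Matplotlib sketch of this kind of line-plot comparison (the growth values below are synthetic, purely for illustration):

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical growth measurements for two experimental conditions
hours = np.arange(0, 24, 2)
growth_a = 0.05 * np.exp(0.25 * hours)
growth_b = 0.05 * np.exp(0.18 * hours)

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(hours, growth_a, marker="o", label="Condition A")
ax.plot(hours, growth_b, marker="s", label="Condition B")
ax.set_xlabel("Time (hours)")
ax.set_ylabel("Optical density")
ax.set_title("Bacterial growth under two conditions")
ax.legend()
fig.tight_layout()
plt.show()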
b. Seaborn (Python)
Use Case Scenario: Exploratory Data Analysis (EDA)
c. Plotly (R)
Scenario: A sales manager needs to monitor the performance of sales teams across different
regions in real time.
Application: Plotly (for example, combined with Dash) can be used to create an interactive
dashboard that consolidates data from various sources (CRM, ERP, etc.) to display key
performance indicators (KPIs), sales trends, and regional comparisons. The interactive,
near-real-time capabilities allow the sales manager to make timely decisions based on the
latest data.
d. ggplot2 (R)
Each of these scenarios highlights the unique strengths of the respective tools, demonstrating their
practical applications in real-world situations.
5. Discuss the strengths and weaknesses of each tool.
a. Matplotlib (Python)
Strengths:
• Flexibility: Supports a wide range of plot types and gives fine-grained control over
every element of a figure.
• Integration: Works well with NumPy, Pandas, and SciPy.
• Publication Quality: Produces publication-quality figures in many formats.
Weaknesses:
• Complexity: Can be complex and verbose for creating more advanced plots.
• Steep Learning Curve: Requires a good understanding of Python and Matplotlib's API.
• Less Interactive: Basic interactivity compared to other modern visualization tools.
b. Seaborn (Python)
Strengths:
• Ease of Use: Simplifies the creation of complex visualizations with fewer lines of code.
• Aesthetics: Provides attractive default themes and color palettes.
• Integration: Seamlessly integrates with Pandas, making it easier to work with
structured data.
• Advanced Plots: Includes support for complex statistical plots.
Weaknesses:
• Less Interactive: Limited interactivity compared with modern web-based visualization
tools.
c. Plotly (R)
Strengths:
Weaknesses:
d. ggplot2 (R)
Strengths:
• Grammar of Graphics: Provides a powerful, consistent framework for creating
complex visualizations.
• Layered Approach: Facilitates building plots incrementally, allowing detailed
customization.
• Statistical Tools: Built-in support for various statistical transformations and
summaries.
• Integration: Works well with other R packages, especially those in the tidyverse.
Weaknesses:
These strengths and weaknesses highlight the unique capabilities and limitations of each
tool, helping users choose the right one based on their specific needs and expertise.
c. Plotly (R)
Code:
library(plotly)
library(dplyr)
library(htmlwidgets)

df <- read.csv("games_dataset.csv")

# Colour palette used across the plots (example values; define before plotting)
custom_colors <- c("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728",
                   "#9467bd", "#8c564b", "#e377c2", "#7f7f7f")

# Bar chart: average user rating by genre
p2 <- df %>%
  group_by(Genre) %>%
  summarise(AverageRating = mean(User.Rating, na.rm = TRUE), .groups = "drop") %>%
  plot_ly(
    x = ~reorder(Genre, -AverageRating), y = ~AverageRating,
    type = "bar", marker = list(color = custom_colors)
  ) %>%
  layout(
    title = "Average User Rating by Genre",
    xaxis = list(title = "Genre"), yaxis = list(title = "Average User Rating")
  )

# Box plot: user rating distribution by platform
p3 <- df %>%
  plot_ly(
    x = ~Platform, y = ~User.Rating, type = "box",
    color = ~Platform, colors = custom_colors
  ) %>%
  layout(
    title = "User Rating Distribution by Platform",
    xaxis = list(title = "Platform"), yaxis = list(title = "User Rating")
  )
saveWidget(p3, "plot3.html", selfcontained = FALSE)

# Scatter plot: user ratings over release years, coloured by genre
p4 <- df %>%
  plot_ly(
    x = ~Release.Year, y = ~User.Rating, color = ~Genre,
    type = "scatter", mode = "markers", colors = custom_colors
  ) %>%
  layout(
    title = "User Ratings Over Release Years by Genre",
    xaxis = list(title = "Release Year"), yaxis = list(title = "User Rating")
  )

# Heatmap: average user rating by platform and genre
p5 <- df %>%
  group_by(Platform, Genre) %>%
  summarise(AverageRating = mean(User.Rating, na.rm = TRUE), .groups = "drop") %>%
  plot_ly(
    x = ~Platform, y = ~Genre, z = ~AverageRating,
    type = "heatmap", colorscale = "Viridis"
  ) %>%
  layout(
    title = "Average User Rating by Platform and Genre",
    xaxis = list(title = "Platform"), yaxis = list(title = "Genre")
  )
Output:
d. ggplot2 (R)
Code:
library(ggplot2)
library(dplyr)

# df is the same games_dataset.csv data frame loaded above
p2 <- df %>%
  group_by(Genre) %>%
  summarise(AverageRating = mean(User.Rating, na.rm = TRUE), .groups = "drop") %>%
  ggplot(aes(x = reorder(Genre, -AverageRating), y = AverageRating)) +
  geom_bar(stat = "identity") +
  labs(title = "Average User Rating by Genre", x = "Genre", y = "Average User Rating") +
  theme_minimal() +
  coord_flip()
print(p2)

p3 <- ggplot(df, aes(x = Platform, y = User.Rating)) +
  geom_boxplot() +
  labs(title = "User Rating Distribution by Platform", x = "Platform", y = "User Rating") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(p3)
Output:
7. Based on your analysis, discuss which tools are better suited for different scenarios.
a. Matplotlib (Python)
Example Scenarios:
• A physicist creating detailed visualizations of experimental data.
• An engineer analyzing the results of simulations.
b. Seaborn (Python)
Example Scenarios:
• Exploratory Data Analysis: A data analyst quickly exploring distributions, correlations,
and relationships in a structured dataset with easy-to-create statistical plots.
c. Plotly (R)
Example Scenarios:
• Interactive Data Visualization: Ideal for users who need to create interactive,
web-based visualizations that allow for dynamic data exploration, including 3D
plots and geographical maps.
• Real-Time Data Monitoring: Excellent for scenarios where real-time data
updates and visual exploration are required, such as dashboards that monitor live
data streams.
d. ggplot2 (R)
Example Scenarios:
Summary
• Matplotlib: Best for scientific research and complex data analysis requiring
highly customizable plots.
• Seaborn: Ideal for exploratory data analysis and data storytelling with easy-to-
create, visually appealing statistical plots.
• Plotly: Suited for interactive data visualization, real-time data monitoring, and
web-based data exploration with a focus on interactivity and accessibility.
• ggplot2: Perfect for academic and public health research, as well as detailed
statistical analysis within the R ecosystem.
Choosing the right tool depends on the specific needs of the project, the user's expertise,
and the type of data being analyzed.
LAB ASSIGNMENT-2
Objective: The objective of this lab assignment is to explore and analyze a dataset containing
measurements of electric power consumption in a household over a period of almost 4 years.
You will perform various data visualization tasks to gain insights into electrical quantities,
submetering values, and overall trends.
Task:
2. Subset the data for the given dates (December 2006 and November 2009).
3. Create a histogram of Global Active Power.
4. Create a time series plot of Global Active Power (a sketch of these steps follows).
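A minimal sketch of these tasks in Python, assuming the data file is household_power_consumption.txt with semicolon separators and '?' marking missing values (the standard layout of the UCI household power consumption dataset):

import pandas as pd
import matplotlib.pyplot as plt

# Load the data; '?' marks missing readings in this file
power_df = pd.read_csv("household_power_consumption.txt", sep=";",
                       na_values="?", low_memory=False)
power_df["Datetime"] = pd.to_datetime(power_df["Date"] + " " + power_df["Time"],
                                      format="%d/%m/%Y %H:%M:%S")

# Task 2: subset December 2006 and November 2009
months = power_df["Datetime"].dt.to_period("M").astype(str)
subset = power_df[months.isin(["2006-12", "2009-11"])].dropna(
    subset=["Global_active_power"])

# Task 3: histogram of Global Active Power
subset["Global_active_power"].plot(kind="hist", bins=50,
                                   title="Global Active Power")
plt.xlabel("Global active power (kilowatts)")
plt.show()

# Task 4: time series plot of Global Active Power
subset.set_index("Datetime")["Global_active_power"].plot(
    title="Global Active Power over time")
plt.ylabel("Global active power (kilowatts)")
plt.show()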
The choice of the "optimum" visualization technique largely depends on the specific insights
or patterns you aim to uncover from the data. Here's a breakdown of the techniques you used
and their potential effectiveness:
1. Histogram of Global Active Power: This is ideal for showing the distribution of a
single numeric variable. It provides a clear view of how values are spread across
different ranges, which is particularly useful for identifying common ranges, skewness,
or outliers.
2. Time Series Plot of Global Active Power: This is excellent for observing trends over
time. If you're interested in how the global active power changes over December 2006
and November 2009, this plot gives a clear visualization of trends, spikes, and drops.
3. Plot for Sub Metering Over Time: Similar to the time series plot, this is useful for
comparing multiple time series (Sub_metering_1, Sub_metering_2, and
Sub_metering_3). It's ideal for seeing how different variables behave relative to each
other over time.
4. Scatterplot of Global Active Power vs. Voltage: Scatterplots are great for identifying
relationships or correlations between two variables. This plot can help you see if there's
any direct relationship (linear or non-linear) between active power and voltage.
5. Bar Chart of Sub Metering 1 by Date: Bar charts are good for comparing categorical
data or aggregating numeric data by categories (e.g., dates). This plot helps in
identifying patterns or anomalies in daily energy consumption.
6. Pie Chart of Sub Metering 1 Distribution: Pie charts are typically used to represent
proportions. They work well for categorical data but can be less effective when dealing
with a large number of categories or when proportions are very similar.
7. Count Plot of Platform: If "Platform" is a categorical variable, a count plot is optimal
for visualizing the frequency distribution of categories. It’s a straightforward way to see
how often each category appears in the data.
8. Boxplot of Global Active Power: Boxplots are highly effective for showing the
distribution of a numeric variable, including its median, quartiles, and outliers. They are
useful for comparing distributions across different subsets.
9. Heatmap of Correlations: This is an excellent way to visualize the correlation matrix
of multiple variables. It quickly shows which variables are strongly positively or
negatively correlated, which can be critical for multivariate analysis.
10. Distribution Plot of Global Active Power: Like the histogram, but with the added
benefit of a KDE (Kernel Density Estimate) curve. It provides a smoother view of the
distribution and is often more visually appealing for identifying distribution shapes.
11. Jointplot of Global Active Power and Voltage: A jointplot combines a scatterplot with
histograms (or KDEs) on the axes, offering a comprehensive view of the relationship
between two variables along with their distributions.
These techniques provide the most insight depending on your analysis goals. For overall
analysis, the heatmap is often the most informative for seeing correlations, while time series
plots and scatterplots are excellent for specific trend and relationship explorations.
LAB ASSIGNMENT-3
Title: Implement simple and multi-linear regression to predict profits for a food truck. Compare
the performance of the model on linear and multi-linear regression.
Objective: The objective of this lab assignment is to implement simple and multi-linear
regression models to predict profits for a food truck business. By comparing the performance
of these two regression models, you will gain insights into when and how to use simple and
multi-linear regression techniques.
Dataset Format:
Tasks:
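The dataset format and task list are not reproduced here; below is a minimal sketch of the comparison the conclusion refers to, assuming a CSV named food_truck.csv with columns Population, Years_in_Business, and Profit (the file name and column names are illustrative):

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

df = pd.read_csv("food_truck.csv")                    # illustrative file name
X_simple = df[["Population"]]                         # single predictor
X_multi = df[["Population", "Years_in_Business"]]     # adds a second predictor
y = df["Profit"]

for name, X in [("Simple", X_simple), ("Multiple", X_multi)]:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    model = LinearRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"{name} Linear Regression: R2={r2_score(y_test, y_pred):.3f}, "
          f"MSE={mean_squared_error(y_test, y_pred):.3f}")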
Conclusion: The difference in performance between the two models is relatively minor.
However, Multiple Linear Regression slightly outperforms Simple Linear Regression in
explaining the variability in profits (as indicated by the R² value). Adding "Years in
Business" as a predictor does provide some value, but the improvement is not very
significant in this case.
LAB ASSIGNMENT-4
Title: Implement the non-parametric Locally Weighted Regression algorithm in order to fit
data points. Select an appropriate data set for your experiment and draw graphs.
Objective: To fit data points by assigning different weights to each point based on its proximity
to the query point.
Dataset: Use a synthetic dataset with a sinusoidal pattern to showcase the capabilities of the
Locally Weighted Regression algorithm. You can generate the dataset with the following Python code:
import numpy as np

np.random.seed(0)
X = np.sort(5 * np.random.rand(80, 1), axis=0)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])
Tasks:
1) Generate a dataset.
2) Split each dataset into features (X) and target variable (y).
4) Experiment using multiple query points across the range of the dataset.
5) Create a plot with the original dataset points and the fitted curves for different query
points and bandwidths (see the sketch below).
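A minimal implementation sketch of these tasks; the Gaussian kernel form and the bandwidth values below are illustrative choices:

import numpy as np
import matplotlib.pyplot as plt

def locally_weighted_prediction(x_query, X, y, tau):
    # Predict y at x_query with Gaussian-weighted least squares
    Xb = np.hstack([np.ones_like(X), X])            # add a bias column
    xq = np.array([1.0, x_query])                   # query point with bias
    # Weights fall off with distance from the query point
    w = np.exp(-(X.ravel() - x_query) ** 2 / (2 * tau ** 2))
    W = np.diag(w)
    # Weighted normal equations: theta = (Xb' W Xb)^-1 Xb' W y
    theta = np.linalg.pinv(Xb.T @ W @ Xb) @ Xb.T @ W @ y
    return xq @ theta

# Dataset from the assignment
np.random.seed(0)
X = np.sort(5 * np.random.rand(80, 1), axis=0)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])

# Evaluate the local fit at many query points, for several bandwidths
query_points = np.linspace(X.min(), X.max(), 200)
plt.scatter(X, y, s=15, color="gray", label="data")
for tau in (0.1, 0.3, 1.0):
    preds = [locally_weighted_prediction(xq, X, y, tau) for xq in query_points]
    plt.plot(query_points, preds, label=f"tau={tau}")
plt.legend()
plt.title("Locally Weighted Regression fits")
plt.show()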
LAB ASSIGNMENT-5
Title: For a given set of training data examples stored in a .CSV file, implement and
demonstrate various feature selection algorithms and compare the performance of the
algorithms.
Objective: The objective of this lab assignment is to implement and demonstrate various
feature selection algorithms on a given set of training data stored in a .CSV file. The goal is to
compare the performance of these algorithms in terms of improving model accuracy and
reducing dimensionality.
Tasks:
2) Visualize the performance metrics (e.g., accuracy) for each feature selection method
using appropriate plots (e.g., a bar chart or line plot); a small example follows.
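One small example of such a plot; the method names and accuracy values below are placeholders to be replaced with the results actually measured:

import matplotlib.pyplot as plt

methods = ["All features", "SelectKBest", "RFE", "Tree importance"]   # placeholder labels
accuracies = [0.91, 0.93, 0.94, 0.92]                                 # placeholder values

plt.bar(methods, accuracies)
plt.ylim(0, 1)
plt.ylabel("Accuracy")
plt.title("Model accuracy by feature selection method")
plt.xticks(rotation=20)
plt.tight_layout()
plt.show()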
LAB ASSIGNMENT-6
Title: Apply different Machine Learning approaches for the classification task. Compare the
performance of the different ML approaches in terms of accuracy, precision, and recall.
Objective: The objective of this lab assignment is to apply various Machine Learning (ML)
approaches for a classification task and compare their performance in terms of accuracy,
precision, and recall. You will gain hands-on experience in implementing and evaluating
different ML algorithms, understanding their strengths and weaknesses, and interpreting their
results.
2. Data Preprocessing
3. Data Splitting
4. Model Training
5. Model Evaluation
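A minimal sketch of these steps; the dataset (scikit-learn's breast cancer data) and the three classifiers are illustrative choices, not mandated by the assignment:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Preprocessing: standardize features (helps logistic regression and the SVM)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "SVM": SVC(),
}
for name, model in models.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    print(f"{name}: accuracy={accuracy_score(y_test, y_pred):.3f}, "
          f"precision={precision_score(y_test, y_pred):.3f}, "
          f"recall={recall_score(y_test, y_pred):.3f}")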
LAB ASSIGNMENT-7
Title: Train any machine learning classifier on the imbalanced dataset. Then balance the dataset
by using oversampling techniques. Compare the model performance before and after
oversampling.
Objective: In this lab assignment, you will work with an imbalanced dataset and train a
machine learning classifier on it. After that, you will apply oversampling techniques to balance
the dataset and compare the model's performance before and after oversampling. The goal is to
observe how oversampling affects the classifier's performance when dealing with imbalanced
data.
Tasks:
5. Choose a machine learning classifier of your choice. For example, you can use Logistic
Regression, Random Forest, or Support Vector Machine (SVM).
6. Train the chosen classifier on the imbalanced dataset and evaluate its performance on
the test set.
7. Apply oversampling techniques (e.g., Random Oversampling, SMOTE - Synthetic
Minority Over-sampling Technique) to balance the dataset.
8. Train the same classifier on the balanced dataset obtained after oversampling and
evaluate its performance on the test set.
9. Compare the performance metrics (e.g., accuracy, precision, recall, F1-score) of the
classifier before and after oversampling.
10. Discuss your observations and insights into how oversampling affects the model's
performance on the imbalanced dataset.
Overall, oversampling, when used correctly, helps improve the performance of models on
imbalanced datasets by providing a more representative learning environment, enabling the
classifier to better handle both classes.
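A minimal sketch of this before/after comparison, assuming the imbalanced-learn package for SMOTE and a synthetic imbalanced dataset (both are illustrative choices):

from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE   # requires the imbalanced-learn package

# Synthetic imbalanced data: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9, 0.1],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Before oversampling
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Before oversampling:\n", classification_report(y_test, clf.predict(X_test)))

# Balance only the training data with SMOTE, then retrain and re-evaluate
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
print("Resampled class counts:", Counter(y_res))
clf_bal = LogisticRegression(max_iter=1000).fit(X_res, y_res)
print("After oversampling:\n", classification_report(y_test, clf_bal.predict(X_test)))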
LAB ASSIGNMENT-8
Title: Apply different feature selection approaches for the classification/regression task.
Compare the performance of the different feature selection approaches.
Objective: The objective of this lab assignment is to explore various feature selection
techniques for classification and regression tasks.
Dataset: Use the UCI Iris dataset for the classification task and the California Housing
dataset for the regression task.
Tasks:
# Load Libraries
from sklearn.datasets import load_iris, fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.feature_selection import SelectKBest, mutual_info_classif, f_regression, RFE
import pandas as pd
import numpy as np
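# Assumed sketch of the omitted selection and evaluation steps that produce
# accuracy_iris_kbest and accuracy_iris_rfe (the model choice and k=2 are
# illustrative): score a LogisticRegression on a held-out split of the Iris
# data after SelectKBest and after RFE.
X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42, stratify=y_iris)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

# SelectKBest keeps the k features with the highest mutual information scores
kbest = SelectKBest(mutual_info_classif, k=2).fit(X_tr, y_tr)
clf = LogisticRegression(max_iter=1000).fit(kbest.transform(X_tr), y_tr)
accuracy_iris_kbest = accuracy_score(y_te, clf.predict(kbest.transform(X_te)))

# RFE recursively removes the least important features according to the model
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X_tr, y_tr)
clf = LogisticRegression(max_iter=1000).fit(rfe.transform(X_tr), y_tr)
accuracy_iris_rfe = accuracy_score(y_te, clf.predict(rfe.transform(X_te)))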
# Compare Performance
print("Iris Dataset (Classification)")
print(f"Accuracy with SelectKBest: {accuracy_iris_kbest}")
print(f"Accuracy with RFE: {accuracy_iris_rfe}")
For the regression task on the California Housing dataset, SelectKBest resulted in a
slightly lower MSE compared to RFE, suggesting that SelectKBest performed
marginally better in minimizing prediction errors.
Overall, both feature selection methods seem to work effectively, with SelectKBest showing a
slight edge in the regression task, while for the classification task, both methods achieved the
same level of performance.
LAB ASSIGNMENT-9
Title: Write a program to demonstrate the working of the decision tree-based CART algorithm.
Build the decision tree and classify a new sample using a suitable dataset. Compare the
performance of CART with that of ID3 and C4.5 in terms of accuracy, recall, precision, and
sensitivity.
Objective: The objective of this lab assignment is to implement the decision tree-based
Classification and Regression Trees (CART) algorithm and compare its performance with other
decision tree algorithms, namely ID3 and C4.5, in terms of accuracy, recall, precision, and
sensitivity. The assignment includes building decision trees, classifying new samples, and
evaluating the models using a suitable dataset.
Tasks:
import numpy as np
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, precision_score, confusion_matrix
def get_split(dataset):
    # Find the best split point by exhaustively evaluating the Gini index
    # of every candidate (attribute, value) pair.
    class_values = list(set(row[-1] for row in dataset))
    b_index, b_value, b_score, b_groups = 999, 999, 999, None
    for index in range(len(dataset[0]) - 1):
        for row in dataset:
            groups = test_split(index, row[index], dataset)
            gini = gini_index(groups, class_values)
            if gini < b_score:
                b_index, b_value, b_score, b_groups = index, row[index], gini, groups
    return {'index': b_index, 'value': b_value, 'groups': b_groups}
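# Helper functions referenced by get_split above (a sketch of the standard
# CART utilities assumed here: a binary split on one attribute and the
# weighted Gini impurity of the resulting groups).
def test_split(index, value, dataset):
    # Split rows on attribute `index` at threshold `value`
    left = [row for row in dataset if row[index] < value]
    right = [row for row in dataset if row[index] >= value]
    return left, right

def gini_index(groups, classes):
    # Weighted Gini impurity across the candidate groups
    n_instances = float(sum(len(group) for group in groups))
    gini = 0.0
    for group in groups:
        size = float(len(group))
        if size == 0:
            continue
        score = 0.0
        for class_val in classes:
            p = [row[-1] for row in group].count(class_val) / size
            score += p * p
        gini += (1.0 - score) * (size / n_instances)
    return gini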
# Display Results
print(f"CART Metrics: Accuracy={cart_metrics[0]:.3f}, Recall={cart_metrics[1]:.3f}, "
      f"Precision={cart_metrics[2]:.3f}, Sensitivity={cart_metrics[3]:.3f}")
print(f"ID3 Metrics: Accuracy={id3_metrics[0]:.3f}, Recall={id3_metrics[1]:.3f}, "
      f"Precision={id3_metrics[2]:.3f}, Sensitivity={id3_metrics[3]:.3f}")
print(f"C4.5 Metrics: Accuracy={c45_metrics[0]:.3f}, Recall={c45_metrics[1]:.3f}, "
      f"Precision={c45_metrics[2]:.3f}, Sensitivity={c45_metrics[3]:.3f}")
Insights:
• The ID3 model achieves the highest accuracy (0.790), precision, recall, and sensitivity
among the three.
• The CART model has slightly lower accuracy but shows a relatively high sensitivity
(0.839), indicating its ability to correctly identify positive instances.
• The C4.5 model performance is close to CART in terms of accuracy but has a lower
sensitivity.
Overall, ID3 appears to be the best-performing model in this comparison based on accuracy
and balanced precision-recall metrics.
LAB ASSIGNMENT-10
Objective: The objective of this lab assignment is to implement the K-Means clustering
algorithm from scratch in Python and gain a deep understanding of how the algorithm works.
Tasks:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from scipy.spatial.distance import cdist
# Main flow
X = generate_data()
How It Works: The K-Means algorithm aims to minimize the within-cluster sum of squares
(WCSS), also known as inertia, which is the sum of squared distances between data points and
their respective cluster centroids.
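A minimal from-scratch sketch of the procedure described above; generate_data and the kmeans routine below are illustrative stand-ins for the omitted helper code:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from scipy.spatial.distance import cdist

def generate_data(n_samples=300, centers=4, seed=42):
    # Synthetic blobs so the clusters are easy to inspect visually
    X, _ = make_blobs(n_samples=n_samples, centers=centers, random_state=seed)
    return X

def kmeans(X, k, n_iters=100, seed=42):
    rng = np.random.default_rng(seed)
    # Initialise centroids by picking k distinct data points at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid
        labels = np.argmin(cdist(X, centroids), axis=1)
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    labels = np.argmin(cdist(X, centroids), axis=1)
    # WCSS (inertia): sum of squared distances to the assigned centroids
    wcss = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, wcss

X = generate_data()
labels, centroids, wcss = kmeans(X, k=4)
print(f"WCSS (inertia): {wcss:.2f}")
plt.scatter(X[:, 0], X[:, 1], c=labels, s=15)
plt.scatter(centroids[:, 0], centroids[:, 1], marker="x", s=120, color="red")
plt.title("K-Means clustering (from scratch)")
plt.show()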
LAB ASSIGNMENT-11
Tasks:
5) Visualize the dataset before and after PCA using scatterplots or other appropriate
visualizations.
6) Evaluate the impact of dimensionality reduction on the dataset's performance in a
machine learning task (e.g., classification or regression).
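A minimal sketch of tasks 5 and 6, assuming the Iris dataset and a logistic regression classifier as illustrative choices:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Example dataset (an assumption); swap in the assignment's own dataset
X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Task 5: visualize the data before PCA (first two features) and after PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(X_scaled[:, 0], X_scaled[:, 1], c=y, s=15)
axes[0].set_title("Original features (first two)")
axes[1].scatter(X_pca[:, 0], X_pca[:, 1], c=y, s=15)
axes[1].set_title("First two principal components")
plt.show()

# Task 6: compare classifier accuracy with and without the reduction
for name, data in [("original features", X_scaled), ("2 principal components", X_pca)]:
    X_tr, X_te, y_tr, y_te = train_test_split(data, y, test_size=0.3,
                                              random_state=42, stratify=y)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(f"Accuracy with {name}: {accuracy_score(y_te, clf.predict(X_te)):.3f}")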
PCA (Principal Component Analysis) can be beneficial in various scenarios, particularly where
data dimensionality is high or complex. Here are some common use cases:
1. Data Visualization:
o PCA can reduce high-dimensional data to 2 or 3 principal components, making
it easier to visualize patterns or clusters in the data.
o This is particularly useful for exploring relationships and structures within
complex datasets.
2. Noise Reduction:
o PCA can help filter out noise by retaining only the components that carry
significant information, thus improving the quality of the data.
o This is often used in image processing and signal processing to enhance the
clarity of data.
3. Speeding Up Machine Learning Algorithms:
o By reducing the number of dimensions, PCA can speed up the training process
of machine learning models, especially when working with large datasets.
o This is crucial for algorithms like support vector machines (SVMs) or neural
networks that can become computationally expensive with high-dimensional
data.
4. Feature Selection:
o PCA can be used as a feature extraction method, allowing you to retain the most
informative features from the dataset.
o This helps in reducing overfitting and improving model generalization by
focusing on the most significant variables.
5. Multicollinearity Reduction:
o In datasets with highly correlated features, PCA can transform the features into
uncorrelated principal components.
o This is helpful in regression analysis to avoid issues like multicollinearity,
where predictors are linearly dependent on each other.
6. Anomaly Detection:
o PCA can be used to project data into a lower-dimensional space where
anomalies (outliers) are more distinguishable.
o This technique is often applied in fraud detection, network security, and
industrial quality control.
7. Image Compression:
o PCA is effective for reducing the size of images while preserving their essential
features, making it a common choice for image compression.
o This can help in scenarios where storage space is limited or when transferring
large datasets over the network.
8. Gene Expression Analysis:
o In bioinformatics, PCA is used to analyze high-dimensional gene expression
data, helping researchers identify patterns, group similar genes, and study
genetic variations.
o It simplifies the interpretation of genetic data and aids in understanding complex
biological processes.
These use cases highlight how PCA can be an essential tool for handling high-dimensional data
and improving the performance and interpretability of machine learning models.
LAB ASSIGNMENT-12
Link: https://cse22-iiith.vlabs.ac.in/exp/forward-neural-networks/
Tasks:
1) Read Theory
A Multilayer Feedforward Neural Network (MLFFNN) with two or more hidden layers and
nonlinear units can handle complex pattern classification tasks. It adjusts weights to map input
vectors to desired outputs through backpropagation, which minimizes the error between
actual and desired outputs using gradient descent.
In backpropagation, errors are calculated at the output layer and propagated backward to
update the weights.
Backpropagation can update weights either after each input (pattern mode) or after all inputs
(batch mode). Its performance depends on factors like initial weights, learning rate, and data
presentation. While convergence is not guaranteed, adjustments and stopping criteria help
guide the process.
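A minimal NumPy sketch of this procedure on the XOR problem, using one hidden layer, sigmoid units, and batch-mode updates (the architecture and hyperparameters are illustrative, and how quickly it converges depends on the random initial weights):

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)   # desired outputs

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)     # input -> hidden weights
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)     # hidden -> output weights
lr = 0.5                                          # learning rate

for epoch in range(5000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    y = sigmoid(h @ W2 + b2)
    # Error at the output layer, then propagated backward to the hidden layer
    delta_out = (y - t) * y * (1 - y)
    delta_hid = (delta_out @ W2.T) * h * (1 - h)
    # Gradient-descent updates (batch mode: all input patterns at once)
    W2 -= lr * (h.T @ delta_out)
    b2 -= lr * delta_out.sum(axis=0)
    W1 -= lr * (X.T @ delta_hid)
    b1 -= lr * delta_hid.sum(axis=0)

print("Outputs after training:", y.ravel().round(3))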
Conclusion
The configuration of layers and neurons in an MLFFNN significantly affects the model's
performance and training dynamics. Striking the right balance is essential for achieving optimal
results, and experimenting with different architectures is vital to understanding how these
changes influence the network's ability to learn from data.
It's important to analyze the performance metrics and visualization plots (like loss curves) to
make informed decisions about model architecture.