
PANDIT DEENDAYAL ENERGY UNIVERSITY

SCHOOL OF TECHNOLOGY

Course: Machine Learning Lab


Course Code: 20CP401P

B.Tech. (Computer Science and Engineering)


Semester 7

Submitted To: Sidheswar Routray

Submitted By: Akshi Vekaria (21BCP420)
G-12
LAB ASSIGNMENT-1

Title: Discuss and analyze different data visualization tools.

Objective: The objective of this lab assignment is to explore and analyze various data visualization tools
used for representing and understanding complex datasets. Through this assignment, you will gain
insights into the strengths, weaknesses, and practical applications of different visualization tools.

Tasks:

1. Introduction to Data Visualization

What is Data Visualization?


Data visualization is the practice of representing data through visual elements like charts, graphs,
and maps to make information easily understandable.

Importance of Data Visualization


• Simplifies complex data: Makes large datasets comprehensible.

• Reveals insights: Highlights patterns and trends.


• Aids decision-making: Facilitates quicker and more informed decisions.
• Improves communication: Effectively conveys data stories and arguments.
• Enhances engagement: Interactive visuals capture and retain user interest.

Key Concepts
• Data Types:

o Quantitative Data: Numerical, measurable data.
o Qualitative Data: Descriptive, non-numerical data.

In essence, data visualization transforms raw data into a visual context, enabling better analysis
and understanding.

2. Selecting Data Visualization Tools: Research and select at least four different data visualization
tools. Examples include:
a. Matplotlib (Python)
b. Seaborn (Python)
c. ggplot2 (R)
d. Plotly (Python, R)
3. Provide a brief overview of each tool's capabilities and features.

a. Matplotlib (Python)
Capabilities and Features:

• 2D Plotting: Supports a wide range of static, animated, and interactive 2D plots, including
line plots, scatter plots, bar charts, histograms, and pie charts.
• Customization: Highly customizable plots, allowing control over every aspect of the plot.
• Integration: Integrates well with other Python libraries like NumPy, Pandas, and SciPy.
• Publication Quality: Can produce publication-quality figures in various formats and
interactive environments across platforms.
• Extensibility: Extensive range of third-party packages built on Matplotlib to extend its
functionality.

b. Seaborn (Python)

Capabilities and Features:

• Statistical Graphics: Built on top of Matplotlib, it specializes in making complex
statistical plots easier to create.
• Themes and Color Palettes: Provides built-in themes and color palettes to make
visualizations more attractive and informative.
• Data Frames: Integrates seamlessly with Pandas data frames, making it easier to work
with structured data.
• High-Level Interface: Simplifies the process of creating complex visualizations with
fewer lines of code.
• Advanced Visualizations: Supports advanced visualizations like heatmaps, violin plots,
and pair plots.
c. Plotly(R)

Capabilities and Features:

• Interactive Visualizations: A powerful tool for creating interactive and dynamic
visualizations that allow users to explore data through zooming, panning, and hovering.
• Wide Range of Plot Types: Supports various plot types, including line charts, scatter plots,
bar charts, heatmaps, contour plots, 3D plots, and geographical maps.
• Cross-Platform Compatibility: Integrates seamlessly with multiple programming
languages such as Python, R, MATLAB, and Julia, making it versatile for different
environments.
• Web-Based Visualizations: Enables the embedding of plots into web pages and
dashboards, making them accessible and interactive online.
• Real-Time Data Visualization: Capable of visualizing real-time data, ideal for live
dashboards and continuous monitoring.
d. ggplot2 (R)

Capabilities and Features:

• Grammar of Graphics: Implements the Grammar of Graphics, providing a powerful and
consistent way to describe and create complex visualizations.
• Layered Approach: Allows users to build plots incrementally using layers, making it easy
to customize and extend plots.
• Themes and Customization: Offers extensive customization options, including themes
and scales.
• Statistical Transformations: Supports built-in statistical transformations for summarizing
data.
• Integration: Integrates seamlessly with other R packages, especially those in the tidyverse
collection.

These tools each have their unique strengths and are suited to different aspects of data visualization,
from basic charting to complex, interactive dashboards.

4. Provide at least one practical use case scenario where each tool would be particularly useful.

a. Matplotlib (Python)

Use Case Scenario: Scientific Research and Analysis

• Scenario: A biologist is studying the growth patterns of bacteria under different conditions.
• Application: Using Matplotlib, the biologist can create detailed line plots to compare
growth curves across different experimental conditions. The ability to customize plots
extensively allows for clear presentation of results in research papers.

b. Seaborn (Python)
Use Case Scenario: Exploratory Data Analysis (EDA)

• Scenario: A data scientist is exploring a new dataset on customer demographics and
purchasing behavior.
• Application: Seaborn can be used to create heatmaps to visualize correlations between
different variables, violin plots to show the distribution of purchase amounts across
different age groups, and pair plots to understand relationships between multiple variables
simultaneously. This helps in identifying trends and patterns quickly.

c. Plotly(R)

Use Case Scenario: Corporate Sales Dashboard

• Scenario: A sales manager needs to monitor the performance of sales teams across different
regions in real time.
• Application: Plotly can be used to create an interactive dashboard that consolidates data from
various sources (CRM, ERP, etc.) to display key performance indicators (KPIs), sales trends, and
regional comparisons. Its real-time data capabilities allow the sales manager to make timely
decisions based on the latest data.

d. ggplot2 (R)

Use Case Scenario: Public Health Data Visualization

• Scenario: An epidemiologist is analyzing the spread of a disease across different regions
and demographics.
• Application: Using ggplot2, the epidemiologist can create multi-faceted plots that show
the incidence rates across different age groups, genders, and regions. The layered approach
of ggplot2 allows for adding trend lines, statistical summaries, and customized themes to
enhance the clarity and impact of the visualizations.

Each of these scenarios highlights the unique strengths of the respective tools, demonstrating their
practical applications in real-world situations.
5. Discuss the strengths and weaknesses of each tool.

a. Matplotlib (Python)

Strengths:

• Flexibility: Highly customizable, allowing fine control over plot appearance.


• Wide Range of Plots: Supports a variety of 2D plotting capabilities.
• Integration: Works well with other Python libraries like NumPy and Pandas.
• Output Formats: Can produce publication-quality figures in multiple formats.
Weaknesses:

• Complexity: Can be complex and verbose for creating more advanced plots.
• Steep Learning Curve: Requires a good understanding of Python and
Matplotlib's API.
• Less Interactive: Basic interactivity compared to other modern visualization
tools.

b. Seaborn (Python)

Strengths:

• Ease of Use: Simplifies the creation of complex visualizations with fewer lines
of code.
• Aesthetics: Provides attractive default themes and color palettes.
• Integration: Seamlessly integrates with Pandas, making it easier to work with
structured data.
• Advanced Plots: Includes support for complex statistical plots.

Weaknesses:

• Limited Customization: Less flexible than Matplotlib for fine-tuning plot
details.
• Dependency on Matplotlib: Built on top of Matplotlib, which can sometimes
require understanding Matplotlib for deeper customizations.
• Performance: Can be slower with very large datasets compared to some other
tools.

c. Plotly(R)

Strengths:

• Interactivity: Highly interactive visualizations with support for zooming,
panning, and real-time data updates, enhancing data exploration.
• Customization: Extensive customization options allow for detailed control over
the appearance and behavior of plots.
• Cross-Platform Compatibility: Integrates seamlessly with multiple
programming languages (Python, R, MATLAB, etc.), making it versatile for
various environments.
• 3D Plotting and Mapping: Advanced support for 3D visualizations and
geographic mapping, providing powerful tools for complex data analysis.
• Web Integration: Plots can be easily embedded into web pages and dashboards,
enabling online accessibility and interactivity.

Weaknesses:

• Learning Curve: Requires some programming knowledge, particularly in
Python or R, which can be a barrier for non-programmers.
• Performance: May struggle with rendering very large datasets or highly
complex visualizations, affecting performance.
• Cost for Advanced Features: While Plotly's core functionality is open-source,
some advanced features and enterprise options require a paid subscription.
• Limited Built-In Data Transformation: Lacks built-in data cleansing and
transformation tools, requiring external processing before visualization.
• Complexity in Customization: While highly customizable, achieving the
desired outcome may require advanced programming skills, making it less
straightforward for beginners.

d. ggplot2 (R)

Strengths:

• Grammar of Graphics: Provides a powerful, consistent framework for creating
complex visualizations.
• Layered Approach: Facilitates building plots incrementally, allowing detailed
customization.
• Statistical Tools: Built-in support for various statistical transformations and
summaries.
• Integration: Works well with other R packages, especially those in the tidyverse.

Weaknesses:

• Learning Curve: Requires understanding the Grammar of Graphics and R
programming.
• Performance: Can be slower with very large datasets.
• Less Interactive: Primarily designed for static plots, with limited interactivity
compared to tools like Plotly.

These strengths and weaknesses highlight the unique capabilities and limitations of each
tool, helping users choose the right one based on their specific needs and expertise.

6. Include visual examples (screenshots or code snippets) of data visualizations created
using each tool.

a. Matplotlib (Python)
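
A minimal Matplotlib sketch on synthetic data (the years and ratings below are invented stand-ins used only for illustration):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
years = np.arange(2000, 2021)                                          # hypothetical release years
ratings = 7 + 0.05 * (years - 2000) + rng.normal(0, 0.3, years.size)   # synthetic average ratings

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].plot(years, ratings, marker="o", color="steelblue")
axes[0].set(title="Average Rating by Year", xlabel="Release Year", ylabel="User Rating")
axes[1].hist(ratings, bins=10, color="salmon", edgecolor="black")
axes[1].set(title="Rating Distribution", xlabel="User Rating", ylabel="Count")
plt.tight_layout()
plt.show()
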
b. Seaborn (Python)
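
A minimal Seaborn sketch using the library's built-in "tips" example dataset (chosen here purely for illustration; load_dataset fetches it on first use):

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")  # small example dataset shipped with Seaborn's data repository

sns.set_theme(style="whitegrid")
sns.boxplot(data=tips, x="day", y="total_bill", hue="sex")
plt.title("Total Bill by Day and Sex")
plt.show()

sns.heatmap(tips[["total_bill", "tip", "size"]].corr(), annot=True, cmap="viridis")
plt.title("Correlation Heatmap")
plt.show()
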
c. Plotly (R)

Code:
library(plotly)
library(dplyr)
library(htmlwidgets)

df <- read.csv("games_dataset.csv")

custom_colors <- c("red", "blue", "green", "purple", "orange", "pink",
                   "brown", "gray")

# 1. Line + marker plot of user ratings over release years
p1 <- df %>%
  plot_ly(
    x = ~Release.Year, y = ~User.Rating, type = "scatter", mode = "lines+markers",
    line = list(color = "blue"), marker = list(color = "red")
  ) %>%
  layout(
    title = "User Ratings Over Release Years",
    xaxis = list(title = "Release Year"),
    yaxis = list(title = "User Rating")
  )
saveWidget(p1, "plot1.html", selfcontained = FALSE)

# 2. Bar chart of average rating by genre
p2 <- df %>%
  group_by(Genre) %>%
  summarise(AverageRating = mean(User.Rating, na.rm = TRUE), .groups = "drop") %>%
  plot_ly(
    x = ~reorder(Genre, -AverageRating), y = ~AverageRating, type = "bar",
    marker = list(color = custom_colors)
  ) %>%
  layout(
    title = "Average User Rating by Genre",
    xaxis = list(title = "Genre"),
    yaxis = list(title = "Average User Rating")
  )
saveWidget(p2, "plot2.html", selfcontained = FALSE)

# 3. Box plot of rating distribution by platform
p3 <- df %>%
  plot_ly(
    x = ~Platform, y = ~User.Rating, type = "box", color = ~Platform,
    colors = custom_colors
  ) %>%
  layout(
    title = "User Rating Distribution by Platform",
    xaxis = list(title = "Platform"),
    yaxis = list(title = "User Rating")
  )
saveWidget(p3, "plot3.html", selfcontained = FALSE)

# 4. Scatter plot of ratings over years, colored by genre
p4 <- df %>%
  plot_ly(
    x = ~Release.Year, y = ~User.Rating, color = ~Genre, type = "scatter",
    mode = "markers", colors = custom_colors
  ) %>%
  layout(
    title = "User Ratings Over Release Years by Genre",
    xaxis = list(title = "Release Year"),
    yaxis = list(title = "User Rating")
  )
saveWidget(p4, "plot4.html", selfcontained = FALSE)

# 5. Heatmap of average rating by platform and genre
p5 <- df %>%
  group_by(Platform, Genre) %>%
  summarise(AverageRating = mean(User.Rating, na.rm = TRUE), .groups = "drop") %>%
  plot_ly(
    x = ~Platform, y = ~Genre, z = ~AverageRating, type = "heatmap",
    colorscale = "Viridis"
  ) %>%
  layout(
    title = "Average User Rating by Platform and Genre",
    xaxis = list(title = "Platform"),
    yaxis = list(title = "Genre")
  )
saveWidget(p5, "plot5.html", selfcontained = FALSE)

Output:
d. ggplot2 (R)
Code:
library(ggplot2)
library(dplyr)

# Ensure the CSV file is in the same directory or provide the correct path
df <- read.csv("games_dataset.csv")

# 1. Line + point plot of user ratings over release years
p1 <- ggplot(df, aes(x = Release.Year, y = User.Rating)) +
  geom_line() +
  geom_point() +
  labs(title = "User Ratings Over Release Years", x = "Release Year", y = "User Rating") +
  theme_minimal()
print(p1)

# 2. Bar chart of average rating by genre
p2 <- df %>%
  group_by(Genre) %>%
  summarise(AverageRating = mean(User.Rating, na.rm = TRUE), .groups = "drop") %>%
  ggplot(aes(x = reorder(Genre, -AverageRating), y = AverageRating)) +
  geom_bar(stat = "identity") +
  labs(title = "Average User Rating by Genre", x = "Genre", y = "Average User Rating") +
  theme_minimal() +
  coord_flip()
print(p2)

# 3. Box plot of rating distribution by platform
p3 <- ggplot(df, aes(x = Platform, y = User.Rating)) +
  geom_boxplot() +
  labs(title = "User Rating Distribution by Platform", x = "Platform", y = "User Rating") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(p3)

# 4. Scatter plot of ratings over years, colored by genre
p4 <- ggplot(df, aes(x = Release.Year, y = User.Rating, color = Genre)) +
  geom_point() +
  labs(title = "User Ratings Over Release Years by Genre", x = "Release Year", y = "User Rating") +
  theme_minimal()
print(p4)

# 5. Grouped bar chart of average rating by platform and genre
p5 <- df %>%
  group_by(Platform, Genre) %>%
  summarise(AverageRating = mean(User.Rating, na.rm = TRUE), .groups = "drop") %>%
  ggplot(aes(x = Platform, y = AverageRating, fill = Genre)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Average User Rating by Platform and Genre", x = "Platform", y = "Average User Rating") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(p5)

Output:
7. Based on your analysis, discuss which tools are better suited for different scenarios.
a. Matplotlib (Python)

Best Suited For:

• Scientific Research: Ideal for researchers needing highly customizable,
publication-quality plots. Its flexibility and extensive customization options make
it perfect for detailed, precise visualizations.
• Complex Data Analysis: Suitable for projects requiring in-depth analysis and
visualization of complex data relationships.

Example Scenarios:
• A physicist creating detailed visualizations of experimental data.
• An engineer analyzing the results of simulations.

b. Seaborn (Python)

Best Suited For:

• Exploratory Data Analysis (EDA): Great for quickly generating visually
appealing and informative statistical plots during the initial phases of data
analysis.
• Data Storytelling: Useful for creating attractive, easy-to-understand
visualizations that communicate data insights effectively.

Example Scenarios:

• A data scientist exploring customer demographics and behavior patterns.
• A statistician visualizing the distribution of survey responses.

c. Plotly (R)

Best Suited For:

• Interactive Data Visualization: Ideal for users who need to create interactive,
web-based visualizations that allow for dynamic data exploration, including 3D
plots and geographical maps.
• Real-Time Data Monitoring: Excellent for scenarios where real-time data
updates and visual exploration are required, such as dashboards that monitor live
data streams.

Example Scenarios:

• A data analyst developing an interactive dashboard to monitor social media
metrics in real time.
• A geospatial analyst visualizing geographic data with interactive maps that users
can explore online.

d. ggplot2 (R)

Best Suited For:


• Academic and Public Health Research: Ideal for researchers needing to create
complex, multi-faceted visualizations to present their findings clearly and
effectively.
• Statistical Analysis: Well-suited for statisticians and data analysts working
within the R ecosystem who need powerful tools for statistical data visualization.

Example Scenarios:

• An epidemiologist analyzing and visualizing the spread of diseases.


• A sociologist presenting survey data with detailed statistical summaries.

Summary

• Matplotlib: Best for scientific research and complex data analysis requiring
highly customizable plots.
• Seaborn: Ideal for exploratory data analysis and data storytelling with easy-to-
create, visually appealing statistical plots.
• Plotly: Suited for interactive data visualization, real-time data monitoring, and
web-based data exploration with a focus on interactivity and accessibility.
• ggplot2: Perfect for academic and public health research, as well as detailed
statistical analysis within the R ecosystem.

Choosing the right tool depends on the specific needs of the project, the user's expertise,
and the type of data being analysed.
LAB ASSIGNMENT-2

Title: Measurements of electric power consumption in one household with a one-minute
sampling rate over a period of almost 4 years. Different electrical quantities and some
sub-metering values are available.

Objective: The objective of this lab assignment is to explore and analyze a dataset containing
measurements of electric power consumption in a household over a period of almost 4 years.
You will perform various data visualization tasks to gain insights into electrical quantities,
submetering values, and overall trends.

Task:

1. Load the data

2. Subset the data from the given dates (December 2006 and November 2009)

3. Create a histogram
4. Create a Time series

5. Create a plot for sub metering


6. Create multiple plots, such as: Scatterplot, Histogram, Bar Chart, Pie Chart, Count
plot, Boxplot, Heatmap, Distplot, Jointplot.
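
A minimal sketch of tasks 1-4, assuming the standard UCI household_power_consumption.txt file (semicolon-separated, with '?' marking missing values) is available in the working directory:

import pandas as pd
import matplotlib.pyplot as plt

# Task 1: load the data
df = pd.read_csv("household_power_consumption.txt", sep=";", na_values="?", low_memory=False)

# Combine Date and Time into a single datetime index (Date is dd/mm/yyyy in this file)
df["DateTime"] = pd.to_datetime(df["Date"] + " " + df["Time"], dayfirst=True)
df = df.set_index("DateTime")

# Task 2: keep observations between December 2006 and November 2009
# (relies on the file's chronological ordering)
subset = df.loc["2006-12-01":"2009-11-30"]

# Task 3: histogram of Global_active_power
subset["Global_active_power"].astype(float).plot.hist(bins=50, color="orange")
plt.xlabel("Global Active Power (kilowatts)")
plt.title("Histogram of Global Active Power")
plt.show()

# Task 4: time series of Global_active_power (daily mean to keep the plot readable)
subset["Global_active_power"].astype(float).resample("D").mean().plot()
plt.ylabel("Global Active Power (kilowatts)")
plt.title("Daily Mean Global Active Power")
plt.show()
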
Which visualization techniques show the optimum visualization?

The choice of the "optimum" visualization technique largely depends on the specific insights
or patterns you aim to uncover from the data. Here's a breakdown of the techniques you used
and their potential effectiveness:

1. Histogram of Global Active Power: This is ideal for showing the distribution of a
single numeric variable. It provides a clear view of how values are spread across
different ranges, which is particularly useful for identifying common ranges, skewness,
or outliers.
2. Time Series Plot of Global Active Power: This is excellent for observing trends over
time. If you're interested in how the global active power changes over December 2006
and November 2009, this plot gives a clear visualization of trends, spikes, and drops.
3. Plot for Sub Metering Over Time: Similar to the time series plot, this is useful for
comparing multiple time series (Sub_metering_1, Sub_metering_2, and
Sub_metering_3). It's ideal for seeing how different variables behave relative to each
other over time.
4. Scatterplot of Global Active Power vs. Voltage: Scatterplots are great for identifying
relationships or correlations between two variables. This plot can help you see if there's
any direct relationship (linear or non-linear) between active power and voltage.
5. Bar Chart of Sub Metering 1 by Date: Bar charts are good for comparing categorical
data or aggregating numeric data by categories (e.g., dates). This plot helps in
identifying patterns or anomalies in daily energy consumption.
6. Pie Chart of Sub Metering 1 Distribution: Pie charts are typically used to represent
proportions. They work well for categorical data but can be less effective when dealing
with a large number of categories or when proportions are very similar.
7. Count Plot of Platform: If "Platform" is a categorical variable, a count plot is optimal
for visualizing the frequency distribution of categories. It’s a straightforward way to see
how often each category appears in the data.
8. Boxplot of Global Active Power: Boxplots are highly effective for showing the
distribution of a numeric variable, including its median, quartiles, and outliers. They are
useful for comparing distributions across different subsets.
9. Heatmap of Correlations: This is an excellent way to visualize the correlation matrix
of multiple variables. It quickly shows which variables are strongly positively or
negatively correlated, which can be critical for multivariate analysis.
10. Distribution Plot of Global Active Power: Like the histogram, but with the added
benefit of a KDE (Kernel Density Estimate) curve. It provides a smoother view of the
distribution and is often more visually appealing for identifying distribution shapes.
11. Jointplot of Global Active Power and Voltage: A jointplot combines a scatterplot with
histograms (or KDEs) on the axes, offering a comprehensive view of the relationship
between two variables along with their distributions.

Optimal Visualization Techniques:

• Time Series Plot: Best for trend analysis over time.


• Heatmap of Correlations: Optimal for understanding relationships between multiple
variables.
• Scatterplot (or Jointplot): Best for examining the relationship between two variables.
• Boxplot: Effective for summarizing and comparing distributions, especially with
outliers.

These techniques provide the most insight depending on your analysis goals. For overall
analysis, the heatmap is often the most informative for seeing correlations, while time series
plots and scatterplots are excellent for specific trend and relationship explorations.
LAB ASSIGNMENT-3

Title: Implement simple and multi-linear regression to predict profits for a food truck. Compare
the performance of the model on linear and multi-linear regression.

Objective: The objective of this lab assignment is to implement simple and multi-linear
regression models to predict profits for a food truck business. By comparing the performance
of these two regression models, you will gain insights into when and how to use simple and
multi-linear regression techniques.

Dataset Format:

Population    Years in business    Profit
10000         5                    10000
15000         6                    12000
20000         6                    13000
9000          5                    12000
12000         4                    ?

Tasks:

1. Apply Simple Linear Regression.

2. Performance Evaluation (Simple Linear Regression).


3. Multi-Linear Regression.

4. Performance Evaluation (Multi-Linear Regression).

5. Model Comparison and Interpretation


• R-squared (R²): R² is a measure of how well the independent variables explain the
variability of the dependent variable. The R² value for Multiple Linear Regression
(0.5101) is slightly higher than that for Simple Linear Regression (0.4983),
indicating that the addition of "Years in Business" as an additional predictor slightly
improves the model's ability to explain profit variations.
• Mean Absolute Error (MAE) & Mean Squared Error (MSE): These metrics
indicate the error between predicted and actual values. While the MSE is slightly
lower for Multiple Linear Regression, the MAE is lower for Simple Linear
Regression, suggesting that, on average, the simple model may make smaller errors
but might perform worse on larger errors (since MSE penalizes larger errors more
heavily).

Conclusion: The difference in performance between the two models is relatively minor.
However, Multiple Linear Regression slightly outperforms Simple Linear Regression in
explaining the variability in profits (as indicated by the R² value). Adding "Years in
Business" as a predictor does provide some value, but the improvement is not very
significant in this case.
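
A minimal sketch of how such a comparison can be set up with scikit-learn; the dataset below is synthetic (the Population / Years-in-business / Profit relationship is assumed), so its metrics will differ from the values reported above:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

rng = np.random.default_rng(42)
n = 200
population = rng.uniform(5_000, 50_000, n)
years = rng.integers(1, 10, n)
profit = 0.5 * population + 800 * years + rng.normal(0, 3_000, n)  # assumed relationship

X_simple = population.reshape(-1, 1)             # one predictor (Population)
X_multi = np.column_stack([population, years])   # two predictors (Population, Years in business)
X_s_tr, X_s_te, X_m_tr, X_m_te, y_tr, y_te = train_test_split(
    X_simple, X_multi, profit, test_size=0.3, random_state=42)

for name, X_tr, X_te in [("Simple", X_s_tr, X_s_te), ("Multiple", X_m_tr, X_m_te)]:
    model = LinearRegression().fit(X_tr, y_tr)
    pred = model.predict(X_te)
    print(f"{name} LR  R2={r2_score(y_te, pred):.3f}  "
          f"MAE={mean_absolute_error(y_te, pred):.1f}  "
          f"MSE={mean_squared_error(y_te, pred):.1f}")
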
LAB ASSIGNMENT-4

Title: Implement the non-parametric Locally Weighted Regression algorithm in order to fit
data points. Select an appropriate data set for your experiment and draw graphs.

Objective: To fit data points by assigning different weights to each point based on its proximity
to the query point.

Dataset: Use a synthetic dataset with a sinusoidal pattern to showcase the capabilities of the
Locally Weighted Regression algorithm. You can generate the dataset with the following Python code.
import numpy as np

np.random.seed(0)
X = np.sort(5 * np.random.rand(80, 1), axis=0)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])

Tasks:

1) Generate a dataset.

2) Split each dataset into features (X) and target variable (y).

3) Implement the Locally Weighted Regression algorithm

4) Experiment using multiple query points across the range of the dataset.
5) Create a plot with the original dataset points and the fitted curves for different query
points and bandwidths.
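
A minimal sketch of tasks 1-5, implementing locally weighted regression with a Gaussian kernel on the synthetic sinusoidal data above (the bandwidth values tau are arbitrary choices):

import numpy as np
import matplotlib.pyplot as plt

# Tasks 1-2: synthetic sinusoidal dataset (same generator as above)
np.random.seed(0)
X = np.sort(5 * np.random.rand(80, 1), axis=0)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])

def locally_weighted_predict(x_query, X, y, tau):
    """Weighted least squares around x_query with a Gaussian kernel of bandwidth tau."""
    x_aug = np.column_stack([np.ones(len(X)), X.ravel()])        # add intercept column
    w = np.exp(-((X.ravel() - x_query) ** 2) / (2 * tau ** 2))   # per-point weights
    W = np.diag(w)
    # theta = (X^T W X)^-1 X^T W y, solved with lstsq for numerical stability
    theta, *_ = np.linalg.lstsq(x_aug.T @ W @ x_aug, x_aug.T @ W @ y, rcond=None)
    return np.array([1.0, x_query]) @ theta

# Tasks 4-5: evaluate at many query points for a few bandwidths and plot the fits
query_points = np.linspace(X.min(), X.max(), 200)
plt.scatter(X, y, s=15, color="gray", label="data")
for tau in (0.1, 0.3, 1.0):
    fitted = [locally_weighted_predict(xq, X, y, tau) for xq in query_points]
    plt.plot(query_points, fitted, label=f"tau={tau}")
plt.legend()
plt.title("Locally Weighted Regression fits")
plt.show()
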
LAB ASSIGNMENT-5

Title: For a given set of training data examples stored in a .CSV file, implement and
demonstrate various feature selection algorithms and compare the performance of the
algorithms.

Objective: The objective of this lab assignment is to implement and demonstrate various
feature selection algorithms on a given set of training data stored in a .CSV file. The goal is to
compare the performance of these algorithms in terms of improving model accuracy and
reducing dimensionality.

Dataset: Load a dataset of your choice or generate a synthetic dataset.

Tasks:

1) Implement and demonstrate the following feature selection algorithms:

➢ Univariate feature selection (e.g., SelectKBest with chi-squared or mutual
information scores)
➢ Recursive feature elimination (RFE)

➢ L1-based feature selection (Lasso regularization)

➢ Tree-based feature selection (Random Forest or XGBoost feature importance)

2) Visualize the performance metrics (e.g., accuracy) for each feature selection method
using appropriate plots (e.g., bar chart or line plot).
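
A minimal sketch of the comparison, assuming scikit-learn's breast-cancer dataset as the training data; each selector keeps k = 10 features (an arbitrary choice) and a logistic-regression model is refit on the selected subset:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE, SelectFromModel
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

k = 10  # number of features to keep (arbitrary)
selectors = {
    "SelectKBest (MI)": SelectKBest(mutual_info_classif, k=k),
    "RFE": RFE(LogisticRegression(max_iter=5000), n_features_to_select=k),
    "L1 (Lasso-style)": SelectFromModel(
        LogisticRegression(penalty="l1", solver="liblinear", C=0.1),
        threshold=-np.inf, max_features=k),
    "Tree importance": SelectFromModel(
        RandomForestClassifier(n_estimators=200, random_state=42),
        threshold=-np.inf, max_features=k),
}

scores = {}
for name, sel in selectors.items():
    X_tr_sel = sel.fit_transform(X_tr, y_tr)
    X_te_sel = sel.transform(X_te)
    clf = LogisticRegression(max_iter=5000).fit(X_tr_sel, y_tr)
    scores[name] = accuracy_score(y_te, clf.predict(X_te_sel))

plt.bar(scores.keys(), scores.values(), color="teal")
plt.ylabel("Test accuracy")
plt.xticks(rotation=20)
plt.title("Accuracy by feature selection method")
plt.show()
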
LAB ASSIGNMENT-6

Title: Apply different Machine Learning approaches for the classification task. Compare the
performance of the different ML approaches in terms of accuracy, precision, and recall.

Objective: The objective of this lab assignment is to apply various Machine Learning (ML)
approaches for a classification task and compare their performance in terms of accuracy,
precision, and recall. You will gain hands-on experience in implementing and evaluating
different ML algorithms, understanding their strengths and weaknesses, and interpreting their
results.

1. Data Loading and Exploration

2. Data Preprocessing

3. Data Splitting
4. Model Training

5. Model Evaluation

6. Visualization, Comparison and Interpretation
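
A minimal sketch of steps 1-6, assuming the Iris dataset as the classification task and four common classifiers; macro-averaged precision and recall are used because the target has three classes:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Steps 1-3: load, preprocess (standardize), and split
X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(),
}

# Steps 4-6: train each model and compare the metrics
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    print(f"{name:20s} accuracy={accuracy_score(y_te, pred):.3f} "
          f"precision={precision_score(y_te, pred, average='macro'):.3f} "
          f"recall={recall_score(y_te, pred, average='macro'):.3f}")
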


LAB ASSIGNMENT-7

Title: Train any machine learning classifier on the imbalanced dataset. Then balance the dataset
by using oversampling techniques. Compare the model performance before and after
oversampling.

Objective: In this lab assignment, you will work with an imbalanced dataset and train a
machine learning classifier on it. After that, you will apply oversampling techniques to balance
the dataset and compare the model's performance before and after oversampling. The goal is to
observe how oversampling affects the classifier's performance when dealing with imbalanced
data.

Tasks:

1. Load the dataset


2. Explore the dataset and analyze the class distribution to verify the imbalance.
3. Preprocess the data (handle missing values, convert categorical variables, etc.) if
necessary.

4. Split the dataset into training and testing sets.

5. Choose a machine learning classifier of your choice. For example, you can use Logistic
Regression, Random Forest, or Support Vector Machine (SVM).
6. Train the chosen classifier on the imbalanced dataset and evaluate its performance on
the test set.
7. Apply oversampling techniques (e.g., Random Oversampling, SMOTE - Synthetic
Minority Over-sampling Technique) to balance the dataset.
8. Train the same classifier on the balanced dataset obtained after oversampling and
evaluate its performance on the test set.

9. Compare the performance metrics (e.g., accuracy, precision, recall, F1-score) of the
classifier before and after oversampling.

10. Discuss your observations and insights into how oversampling affects the model's
performance on the imbalanced dataset.
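
A minimal sketch of tasks 1-9 on a synthetic imbalanced dataset, using SMOTE from the imbalanced-learn package (an extra dependency assumed to be installed):

from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE  # requires the imbalanced-learn package

# Synthetic 90/10 imbalanced dataset
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
print("Class distribution before oversampling:", Counter(y_tr))

# Baseline classifier trained on the imbalanced training set
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))

# Oversample the minority class with SMOTE, then retrain the same classifier
X_res, y_res = SMOTE(random_state=42).fit_resample(X_tr, y_tr)
print("Class distribution after SMOTE:", Counter(y_res))
clf_bal = LogisticRegression(max_iter=1000).fit(X_res, y_res)
print(classification_report(y_te, clf_bal.predict(X_te)))
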

Observations and Insights into How Oversampling Affects Model Performance:


1. Improved Recall for Minority Class: After applying oversampling techniques like
SMOTE, the model generally improves in recognizing instances of the minority class.
This is because oversampling creates synthetic data points for the underrepresented
class, leading the model to encounter more examples and thus better learn its patterns.
2. Balanced Precision and Recall: While oversampling often improves recall, it may
sometimes reduce precision slightly, as the model is exposed to more synthetic samples
that might not perfectly match real-world examples. However, overall, the F1-score—
a balance between precision and recall—tends to improve after oversampling,
indicating a better trade-off between the two metrics.
3. Reduced Bias: Before oversampling, models tend to be biased towards the majority
class due to the imbalance in data distribution. This results in high accuracy but poor
performance on the minority class. Oversampling reduces this bias, making the
classifier more balanced in its predictions, which is often reflected in improved
precision, recall, and F1-score.
4. Increased Training Time: Oversampling can increase the size of the training dataset,
which can result in longer training times. This is an important trade-off to consider,
especially for large datasets where training time is a critical factor.
5. Potential Overfitting: A downside of oversampling is that it can lead to overfitting,
especially with techniques like random oversampling where duplicates of the minority
class are created. The classifier might learn to overly rely on the repeated samples, thus
performing well on training data but not generalizing as effectively on unseen data.
Techniques like SMOTE can help mitigate this by generating synthetic data points
rather than just duplicates.

Overall, oversampling, when used correctly, helps improve the performance of models on
imbalanced datasets by providing a more representative learning environment, enabling the
classifier to better handle both classes.
LAB ASSIGNMENT-8

Title: Apply different feature selection approaches for the classification/regression task.
Compare the performance of the different feature selection approaches.

Objective: The objective of this lab assignment is to explore various feature selection
techniques for classification and regression tasks.

Dataset: Use the UCI Iris dataset for the classification task and the California Housing
dataset for the regression task.

Tasks:
# Load Libraries
from sklearn.datasets import load_iris, fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.feature_selection import SelectKBest, mutual_info_classif, f_regression, RFE
import pandas as pd
import numpy as np

# Load Iris dataset (classification)


iris = load_iris()
iris_df = pd.DataFrame(data=iris['data'], columns=iris['feature_names'])
iris_df['target'] = iris['target']

# Load California Housing dataset (regression)


california_housing = fetch_california_housing()
california_df = pd.DataFrame(data=california_housing['data'],
columns=california_housing['feature_names'])
california_df['target'] = california_housing['target']

# Preprocessing: Handle missing values (none in either dataset) and standardize features
scaler = StandardScaler()

# Preprocessing Iris dataset (Classification)


X_iris = scaler.fit_transform(iris_df.drop('target', axis=1))
y_iris = iris_df['target']

# Preprocessing California Housing dataset (Regression)


X_california = scaler.fit_transform(california_df.drop('target', axis=1))
y_california = california_df['target']
# Split datasets into training and testing sets
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42)
X_train_california, X_test_california, y_train_california, y_test_california = train_test_split(
    X_california, y_california, test_size=0.3, random_state=42)

# Feature Selection for Iris dataset (Classification)


## SelectKBest with mutual_info_classif
select_kbest_iris = SelectKBest(mutual_info_classif, k=2)
X_train_iris_kbest = select_kbest_iris.fit_transform(X_train_iris,
y_train_iris)
X_test_iris_kbest = select_kbest_iris.transform(X_test_iris)

# Train Logistic Regression using selected features


log_reg = LogisticRegression()
log_reg.fit(X_train_iris_kbest, y_train_iris)
y_pred_iris_kbest = log_reg.predict(X_test_iris_kbest)
accuracy_iris_kbest = accuracy_score(y_test_iris, y_pred_iris_kbest)

# Recursive Feature Elimination (RFE) for Iris dataset


rfe_iris = RFE(LogisticRegression(), n_features_to_select=2)
X_train_iris_rfe = rfe_iris.fit_transform(X_train_iris, y_train_iris)
X_test_iris_rfe = rfe_iris.transform(X_test_iris)

# Train Logistic Regression using RFE-selected features


log_reg.fit(X_train_iris_rfe, y_train_iris)
y_pred_iris_rfe = log_reg.predict(X_test_iris_rfe)
accuracy_iris_rfe = accuracy_score(y_test_iris, y_pred_iris_rfe)

# Feature Selection for California Housing dataset (Regression)


## SelectKBest with f_regression
select_kbest_california = SelectKBest(f_regression, k=5)
X_train_california_kbest = select_kbest_california.fit_transform(
    X_train_california, y_train_california)
X_test_california_kbest = select_kbest_california.transform(X_test_california)

# Train Linear Regression using selected features


lin_reg = LinearRegression()
lin_reg.fit(X_train_california_kbest, y_train_california)
y_pred_california_kbest = lin_reg.predict(X_test_california_kbest)
mse_california_kbest = mean_squared_error(y_test_california,
y_pred_california_kbest)

# Recursive Feature Elimination (RFE) for California Housing dataset


rfe_california = RFE(LinearRegression(), n_features_to_select=5)
X_train_california_rfe = rfe_california.fit_transform(X_train_california,
y_train_california)
X_test_california_rfe = rfe_california.transform(X_test_california)
# Train Linear Regression using RFE-selected features
lin_reg.fit(X_train_california_rfe, y_train_california)
y_pred_california_rfe = lin_reg.predict(X_test_california_rfe)
mse_california_rfe = mean_squared_error(y_test_california,
y_pred_california_rfe)

# Compare Performance
print("Iris Dataset (Classification)")
print(f"Accuracy with SelectKBest: {accuracy_iris_kbest}")
print(f"Accuracy with RFE: {accuracy_iris_rfe}")

print("\nCalifornia Housing Dataset (Regression)")


print(f"MSE with SelectKBest: {mse_california_kbest}")
print(f"MSE with RFE: {mse_california_rfe}")

1. Iris Dataset (Classification):


o Accuracy with SelectKBest: 1.0
o Accuracy with RFE (Recursive Feature Elimination): 1.0

This indicates that both feature selection methods—SelectKBest and RFE—achieved
perfect classification accuracy on the Iris dataset.

2. California Housing Dataset (Regression):


o MSE (Mean Squared Error) with SelectKBest: 0.5317358993623594
o MSE with RFE: 0.5432160285742256

For the regression task on the California Housing dataset, SelectKBest resulted in a
slightly lower MSE compared to RFE, suggesting that SelectKBest performed
marginally better in minimizing prediction errors.

Overall, both feature selection methods seem to work effectively, with SelectKBest showing a
slight edge in the regression task, while for the classification task, both methods achieved the
same level of performance.
LAB ASSIGNMENT-9

Title: Write a program to demonstrate the working of the decision tree-based CART algorithm.
Build the decision tree and classify a new sample using a suitable dataset. Compare the
performance with that of ID3, C4.5, and CART in terms of accuracy, recall, precision, and
sensitivity.

Objective: The objective of this lab assignment is to implement the decision tree-based
Classification and Regression Trees (CART) algorithm and compare its performance with other
decision tree algorithms, namely ID3 and C4.5, in terms of accuracy, recall, precision, and
sensitivity. The assignment includes building decision trees, classifying new samples, and
evaluating the models using a suitable dataset.

Dataset: Load a dataset of your choice or generate a synthetic dataset.

Tasks:

import numpy as np
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, precision_score, confusion_matrix

# 1. Generate a synthetic dataset


X, y = make_classification(n_samples=1000, n_features=10, n_informative=8,
n_classes=2, random_state=42)

# Split the dataset into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)

# 2. CART Implementation (Gini Index)

def gini_index(groups, classes):


# Count all samples
n_instances = sum([len(group) for group in groups])
# Gini calculation
gini = 0.0
for group in groups:
size = len(group)
if size == 0:
continue
score = 0.0
group_labels = [row[-1] for row in group]
for class_val in classes:
p = group_labels.count(class_val) / size
score += p * p
gini += (1.0 - score) * (size / n_instances)
return gini

def test_split(index, value, dataset):


left, right = [], []
for row in dataset:
if row[index] < value:
left.append(row)
else:
right.append(row)
return left, right

def get_split(dataset):
class_values = list(set(row[-1] for row in dataset))
b_index, b_value, b_score, b_groups = 999, 999, 999, None
for index in range(len(dataset[0]) - 1):
for row in dataset:
groups = test_split(index, row[index], dataset)
gini = gini_index(groups, class_values)
            if gini < b_score:
                b_index, b_value, b_score, b_groups = index, row[index], gini, groups
return {'index': b_index, 'value': b_value, 'groups': b_groups}

# Updated terminal node value


def to_terminal(group):
if not group: # If the group is empty
return None # Return None or handle it appropriately
outcomes = [row[-1] for row in group]
return Counter(outcomes).most_common(1)[0][0]

# Updated split function to handle empty groups


def split(node, max_depth, min_size, depth):
left, right = node['groups']
del(node['groups'])

# Check if there are no more splits possible


if not left or not right:
node['left'] = node['right'] = to_terminal(left + right)
return

# Check if maximum depth is reached


if depth >= max_depth:
node['left'], node['right'] = to_terminal(left), to_terminal(right)
return

# Process left child


if len(left) <= min_size:
node['left'] = to_terminal(left)
else:
node['left'] = get_split(left)
split(node['left'], max_depth, min_size, depth+1)

# Process right child


if len(right) <= min_size:
node['right'] = to_terminal(right)
else:
node['right'] = get_split(right)
split(node['right'], max_depth, min_size, depth+1)

def build_tree(train, max_depth, min_size):


root = get_split(train)
split(root, max_depth, min_size, 1)
return root

def predict(node, row):


if row[node['index']] < node['value']:
if isinstance(node['left'], dict):
return predict(node['left'], row)
else:
return node['left']
else:
if isinstance(node['right'], dict):
return predict(node['right'], row)
else:
return node['right']

# 3. Build CART model


train_data = np.column_stack((X_train, y_train))
cart_tree = build_tree(train_data, max_depth=5, min_size=10)
cart_predictions = [predict(cart_tree, row) for row in
np.column_stack((X_test, y_test))]

# 4. ID3 and C4.5 using scikit-learn

# ID3 using entropy


id3_tree = DecisionTreeClassifier(criterion='entropy', random_state=42)
id3_tree.fit(X_train, y_train)
id3_predictions = id3_tree.predict(X_test)

# C4.5: Using DecisionTreeClassifier with pruning (using min_samples_split)


c45_tree = DecisionTreeClassifier(criterion='entropy', min_samples_split=10,
random_state=42)
c45_tree.fit(X_train, y_train)
c45_predictions = c45_tree.predict(X_test)
# 5. Evaluation

def evaluate_performance(y_true, y_pred):


accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred, average='macro')
precision = precision_score(y_true, y_pred, average='macro')
cm = confusion_matrix(y_true, y_pred)
    sensitivity = cm[1, 1] / (cm[1, 1] + cm[1, 0]) if (cm[1, 1] + cm[1, 0]) != 0 else 0
    return accuracy, recall, precision, sensitivity

# Evaluate each model


cart_metrics = evaluate_performance(y_test, cart_predictions)
id3_metrics = evaluate_performance(y_test, id3_predictions)
c45_metrics = evaluate_performance(y_test, c45_predictions)

# Display Results
print(f"CART Metrics: Accuracy={cart_metrics[0]:.3f},
Recall={cart_metrics[1]:.3f}, Precision={cart_metrics[2]:.3f},
Sensitivity={cart_metrics[3]:.3f}")
print(f"ID3 Metrics: Accuracy={id3_metrics[0]:.3f},
Recall={id3_metrics[1]:.3f}, Precision={id3_metrics[2]:.3f},
Sensitivity={id3_metrics[3]:.3f}")
print(f"C4.5 Metrics: Accuracy={c45_metrics[0]:.3f},
Recall={c45_metrics[1]:.3f}, Precision={c45_metrics[2]:.3f},
Sensitivity={c45_metrics[3]:.3f}")

Insights:

• The ID3 model achieves the highest accuracy (0.790), precision, recall, and sensitivity
among the three.
• The CART model has slightly lower accuracy but shows a relatively high sensitivity
(0.839), indicating its ability to correctly identify positive instances.
• The C4.5 model performance is close to CART in terms of accuracy but has a lower
sensitivity.

Overall, ID3 appears to be the best-performing model in this comparison based on accuracy
and balanced precision-recall metrics.
LAB ASSIGNMENT-10

Title: Write a python program to implement K-Means clustering Algorithm.

Objective: The objective of this lab assignment is to implement the K-Means clustering
algorithm from scratch in Python and gain a deep understanding of how the algorithm works.

Dataset: Load a dataset of your choice or generate a synthetic dataset.

Tasks:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from scipy.spatial.distance import cdist

# 1. Generate synthetic dataset


def generate_data():
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60,
random_state=0)
return X

# 2. K-Means implementation from scratch


class KMeansScratch:
def __init__(self, k, max_iters=100):
self.k = k
self.max_iters = max_iters

def initialize_centroids(self, X):


# Randomly choose k points as initial centroids
indices = np.random.choice(X.shape[0], self.k, replace=False)
return X[indices]

def assign_clusters(self, X, centroids):


# Compute distances from points to centroids and assign clusters
distances = cdist(X, centroids, 'euclidean')
return np.argmin(distances, axis=1)

def update_centroids(self, X, labels):


# Compute new centroids as mean of points in each cluster
new_centroids = np.array([X[labels == i].mean(axis=0) for i in
range(self.k)])
return new_centroids

def fit(self, X):


centroids = self.initialize_centroids(X)
for _ in range(self.max_iters):
labels = self.assign_clusters(X, centroids)
new_centroids = self.update_centroids(X, labels)
if np.all(centroids == new_centroids):
break
centroids = new_centroids
return centroids, labels

# 3. Calculate inertia (sum of squared distances to the closest centroid)


def calculate_inertia(X, centroids, labels):
inertia = 0
for i in range(len(centroids)):
inertia += np.sum((X[labels == i] - centroids[i]) ** 2)
return inertia

# 4. Visualizing the dataset and initial centroids


def visualize_clusters(X, centroids, labels):
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='rainbow')
plt.scatter(centroids[:, 0], centroids[:, 1], s=300, c='black',
marker='x')
plt.title('K-Means Clustering')
plt.show()

# 5. Determining optimal K using elbow method


def elbow_method(X, max_k):
inertias = []
for k in range(1, max_k + 1):
kmeans = KMeansScratch(k)
centroids, labels = kmeans.fit(X)
inertia = calculate_inertia(X, centroids, labels)
inertias.append(inertia)

plt.plot(range(1, max_k + 1), inertias, 'bx-')


plt.xlabel('Number of clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method to Determine Optimal K')
plt.show()

# Main flow
X = generate_data()

# Visualize the dataset


plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title('Generated Dataset')
plt.show()

# K-Means with K=4


kmeans = KMeansScratch(k=4)
centroids, labels = kmeans.fit(X)

# Visualize the clusters and centroids


visualize_clusters(X, centroids, labels)

# Elbow method to find optimal K


elbow_method(X, max_k=10)
K-Means Clustering Theory

Overview: K-Means clustering is an unsupervised machine learning algorithm used to partition
a dataset into K clusters, where each data point belongs to the cluster with the nearest mean. It
is widely used for tasks like data segmentation, pattern recognition, and image compression.

How It Works: The K-Means algorithm aims to minimize the within-cluster sum of squares
(WCSS), also known as inertia, which is the sum of squared distances between data points and
their respective cluster centroids.
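
In symbols, with clusters C_1, ..., C_K, centroids mu_1, ..., mu_K, and C_i the set of points assigned to cluster i, the quantity being minimized is:

\text{WCSS} = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2

This is the same quantity computed by the calculate_inertia function in the code above.
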
LAB ASSIGNMENT-11

Title: Implement dimensionality reduction using the Principal Component Analysis (PCA)
method.

Objective: The objective of this lab assignment is to implement Dimensionality Reduction
using Principal Component Analysis (PCA) and gain hands-on experience in reducing the
dimensionality of a dataset while preserving its essential information.

Dataset: Load a dataset of your choice or generate a synthetic dataset.

Tasks:

1) Implement the PCA algorithm from scratch or using scikit-learn.

2) Perform data standardization (mean-centering and scaling) as a preprocessing step for
PCA.

3) Determine the number of principal components to retain.


4) Apply PCA to the preprocessed dataset and reduce its dimensionality.

5) Visualize the dataset before and after PCA using scatterplots or other appropriate
visualizations.
6) Evaluate the impact of dimensionality reduction on the dataset's performance in a
machine learning task (e.g., classification or regression).

7) Suggest possible use cases where PCA can be beneficial.

PCA (Principal Component Analysis) can be beneficial in various scenarios, particularly where
data dimensionality is high or complex. Here are some common use cases:

1. Data Visualization:
o PCA can reduce high-dimensional data to 2 or 3 principal components, making
it easier to visualize patterns or clusters in the data.
o This is particularly useful for exploring relationships and structures within
complex datasets.
2. Noise Reduction:
o PCA can help filter out noise by retaining only the components that carry
significant information, thus improving the quality of the data.
o This is often used in image processing and signal processing to enhance the
clarity of data.
3. Speeding Up Machine Learning Algorithms:
o By reducing the number of dimensions, PCA can speed up the training process
of machine learning models, especially when working with large datasets.
o This is crucial for algorithms like support vector machines (SVMs) or neural
networks that can become computationally expensive with high-dimensional
data.
4. Feature Selection:
o PCA can be used as a feature extraction method, allowing you to retain the most
informative features from the dataset.
o This helps in reducing overfitting and improving model generalization by
focusing on the most significant variables.
5. Multicollinearity Reduction:
o In datasets with highly correlated features, PCA can transform the features into
uncorrelated principal components.
o This is helpful in regression analysis to avoid issues like multicollinearity,
where predictors are linearly dependent on each other.
6. Anomaly Detection:
o PCA can be used to project data into a lower-dimensional space where
anomalies (outliers) are more distinguishable.
o This technique is often applied in fraud detection, network security, and
industrial quality control.
7. Image Compression:
o PCA is effective for reducing the size of images while preserving their essential
features, making it a common choice for image compression.
o This can help in scenarios where storage space is limited or when transferring
large datasets over the network.
8. Gene Expression Analysis:
o In bioinformatics, PCA is used to analyze high-dimensional gene expression
data, helping researchers identify patterns, group similar genes, and study
genetic variations.
o It simplifies the interpretation of genetic data and aids in understanding complex
biological processes.

These use cases highlight how PCA can be an essential tool for handling high-dimensional data
and improving the performance and interpretability of machine learning models.
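
A minimal sketch of tasks 1-6, assuming the Iris dataset and scikit-learn's PCA implementation:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)        # task 2: standardize

pca = PCA().fit(X_std)                           # task 3: inspect explained variance
print("Explained variance ratio:", pca.explained_variance_ratio_.round(3))

X_2d = PCA(n_components=2).fit_transform(X_std)  # task 4: reduce to 2 components

# Task 5: visualize the projected data
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="viridis")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Iris after PCA")
plt.show()

# Task 6: compare a classifier on original vs. reduced features
clf = LogisticRegression(max_iter=1000)
print("CV accuracy, all features :", cross_val_score(clf, X_std, y, cv=5).mean().round(3))
print("CV accuracy, 2 components:", cross_val_score(clf, X_2d, y, cv=5).mean().round(3))
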
LAB ASSIGNMENT-12

Title: Implement Artificial Neural Networks, specifically Multilayer Feedforward Neural
Networks.

Objective: The objective of this experiment is to demonstrate the ability of a multilayer
feedforward neural network (MLFFNN) in solving linearly inseparable pattern classification
problems.

Link: https://cse22-iiith.vlabs.ac.in/exp/forward-neural-networks/

Tasks:

1) Read Theory
A Multilayer Feedforward Neural Network (MLFFNN) with two or more hidden layers and
nonlinear units can handle complex pattern classification tasks. It adjusts weights to map input
vectors to desired outputs through backpropagation, which minimizes the error between
actual and desired outputs using gradient descent.

In backpropagation, errors are calculated at the output layer and propagated backward to
update the weights.

Backpropagation can update weights either after each input (pattern mode) or after all inputs
(batch mode). Its performance depends on factors like initial weights, learning rate, and data
presentation. While convergence is not guaranteed, adjustments and stopping criteria help
guide the process.
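
As a local counterpart to the virtual-lab simulation, a minimal sketch of a one-hidden-layer feedforward network solving the XOR problem (a classic linearly inseparable task) using scikit-learn's MLPClassifier; the layer sizes and solver here are arbitrary choices:

import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])  # XOR labels: not separable by any single line

mlp = MLPClassifier(hidden_layer_sizes=(4,), activation="tanh",
                    solver="lbfgs", max_iter=2000, random_state=0)
mlp.fit(X, y)
print("Predictions:", mlp.predict(X))  # expected [0 1 1 0] once training converges

# Varying hidden_layer_sizes, e.g. (2,), (8,), or (8, 8), illustrates the effect of
# layer and neuron counts discussed in the observations below.
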

2) Follow the Procedure


• This is a 3 layer MLFFNN with one hidden layer, one input layer, and one output
layer.
• Select the problem type and the number of nodes in the hidden layer, and click on
train MLFFNN.
• Now click on test MLFFNN to test the network and to see the results of pattern
classification.

3) Simulate it. Build and apply at least a 3-layer neural network.


4) Change the number of hidden layers and the number of neurons in a layer, and simulate it.
5) Write your observations on the effect of changes in layers and neurons.

Observations on Changes in Layers and Neurons

1. Impact of Hidden Layers:


o Increasing Layers: Adding more hidden layers can enhance the network's
capacity to learn complex patterns. However, it also increases the risk of
overfitting, especially if the dataset is small.
o Decreasing Layers: Reducing the number of hidden layers can lead to
underfitting, where the model fails to capture the underlying trends in the data.
2. Effect of Neurons in Hidden Layers:
o Increasing Neurons: More neurons per layer can improve the model's ability to
capture nuances in the data, leading to better performance in terms of accuracy.
However, this can also lead to longer training times and greater complexity in
the model.
o Decreasing Neurons: Fewer neurons might result in a simpler model, which can
be beneficial for generalization on smaller datasets. However, this might also
restrict the model's ability to learn sufficiently from complex datasets.
3. Performance Metrics:
o Accuracy: As the number of layers and neurons increases, accuracy tends to
improve up to a certain point. After reaching an optimal configuration, further
increases can lead to diminishing returns or even a decrease in accuracy due to
overfitting.
o Training and Validation Loss: The training loss generally decreases as the
complexity of the model increases. However, the validation loss might start
increasing if the model overfits, indicating that it is not generalizing well to
unseen data.
4. Training Time:
o Models with more layers and neurons require significantly more computational
resources and training time. This trade-off must be considered when designing
neural networks, especially in resource-constrained environments.
5. Activation Functions:
o The choice of activation function also influences how changes in layers and
neurons affect performance. For example, using ReLU (Rectified Linear Unit)
can help alleviate issues like the vanishing gradient problem in deeper networks,
allowing for better training dynamics.
6. Generalization:
o Finding the right balance between model complexity (layers and neurons) is
crucial for achieving good generalization. Techniques such as dropout, early
stopping, and regularization can help mitigate overfitting when increasing
model complexity.

Conclusion

The configuration of layers and neurons in an MLFFNN significantly affects the model's
performance and training dynamics. Striking the right balance is essential for achieving optimal
results, and experimenting with different architectures is vital to understanding how these
changes influence the network's ability to learn from data.

It's important to analyze the performance metrics and visualization plots (like loss curves) to
make informed decisions about model architecture.
