
G H RAISONI UNIVERSITY, AMRAVATI

SCHOOL OF ENGINEERING & TECHNOLOGY

Department of Computer Science & Engineering

Lab Manual

Subject: Machine Learning Algorithms


(UAIPR206)

Semester / Branch:
SEM-VI / BTECH CSE


NAME OF PROGRAM: B. TECH CSE

Name of Course: Machine Learning Algorithms Course Code: UAIPR206


Year/Semester: III / VI Session: 2024-25

Course Outcome: After completion of the course, students will be able to


CO1 : Understand modern notions in machine learning and computing
CO2 : Understand a wide variety of learning algorithms
CO3 : Be capable of confidently applying common Machine Learning algorithms in practice and implementing
their own
CO4 : Evaluate Machine Learning Models generated from data
CO5 : Apply the algorithms to a real problem, optimize the models learned and report on the expected accuracy
that can be achieved by applying the models

PRACTICAL LIST

Sr. No. List of Practical


1 Evaluate a regression model using the following metrics: 1. Mean Absolute Error 2. Mean Squared Error 3. R-squared (R²) Score
2 Implement a classification model on given dataset.
3 Implement a Linear Regression Model on given dataset.
4 Implement a Decision Tree on given dataset.
5 Identify Overfitting and Underfitting on given dataset.
6 Implement Logistic Regression on given dataset.
7 Implement Gaussian Naïve Bayes learning on given dataset.
8 Implement PCA Algorithm
9 Implement K-Means Clustering
10 Implement Gaussian Mixture Models

Practical Teachers: Dr. Mahip Bartere, Dr. Ajay Kumar, Prof. Sneha Bohra
HOD: Dr. Amit Gaikwad

Practical No 1
Aim: Write a program to evaluate a regression model using the following metrics:
1. Mean Absolute Error
2. Mean Squared Error
3. R-squared (R²) Score
Software Required: Python, VS Code or Google Colab
Theory: Regression algorithms are used to predict continuous numerical values based on input features.
Types of Regression Metrics
Some common regression metrics are:
1. Mean Absolute Error (MAE)
2. Mean Squared Error (MSE)
3. R-squared (R²) Score

Mean Absolute Error (MAE): In the fields of statistics and machine learning, the Mean Absolute Error (MAE) is a frequently employed metric. It measures the average absolute difference between a dataset's actual values and its predicted values.
Mathematical Formula
The formula to calculate MAE for a dataset with "n" data points is:

MAE = \frac{1}{n} \sum_{i=1}^{n} |x_i - y_i|

Where:
- x_i represents the actual or observed value for the i-th data point.
- y_i represents the predicted value for the i-th data point.

Mean Squared Error (MSE): A popular metric in statistics and machine learning is the Mean Squared Error (MSE). It measures the average of the squared differences between a dataset's actual values and its predicted values. MSE is frequently utilized in regression problems and is used to assess how well predictive models work.
Mathematical Formula
For a dataset containing "n" data points, the MSE calculation formula is:

MSE = \frac{1}{n} \sum_{i=1}^{n} (x_i - y_i)^2

where:
- x_i represents the actual or observed value for the i-th data point.
- y_i represents the predicted value for the i-th data point.

R-squared (R²) Score: A statistical metric frequently used to assess the goodness of fit of a regression model is the R-squared (R²) score, also referred to as the coefficient of determination. It quantifies the proportion of the variance in the dependent variable that is explained by the model's independent variables, which makes it a useful statistic for evaluating the overall effectiveness and explanatory power of a regression model.
Mathematical Formula
The formula to calculate the R-squared score is as follows:

R^2 = 1 - \frac{SSR}{SST}

Where:
- R² is the R-squared score.
- SSR represents the sum of squared residuals between the predicted values and actual values.
- SST represents the total sum of squares, which measures the total variance in the dependent variable.
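
To connect these formulas with code, the snippet below (a minimal sketch using NumPy, separate from the program that follows) computes all three metrics by hand on the same sample values used in the program; the results match the scikit-learn output shown later.

import numpy as np

# Actual and predicted values (the same sample values used in the program below)
x = np.array([2.5, 3.7, 1.8, 4.0, 5.2])   # actual values x_i
y = np.array([2.1, 3.9, 1.7, 3.8, 5.0])   # predicted values y_i

mae = np.mean(np.abs(x - y))          # MAE = (1/n) * sum |x_i - y_i|
mse = np.mean((x - y) ** 2)           # MSE = (1/n) * sum (x_i - y_i)^2
ssr = np.sum((x - y) ** 2)            # SSR: sum of squared residuals
sst = np.sum((x - np.mean(x)) ** 2)   # SST: total sum of squares
r2 = 1 - ssr / sst                    # R^2 = 1 - SSR/SST

print("MAE:", mae)   # 0.22
print("MSE:", mse)   # 0.058
print("R2 :", r2)    # ~0.9589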

Program:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

true_values = [2.5, 3.7, 1.8, 4.0, 5.2]
predicted_values = [2.1, 3.9, 1.7, 3.8, 5.0]

mae = mean_absolute_error(true_values, predicted_values)
print("Mean Absolute Error:", mae)

mse = mean_squared_error(true_values, predicted_values)
print("Mean Squared Error:", mse)

# Store the result in a new variable so it does not shadow the imported r2_score function
r2 = r2_score(true_values, predicted_values)
print("R2-Square:", r2)

Output:
Mean Absolute Error: 0.22000000000000003
Mean Squared Error: 0.057999999999999996
R2-Square: 0.9588769143505389

Conclusion: Thus, we have successfully evaluated the regression model using various metrics.
Practical No 2
Aim: Implement and evaluate a classification model on given dataset.
Problem Statement: Use the Iris dataset to implement KNN classification.
Software Required: Python, VS Code or Google Colab
Theory: Classification is a type of supervised learning in machine learning, where the goal is to
predict the categorical label (or class) of a given input based on labeled training data. In other
words, the model is trained on a dataset where the input data is associated with a predefined class,
and it learns to map the input features to these classes. After training, the model can then classify
new, unseen data into one of the predefined classes.
Types of Classification
1. Binary Classification:
o A classification problem where the task is to classify the input data into one of two
classes.
o Example: Predicting whether an email is "spam" or "not spam."
2. Multiclass Classification:
o A classification problem where the task is to classify the input data into more than
two classes.
o Example: Predicting the type of flower (Iris dataset) into categories like "setosa,"
"versicolor," or "virginica."
3. Multilabel Classification:
o A problem where each data point can be assigned multiple labels simultaneously.
o Example: Predicting which genres a movie belongs to (action, comedy, drama, etc.).
A single movie can belong to multiple genres.

Common Classification Algorithms:


1. Logistic Regression
2. K-Nearest Neighbors (KNN)
3. Support Vector Machines (SVM)
4. Decision Trees
5. Random Forest
6. Naive Bayes
7. Neural Networks

Evaluation Metrics for Classification: To assess the performance of classification models, various
metrics are used, depending on the nature of the task and the class distribution:
1. Accuracy
2. Precision
3. Recall (Sensitivity)
4. F1 Score
5. Confusion Matrix
6. Area Under the ROC Curve (AUC-ROC)
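
As a brief illustration of how precision and recall relate to the confusion matrix (a minimal sketch with hypothetical binary labels, independent of the Iris program below):

from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical true and predicted labels for a binary problem (1 = positive class)
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Precision = TP / (TP + FP) =", tp / (tp + fp))   # 0.8
print("Recall    = TP / (TP + FN) =", tp / (tp + fn))   # 0.8
print("F1 Score  =", f1_score(y_true, y_pred))          # 0.8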

Program:
# Import necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the Iris dataset
iris = load_iris()
X = iris.data    # Features
y = iris.target  # Target labels

# Convert the features into a DataFrame to display the features with column names
iris_df = pd.DataFrame(X, columns=iris.feature_names)

# Add the target variable 'species' to the DataFrame
iris_df['species'] = iris.target_names[y]  # Map target values to species names

# Display the first few rows of the dataset (features)
print("Iris Dataset Features:")
print(iris_df.head())  # Displaying the first 5 rows

# Plot pairwise scatter plots (using pairplot from seaborn)
sns.pairplot(iris_df, hue="species", palette="Set2", plot_kws={'alpha': 0.5})
plt.suptitle('Pairwise Scatter Plots of Iris Dataset', y=1.02)
plt.show()

# Plot correlation heatmap (numerical features only)
corr_matrix = iris_df.drop(columns=['species']).corr()
plt.figure(figsize=(6, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Heatmap of Iris Dataset')
plt.show()

# Split the dataset into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature scaling (standardize the data)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Initialize the KNN classifier with k=3
knn = KNeighborsClassifier(n_neighbors=3)

# Train the model
knn.fit(X_train, y_train)

# Make predictions on the test set
y_pred = knn.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy: {accuracy:.2f}")

# Print the classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Print the confusion matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Output:
Iris Dataset Features:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm) species
0                5.1               3.5                1.4               0.2  setosa
1                4.9               3.0                1.4               0.2  setosa
2                4.7               3.2                1.3               0.2  setosa
3                4.6               3.1                1.5               0.2  setosa
4                5.0               3.6                1.4               0.2  setosa

Accuracy: 1.00

Classification Report:
precision recall f1-score support

0 1.00 1.00 1.00 10


1 1.00 1.00 1.00 9
2 1.00 1.00 1.00 11

accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30

Confusion Matrix:
[[10 0 0]
[ 0 9 0]
[ 0 0 11]]

Conclusion: Thus, we have successfully implemented Classification Model.


Practical No 3
Aim: Implement a Linear Regression Model on given dataset.
Software Required: Python, VS Code or Google Colab
Program:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Step 1: Load the California Housing dataset
data = fetch_california_housing()
X = data.data[:, :1]  # Use just the first feature for simplicity in visualization
y = data.target

# Step 2: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 3: Initialize the Linear Regression model
model = LinearRegression()

# Step 4: Train the model using the training data
model.fit(X_train, y_train)

# Step 5: Make predictions using the trained model
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Step 6: Evaluate the model performance using Mean Squared Error (MSE) and R-squared (R²)
train_mse = mean_squared_error(y_train, y_train_pred)
test_mse = mean_squared_error(y_test, y_test_pred)
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)

# Print the results
print(f"Training Mean Squared Error: {train_mse:.4f}")
print(f"Test Mean Squared Error: {test_mse:.4f}")
print(f"Training R²: {train_r2:.4f}")
print(f"Test R²: {test_r2:.4f}")

# Step 7: Visualize the results
plt.figure(figsize=(8, 6))

# Plot the training data and the model's predictions
plt.scatter(X_train, y_train, color='blue', label='Training Data', alpha=0.5)
plt.plot(X_train, y_train_pred, color='red', label='Linear Regression Line (Training)', linewidth=2)

# Plot the testing data and the model's predictions
plt.scatter(X_test, y_test, color='green', label='Testing Data', alpha=0.5)
plt.plot(X_test, y_test_pred, color='orange', label='Linear Regression Line (Testing)', linestyle='--', linewidth=2)

plt.title('Linear Regression on California Housing Dataset')
plt.xlabel('Feature 1: Median Income')
plt.ylabel('Median House Value')
plt.legend()
plt.show()

Output:
Training Mean Squared Error: 0.7051
Test Mean Squared Error: 0.6918
Training R²: 0.4737
Test R²: 0.4729

Conclusion: Thus, we have successfully implemented Linear Regression.


Practical No 4
Aim: Implement a Decision tree on given dataset.
Problem Statement: Use Iris Dataset to implement Decision Tree.
Software Required: Python, VS Code or Google Colab
Theory: A Decision Tree is a popular and powerful algorithm used in machine learning for both
classification and regression tasks. It is a supervised learning method that recursively splits the
dataset into subsets based on the features, ultimately creating a tree-like model of decisions. The
goal is to make predictions by following paths from the root of the tree to a leaf node based on the
input features.
In a classification task, a decision tree predicts a class label for a given input, while in a regression
task, it predicts a continuous value.
Key Concepts in Decision Trees
1. Root Node: The root node represents the entire dataset, which is then recursively split into
subgroups based on the feature that results in the most effective division. The root node contains
the whole dataset before any decisions are made.
2. Internal Nodes: Each internal node represents a decision or a test on one of the features. Based on the outcome of the test, the dataset is split into two or more branches.
3. Branches: Branches represent the outcome of a test (decision) at each internal node. Each branch connects a node to its child node.
4. Leaf Nodes (Terminal Nodes): Leaf nodes represent the final decision or outcome. In classification tasks, they contain the predicted class label. In regression tasks, they contain the predicted continuous value.
5. Splitting: The process of dividing the dataset into subsets based on a specific feature and its corresponding threshold or category. The goal is to create pure nodes, where the data points in each node are as similar as possible with respect to the target variable.
6. Pruning: Pruning is the process of removing branches from the decision tree after it has been grown to its full size. This helps to prevent overfitting and ensures that the model generalizes well to new data.
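
To make the idea of a "pure" split concrete, the short sketch below computes the Gini impurity of a hypothetical parent node and of a candidate split. Gini impurity is the default splitting criterion in scikit-learn's DecisionTreeClassifier, though the theory above does not prescribe one specific measure, so this is only an illustration.

import numpy as np

def gini_impurity(labels):
    # Gini impurity of a node: 1 - sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# Hypothetical parent node containing two classes
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
# A candidate split that produces two child nodes
left = np.array([0, 0, 0, 1])
right = np.array([0, 1, 1, 1])

# Weighted impurity after the split; a lower value means purer (better) children
weighted = (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / len(parent)
print("Parent impurity:        ", gini_impurity(parent))  # 0.5
print("Weighted child impurity:", weighted)               # 0.375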

Program:
# Importing the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

# Loading the dataset
iris = load_iris()

# Converting the data to a pandas DataFrame
data = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Creating a separate column for the target variable of the iris dataset
data['Species'] = iris.target

# Replacing the categories of the target variable with the actual species names
target = np.unique(iris.target)
target_n = np.unique(iris.target_names)
target_dict = dict(zip(target, target_n))
data['Species'] = data['Species'].replace(target_dict)

# Separating the independent and dependent variables of the dataset
x = data.drop(columns="Species")
y = data["Species"]
names_features = x.columns
target_labels = y.unique()

# Splitting the dataset into training and testing datasets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=93)

# Creating an instance of the Decision Tree classifier
dtc = DecisionTreeClassifier(max_depth=3, random_state=93)

# Fitting the training dataset to the model
dtc.fit(x_train, y_train)

# Plotting the Decision Tree
plt.figure(figsize=(30, 10), facecolor='b')
tree.plot_tree(dtc, feature_names=names_features, class_names=target_labels, rounded=True, filled=True, fontsize=14)
plt.show()

# Making predictions on the test set
y_pred = dtc.predict(x_test)

# Finding the confusion matrix
conf_matrix = metrics.confusion_matrix(y_test, y_pred)
matrix = pd.DataFrame(conf_matrix)

# Plotting the confusion matrix as a heatmap
plt.figure(figsize=(10, 7))
sns.set(font_scale=1.3)
axis = sns.heatmap(matrix, annot=True, fmt="g", cmap="magma")
axis.set_title('Confusion Matrix')
axis.set_xlabel("Predicted Values", fontsize=10)
axis.set_xticklabels(target_labels)
axis.set_ylabel("True Labels", fontsize=10)
axis.set_yticklabels(target_labels, rotation=0)
plt.show()
Output:

Conclusion: Thus, we have successfully implemented Decision Tree.


Practical No 5
Aim: Identify Overfitting and Underfitting on given dataset.
Problem Statement: Use the California Housing dataset with polynomial regression models of different degrees to identify overfitting and underfitting.
Software Required: Python, VS Code or Google Colab
Theory: In machine learning, the terms overfitting and underfitting refer to the model's ability
to generalize well to unseen data. These are key concepts when evaluating how well a model
fits a dataset, and they are associated with bias-variance trade-off.
Overfitting: Overfitting occurs when a machine learning model learns not only the underlying
patterns in the training data but also the noise and random fluctuations that do not generalize
well to new, unseen data. Essentially, the model becomes too complex and too tailored to the
training data.
Characteristics of Overfitting:
1. High variance, low bias
2. Model Complexity
3. Low training error, high test error
Why Does Overfitting Happen?
1. Too many features
2. Excessive model complexity
3. Insufficient data
Signs of Overfitting:
1. Good performance on training data, poor performance on test data.
2. The model is very sensitive to small changes in the input data.
3. A very complex model (e.g., deep decision trees or a high-degree polynomial in
regression) that closely tracks the training data.
How to Avoid Overfitting?
1. Simplify the model
2. Regularization
3. Cross-validation
4. Increase the dataset size
5. Early stopping

Underfitting: Underfitting occurs when a model is too simple to capture the underlying patterns
of the data. It is characterized by both high bias and low variance. The model is not complex
enough to learn the relationships in the data, resulting in poor performance on both the training
and test datasets.

Characteristics of Underfitting:
1. High bias, low variance
2. Model Complexity
3. High training error, high test error
Why Does Underfitting Happen?
1. Too few features
2. Model too simple
3. Overly strong assumptions

Signs of Underfitting:
1. Poor performance on both training and test data.
2. The model is too simple and doesn't have the capacity to learn the data patterns.
3. The model may consistently make the same error (e.g., using a linear regression on non-
linear data).
How to Avoid Underfitting?
1. Increase model complexity
2. Use more relevant features
3. Reduce regularization
4. Tune hyperparameters
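
The symptoms described above can be made visible with a small experiment (a minimal sketch on a hypothetical synthetic dataset, separate from the California Housing program below): as the polynomial degree grows, the training error keeps falling while the test error eventually rises again.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Hypothetical noisy samples from a sine curve: with few points, a flexible model can memorize noise
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=30)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in [1, 4, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"Degree {degree:2d}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")

# Typically, degree 1 underfits (both errors high), degree 4 fits well,
# and degree 15 overfits (training error far below test error).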

Program:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

# Step 1: Load the California Housing dataset
data = fetch_california_housing()
X = data.data[:, :1]  # Using just one feature for simplicity in visualization
y = data.target

# Step 2: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Function to train and evaluate the model with different polynomial degrees
def plot_overfitting_underfitting(degree):
    # Step 3: Create polynomial features
    poly = PolynomialFeatures(degree)
    X_poly_train = poly.fit_transform(X_train)
    X_poly_test = poly.transform(X_test)

    # Step 4: Train the Linear Regression model
    model = LinearRegression()
    model.fit(X_poly_train, y_train)

    # Step 5: Make predictions
    y_train_pred = model.predict(X_poly_train)
    y_test_pred = model.predict(X_poly_test)

    # Step 6: Calculate Mean Squared Error
    train_mse = mean_squared_error(y_train, y_train_pred)
    test_mse = mean_squared_error(y_test, y_test_pred)
    print(f"Degree {degree} -> Train MSE: {train_mse:.4f}, Test MSE: {test_mse:.4f}")

    # Plot the results (showing the fit over the data)
    plt.figure(figsize=(8, 6))
    X_range = np.linspace(X.min(), X.max(), 1000).reshape(-1, 1)
    X_poly_range = poly.transform(X_range)
    y_range_pred = model.predict(X_poly_range)

    plt.scatter(X, y, color='gray', label='Data points')
    plt.plot(X_range, y_range_pred, label=f'Polynomial Degree {degree}', color='red')
    plt.title(f'Overfitting and Underfitting (Degree {degree})')
    plt.xlabel('Feature 1 (Simplified for Visualization)')
    plt.ylabel('Target (Median House Value)')
    plt.legend()
    plt.show()

# Step 7: Test different polynomial degrees
for degree in [1, 3]:
    plot_overfitting_underfitting(degree)

Output:

Conclusion: Thus, we have successfully identified over fitting and under fitting.
Practical No 6
Aim: Implement a Logistic Regression Model on given dataset.
Problem Statement: Use the Diabetes dataset to implement and evaluate Logistic Regression.
Software Required: Python, VS Code or Google Colab
Theory: Logistic Regression is a statistical model for binary classification. It models the likelihood that an instance belongs to a particular class: a linear combination of the input features is passed through the sigmoid function, which restricts predictions to values between 0 and 1. The model's coefficients are optimized with methods such as gradient descent to minimize the log loss, and the fitted coefficients define the decision boundary that separates the two classes. Logistic Regression is widely used in machine learning for problems with binary outcomes because it is simple, interpretable, and effective across many domains. Overfitting can be reduced by adding regularization, which also improves generalization.
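
To make the sigmoid step concrete (a minimal sketch with hypothetical coefficients, not taken from the trained model in the program below):

import numpy as np

def sigmoid(z):
    # Map a linear score z to a probability between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, intercept, and one input instance with two features
w = np.array([0.8, -0.5])   # feature coefficients
b = 0.1                     # intercept
x = np.array([2.0, 1.0])    # input features

z = np.dot(w, x) + b        # linear combination: w·x + b
p = sigmoid(z)              # predicted probability of the positive class
print(f"z = {z:.2f}, P(y=1|x) = {p:.3f}")   # class 1 is predicted when p >= 0.5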
Program:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, auc

# Load the diabetes dataset
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

# Convert the target variable to binary (1 for diabetes, 0 for no diabetes)
y_binary = (y > np.median(y)).astype(int)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=0.2, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train the Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate the model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Output:
Accuracy: 73.03%
Confusion Matrix:
[[36 13]
[11 29]]

Classification Report:
precision recall f1-score support

0 0.77 0.73 0.75 49


1 0.69 0.72 0.71 40

accuracy 0.73 89
macro avg 0.73 0.73 0.73 89
weighted avg 0.73 0.73 0.73 89

Conclusion: Thus, we have successfully implemented Logistic Regression Model.



Practical No 7
Aim: Implement Gaussian Naïve Bayes learning on given dataset.
Problem Statement: Use Wine Dataset to implement Gaussian Naïve Bayes.
Software Required: Python, VS Code or Google Colab
Program:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sns

# Step 1: Load the Wine dataset
wine = datasets.load_wine()
X = wine.data    # Features (13 chemical properties of wines)
y = wine.target  # Target labels (3 classes of wines)

# Step 2: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 3: Initialize and train the Gaussian Naive Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Step 4: Make predictions on the test set
y_pred = gnb.predict(X_test)

# Step 5: Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

# Print the accuracy and confusion matrix
print(f"Accuracy: {accuracy * 100:.2f}%")
print("Confusion Matrix:")
print(conf_matrix)

# Step 6: Visualize the confusion matrix using a heatmap
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=wine.target_names, yticklabels=wine.target_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix for Gaussian Naive Bayes (Wine Dataset)')
plt.show()
Output:
Accuracy: 100.00%
Confusion Matrix:
[[19 0 0]
[ 0 21 0]
[ 0 0 14]]

Conclusion: Thus, we have successfully implemented Gaussian Naïve Bayes.



Practical No 8
Aim: Implement PCA Algorithm
Problem Statement: Use Breast Cancer Dataset to implement PCA Algorithm.
Software Required: Python, VS Code or Google Colab
Theory: Principal Component Analysis (PCA) is a dimensionality reduction technique widely
used in data analysis, machine learning, and statistics. PCA transforms the data into a new
coordinate system, where the axes are the principal components that capture the maximum
variance in the data. This technique is essential for simplifying data, improving computational
efficiency, and enhancing visualization, especially in high-dimensional spaces.
Objective of PCA: The primary goal of PCA is to reduce the dimensionality of a dataset while
retaining as much variance (information) as possible. It achieves this by projecting the original
data onto a new set of orthogonal axes called principal components. These components are
ordered in such a way that the first principal component captures the maximum variance in the
data, followed by the second, and so on.
Why Use PCA?
- Dimensionality Reduction: PCA helps in reducing the number of features (dimensions) in a dataset while preserving the most important patterns in the data. This is especially useful for visualizing high-dimensional data (e.g., reducing from 30 features to 2 or 3 for visualization).
- Noise Reduction: By discarding components with low variance, PCA can help in eliminating noisy features that might not contribute to the underlying structure of the data.
- Improving Model Performance: Reducing the number of features can lead to better performance for machine learning algorithms, particularly for models that struggle with high-dimensional data.
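
The program below implements PCA by hand through the covariance matrix and its eigenvectors. As a cross-check (a minimal sketch, assuming the same standardized Breast Cancer features), scikit-learn's PCA class should report essentially the same explained-variance ratios:

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = load_breast_cancer().data
X_standardized = StandardScaler().fit_transform(X)

# Keep the top 2 principal components, as in the manual implementation below
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_standardized)

print("Explained Variance Ratio:", pca.explained_variance_ratio_)
# Expected to be close to the manually computed values (~0.4427 and ~0.1897)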

Program:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Step 1: Load the Breast Cancer dataset
cancer = load_breast_cancer()
X = cancer.data    # Features
y = cancer.target  # Target labels

# Step 2: Standardize the data
scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)

# Step 3: Compute the covariance matrix
cov_matrix = np.cov(X_standardized.T)

# Step 4: Calculate the eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)

# Step 5: Sort the eigenvalues and eigenvectors in descending order of eigenvalue
eigenvalue_index = np.argsort(eigenvalues)[::-1]
sorted_eigenvectors = eigenvectors[:, eigenvalue_index]
sorted_eigenvalues = eigenvalues[eigenvalue_index]

# Step 6: Select the top k eigenvectors (for 2D visualization, k=2)
k = 2
eigenvector_subset = sorted_eigenvectors[:, :k]

# Step 7: Project the data onto the new basis (principal components)
X_pca = X_standardized.dot(eigenvector_subset)

# Step 8: Visualize the projected data (2D plot)
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='coolwarm', edgecolor='k', s=100)
plt.title('PCA - Breast Cancer Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.colorbar(label='Malignant (1) or Benign (0)')
plt.show()

# Step 9: Print the explained variance ratio (how much variance each component captures)
explained_variance = sorted_eigenvalues / np.sum(sorted_eigenvalues)
print("Explained Variance Ratio:", explained_variance[:k])

Output:
Explained Variance Ratio: [0.44272026 0.18971182]

Conclusion: Thus, we have successfully implemented PCA.


Practical No 9
Aim: Implement K-Means Clustering
Problem Statement: Use Wine Dataset to implement K-Means Clustering.
Software Required: Python, VS Code or Google Colab
Theory: K-Means Clustering is one of the most widely used unsupervised learning algorithms
for clustering data. It is a partition-based clustering algorithm that divides a dataset into K
clusters based on feature similarities. The main idea behind K-Means is to assign each data
point to one of the K clusters and update the cluster centers (centroids) iteratively until
convergence.
Objective of K-Means Clustering: The objective of the K-Means algorithm is to partition a
given set of data points into K clusters such that
1. Points within each cluster are more similar to each other than to those in other clusters.
2. Each cluster is represented by its centroid, which is the mean of all points within that
cluster.
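
The program below fixes K = 3 because the Wine data contains three cultivars. When K is not known in advance, a common heuristic is the elbow method (a minimal sketch using the same standardized Wine features as the program below): plot the inertia for a range of K values and pick the point where it stops dropping sharply.

import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X = StandardScaler().fit_transform(load_wine().data)

# Fit K-Means for several values of K and record the inertia (sum of squared distances)
inertias = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X)
    inertias.append(km.inertia_)

plt.plot(k_values, inertias, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('Inertia')
plt.title('Elbow Method for Choosing K (Wine Dataset)')
plt.show()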
Program:
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Step 1: Load the Wine dataset
wine = load_wine()
X = wine.data    # Features
y = wine.target  # Target labels (not used in clustering)

# Step 2: Standardize the data
scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)

# Step 3: Apply K-Means clustering
kmeans = KMeans(n_clusters=3, random_state=42)  # 3 clusters, since there are 3 wine types
kmeans.fit(X_standardized)

# Step 4: Get the cluster labels and centroids
labels = kmeans.labels_              # Cluster labels for each data point
centroids = kmeans.cluster_centers_  # Centroids of each cluster

# Step 5: Visualize the clusters (using the first two features for 2D visualization)
plt.figure(figsize=(8, 6))
plt.scatter(X_standardized[:, 0], X_standardized[:, 1], c=labels, cmap='viridis', edgecolor='k', s=100)
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='X', s=200, label='Centroids')
plt.title('K-Means Clustering on Wine Dataset')
plt.xlabel('Standardized Alcohol Content')
plt.ylabel('Standardized Malic Acid')
plt.legend()
plt.show()

# Step 6: Print the cluster centroids
print("Cluster Centroids (after K-Means clustering):")
print(centroids)

# Step 7: Optional: Evaluate the clustering (e.g., using inertia or silhouette score)
inertia = kmeans.inertia_
print("Inertia (Sum of squared distances to centroids):", inertia)

Output:
Cluster Centroids (after K-Means clustering):
[[-0.92607185 -0.39404154 -0.49451676 0.17060184 -0.49171185
-0.07598265 0.02081257 -0.03353357 0.0582655 -0.90191402
0.46180361 0.27076419 -0.75384618]
[ 0.16490746 0.87154706 0.18689833 0.52436746 -0.07547277
-0.97933029 -1.21524764 0.72606354 -0.77970639 0.94153874
-1.16478865 -1.29241163 -0.40708796]
[ 0.83523208 -0.30380968 0.36470604 -0.61019129 0.5775868
0.88523736 0.97781956 -0.56208965 0.58028658 0.17106348
0.47398365 0.77924711 1.12518529]]
Inertia (Sum of squared distances to centroids):
1277.928488844642

Conclusion: Thus, we have successfully implemented K-Means Clustering.



Practical No 10
Aim: Implement Gaussian Mixture Models
Problem Statement: Use Digit Dataset to implement Gaussian Mixture Models.
Software Required: Python, VS Code or Google Colab

Program:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import PCA

# Step 1: Load the Digits dataset
digits = load_digits()
X = digits.data    # Features (pixel values of the images)
y = digits.target  # Labels (the actual digits, not used for clustering)

# Step 2: Standardize the data
scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)

# Step 3: Fit the Gaussian Mixture Model (GMM)
# Assume 10 clusters, because there are 10 digits (0-9)
gmm = GaussianMixture(n_components=10, random_state=42)
gmm.fit(X_standardized)

# Step 4: Predict the cluster labels
labels = gmm.predict(X_standardized)

# Step 5: Visualize the GMM clustering result (use PCA to reduce dimensionality to 2D)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_standardized)

plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis', edgecolor='k', s=100)
plt.title('Gaussian Mixture Model Clustering (Digits Dataset)')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.colorbar(label='Cluster Label')
plt.show()

# Step 6: Print the GMM means (centroids) and covariances (cluster shapes)
print("Cluster Means (Centroids):")
print(gmm.means_)

print("\nCovariances (Cluster Shapes):")
print(gmm.covariances_)

# Step 7: Log-Likelihood (for model evaluation)
log_likelihood = gmm.score(X_standardized)
print("\nLog-Likelihood of the model:", log_likelihood)

Output:

Conclusion: Thus, we have successfully implemented Gaussian Mixture Models.
