MLA Manual
Lab Manual
Semester / Branch:
SEM-VI / BTECH CSE
PRACTICAL LIST
Practical No 1
Aim: Write a program to evaluate a regression model using the following metrics:
1. Mean Absolute Error
2. Mean Squared Error
3. R-squared (R²) Score
Software Required: Python, VS Code or Google Colab
Theory: Regression algorithms are supervised learning algorithms used to predict continuous
numerical values based on input features.
Types of Regression Metrics
Some common regression metrics are:
1. Mean Absolute Error (MAE)
2. Mean Squared Error (MSE)
3. R-squared (R²) Score
Mean Absolute Error (MAE): In the fields of statistics and machine learning, the Mean
Absolute Error (MAE) is a frequently employed metric. It measures the average absolute
difference between a dataset's actual values and its predicted values.
Mathematical Formula
The formula to calculate MAE for a dataset with "n" data points is:
$MAE = \frac{1}{n}\sum_{i=1}^{n} |x_i - y_i|$

Where:
$x_i$ represents the actual or observed value for the i-th data point.
$y_i$ represents the predicted value for the i-th data point.
Mean Squared Error (MSE): A popular metric in statistics and machine learning is the Mean
Squared Error (MSE). It measures the average of the squared differences between a dataset's
actual values and predicted values. MSE is frequently utilized in regression problems to assess
how well predictive models work.
Mathematical Formula
For a dataset containing 'n' data points, the MSE calculation formula is:
$MSE = \frac{1}{n}\sum_{i=1}^{n} (x_i - y_i)^2$

Where:
$x_i$ represents the actual or observed value for the i-th data point.
$y_i$ represents the predicted value for the i-th data point.
R-squared (R²) Score: A statistical metric frequently used to assess the goodness of fit of a
regression model is the R-squared (R²) score, also referred to as the coefficient of
determination. It quantifies the proportion of the variance in the dependent variable that is
explained by the model's independent variables. R² is a useful statistic for evaluating the overall
effectiveness and explanatory power of a regression model.
Mathematical Formula
The formula to calculate the R-squared score is as follows:
$R^2 = 1 - \frac{SSR}{SST}$

Where:
$R^2$ is the R-squared score.
SSR represents the sum of squared residuals between the predicted values and the actual values.
SST represents the total sum of squares, which measures the total variance in the dependent
variable.
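To make the three formulas concrete, the short NumPy computation below evaluates each of them by
hand on a few hypothetical values (the numbers are chosen only for illustration):

import numpy as np

# Hypothetical actual (x) and predicted (y) values
x = np.array([3.0, 2.0, 4.0, 5.0])        # actual values
y = np.array([2.5, 2.0, 4.5, 5.5])        # predicted values
n = len(x)

mae = np.sum(np.abs(x - y)) / n           # MAE = (1/n) * sum |x_i - y_i|
mse = np.sum((x - y) ** 2) / n            # MSE = (1/n) * sum (x_i - y_i)^2
ssr = np.sum((x - y) ** 2)                # SSR: sum of squared residuals
sst = np.sum((x - np.mean(x)) ** 2)       # SST: total sum of squares
r2 = 1 - ssr / sst                        # R^2 = 1 - SSR/SST

print(mae, mse, r2)                       # 0.375 0.1875 0.85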
Program:
from sklearn.metrics import mean_absolute_error,mean_squared_error,r2_score
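The printed program ends at the import line, so the listing below is a minimal completion sketch;
the sample values are hypothetical (the data that produced the output shown below is not included
in this copy of the manual):

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted values (assumed for illustration)
y_true = [3.0, 2.5, 4.0, 5.5, 6.0]
y_pred = [2.8, 2.7, 4.3, 5.2, 6.1]

print("Mean Absolute Error:", mean_absolute_error(y_true, y_pred))
print("Mean Squared Error:", mean_squared_error(y_true, y_pred))
print("R2-Square:", r2_score(y_true, y_pred))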
Output:
Mean Absolute Error: 0.22000000000000003
Mean Squared Error: 0.057999999999999996
R2-Square: 0.9588769143505389
Conclusion: Thus, we have successfully evaluated the regression model using various metrics.
Practical No 2
Aim: Evaluate a classification model
Problem Statement: Use Iris Dataset to implement KNN Classification.
Software Required: Python, VS Code or Google Colab
Theory: Classification is a type of supervised learning in machine learning, where the goal is to
predict the categorical label (or class) of a given input based on labeled training data. In other
words, the model is trained on a dataset where the input data is associated with a predefined class,
and it learns to map the input features to these classes. After training, the model can then classify
new, unseen data into one of the predefined classes.
Types of Classification
1. Binary Classification:
o A classification problem where the task is to classify the input data into one of two
classes.
o Example: Predicting whether an email is "spam" or "not spam."
2. Multiclass Classification:
o A classification problem where the task is to classify the input data into more than
two classes.
o Example: Predicting the type of flower (Iris dataset) into categories like "setosa,"
"versicolor," or "virginica."
3. Multilabel Classification:
o A problem where each data point can be assigned multiple labels simultaneously.
o Example: Predicting which genres a movie belongs to (action, comedy, drama, etc.).
A single movie can belong to multiple genres.
Evaluation Metrics for Classification: To assess the performance of classification models, various
metrics are used, depending on the nature of the task and the class distribution (a short
illustration follows the list):
1. Accuracy
2. Precision
3. Recall (Sensitivity)
4. F1 Score
5. Confusion Matrix
6. Area Under the ROC Curve (AUC-ROC)
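As a quick illustration of the first five metrics (using hypothetical binary labels, not the Iris
results shown later):

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# Hypothetical true and predicted labels (assumed for illustration)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy:", accuracy_score(y_true, y_pred))      # fraction of correct predictions
print("Precision:", precision_score(y_true, y_pred))    # TP / (TP + FP)
print("Recall:", recall_score(y_true, y_pred))          # TP / (TP + FN)
print("F1 Score:", f1_score(y_true, y_pred))            # harmonic mean of precision and recall
print("Confusion Matrix:")
print(confusion_matrix(y_true, y_pred))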
Program:
# Import necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Convert the features into a DataFrame to display the features with column names
iris_df = pd.DataFrame(X, columns=iris.feature_names)
iris_df['species'] = iris.target_names[y]
print("Iris Dataset Features:")
print(iris_df.head())

# Split the dataset into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features, then fit a K-Nearest Neighbors classifier (k = 5 assumed)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Evaluate the classifier on the test set
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
Output:
Iris Dataset Features:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  species
0                5.1               3.5                1.4               0.2   setosa
1                4.9               3.0                1.4               0.2   setosa
2                4.7               3.2                1.3               0.2   setosa
3                4.6               3.1                1.5               0.2   setosa
4                5.0               3.6                1.4               0.2   setosa
Accuracy: 1.00
Classification Report:
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
Confusion Matrix:
[[10 0 0]
[ 0 9 0]
[ 0 0 11]]
Output:
Training Mean Squared Error: 0.7051
Test Mean Squared Error: 0.6918
Training R²: 0.4737
Test R²: 0.4729
Program:
# Importing the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn import tree

# Loading the Iris dataset into a DataFrame
iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)
data['Species'] = iris.target

# Replacing the categories of the target variable with the actual names of the species
target = np.unique(iris.target)
target_n = np.unique(iris.target_names)
target_dict = dict(zip(target, target_n))
data['Species'] = data['Species'].replace(target_dict)

# Training a decision tree classifier and predicting on a held-out test set
# (split and tree parameters are assumed; they are missing from this copy)
X_train, X_test, y_train, y_test = train_test_split(
    data[iris.feature_names], data['Species'], test_size=0.25, random_state=42)
clf = tree.DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
matrix = metrics.confusion_matrix(y_test, y_pred)
target_labels = list(target_n)

# Plotting heatmap
fig, axis = plt.subplots()
sns.heatmap(matrix, annot=True, fmt="g", ax=axis, cmap="magma")
axis.set_title('Confusion Matrix')
axis.set_xlabel("Predicted Values", fontsize=10)
axis.set_xticklabels(target_labels)
axis.set_ylabel("True Labels", fontsize=10)
axis.set_yticklabels(target_labels, rotation=0)
plt.show()
Output: A heatmap of the confusion matrix for the decision tree classifier.
Underfitting: Underfitting occurs when a model is too simple to capture the underlying patterns
of the data. It is characterized by both high bias and low variance. The model is not complex
enough to learn the relationships in the data, resulting in poor performance on both the training
and test datasets.
Characteristics of Underfitting:
1. High bias, low variance
2. Insufficient model complexity
3. High training error, high test error
Why Does Underfitting Happen?
1. Too few features
2. Model too simple
3. Overly strong assumptions
Signs of Underfitting:
1. Poor performance on both training and test data.
2. The model is too simple and doesn't have the capacity to learn the data patterns.
3. The model may consistently make the same error (e.g., using a linear regression on non-linear data).
How to Avoid Underfitting?
1. Increase model complexity (illustrated in the sketch below)
2. Use more relevant features
3. Reduce regularization
4. Tune hyperparameters
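A minimal synthetic illustration of the first remedy (the data below is generated for this
example and is not part of the original program):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic quadratic data (assumed for illustration)
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, 100)).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(0, 0.5, 100)

# A straight line underfits the quadratic relationship (high bias)...
linear = LinearRegression().fit(X, y)
print("Linear R^2:", linear.score(X, y))

# ...while adding polynomial features increases model complexity and fixes it
quadratic = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)
print("Quadratic R^2:", quadratic.score(X, y))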
Program:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

# Load dataset
data = fetch_california_housing()
X = data.data[:, :1]  # Using just one feature for simplicity in visualization
y = data.target

# Split into training and test sets (split parameters assumed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# A linear model on a single feature underfits: errors are high on both sets
model = LinearRegression().fit(X_train, y_train)
print(f"Training Mean Squared Error: {mean_squared_error(y_train, model.predict(X_train)):.4f}")
print(f"Test Mean Squared Error: {mean_squared_error(y_test, model.predict(X_test)):.4f}")
print(f"Training R²: {model.score(X_train, y_train):.4f}")
print(f"Test R²: {model.score(X_test, y_test):.4f}")
Output:
Conclusion: Thus, we have successfully identified overfitting and underfitting.
Practical No 6
Aim: Implement a Logistic Regression Model on given datasets
Problem Statement: Use Diabetes dataset to implement and evaluate Logistic Regression.
Software Required: Python, VS Code or Google Colab
Theory: Logistic Regression is a statistical model for binary classification. It predicts the
probability that an instance belongs to a particular class by computing a linear combination of
the input features and passing it through the sigmoid function, which guarantees outputs between
0 and 1. The model's coefficients are optimized, typically with gradient descent, to minimize the
log loss, and these coefficients define the decision boundary that separates the two classes.
Overfitting can be avoided, and generalization improved, by applying regularization. Because of
its simplicity, interpretability, and versatility across domains, Logistic Regression is widely
used in machine learning for problems that involve binary outcomes.
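A tiny numerical illustration of the sigmoid step described above (the coefficients are
hypothetical, not learned from any dataset):

import numpy as np

def sigmoid(z):
    # Maps any real-valued score to a probability between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned coefficients, intercept, and one input instance
w, b = np.array([0.8, -0.4]), 0.1
x = np.array([2.0, 1.5])

z = np.dot(w, x) + b    # linear combination of the input features
p = sigmoid(z)          # predicted probability of the positive class
print(p, "-> predicted class:", int(p >= 0.5))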
Program:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, auc
# Load the diabetes dataset
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
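The listing above ends after loading the data. Because load_diabetes provides a continuous
target, the binary results shown below imply a binarization step; the following completion is a
minimal sketch that continues from the imports above, assuming the target is split at its median
(the manual does not state the actual threshold):

# Binarize the continuous target at its median (assumed threshold)
y = (y > np.median(y)).astype(int)

# Split, standardize, fit, and evaluate the model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, y_pred) * 100:.2f}%")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred))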
Output:
Accuracy: 73.03%
Confusion Matrix:
[[36 13]
[11 29]]
Classification Report:
              precision    recall  f1-score   support

           0       0.77      0.73      0.75        49
           1       0.69      0.72      0.71        40

    accuracy                           0.73        89
   macro avg       0.73      0.73      0.73        89
weighted avg       0.73      0.73      0.73        89
Practical No 7
Aim: Implement Gaussian Naïve Bayes learning on given dataset.
Problem Statement: Use Wine Dataset to implement Gaussian Naïve Bayes.
Software Required: Python, VS Code or Google Colab
Program:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sns
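The listing above contains only the imports; the steps below are a minimal completion sketch that
continues from them (the split parameters are assumptions, since the original body of the program
is missing from this copy):

# Load the Wine dataset and split it (test_size and random_state assumed)
wine = datasets.load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42)

# Fit Gaussian Naive Bayes and predict on the test set
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")

# Visualize the confusion matrix as a heatmap
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.show()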
Practical No 8
Aim: Implement PCA Algorithm
Problem Statement: Use Breast Cancer Dataset to implement PCA Algorithm.
Software Required: Python, VS Code or Google Colab
Theory: Principal Component Analysis (PCA) is a dimensionality reduction technique widely
used in data analysis, machine learning, and statistics. PCA transforms the data into a new
coordinate system, where the axes are the principal components that capture the maximum
variance in the data. This technique is essential for simplifying data, improving computational
efficiency, and enhancing visualization, especially in high-dimensional spaces.
Objective of PCA: The primary goal of PCA is to reduce the dimensionality of a dataset while
retaining as much variance (information) as possible. It achieves this by projecting the original
data onto a new set of orthogonal axes called principal components. These components are
ordered in such a way that the first principal component captures the maximum variance in the
data, followed by the second, and so on.
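The "maximum variance" idea can be seen directly by eigendecomposing the covariance matrix of a
small synthetic dataset (the data below is assumed for illustration; scikit-learn's PCA performs
an equivalent computation):

import numpy as np

# Small synthetic dataset (assumed): 200 points with two correlated features
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.8], [0.8, 0.6]])
X = X - X.mean(axis=0)                   # PCA operates on centered data

cov = np.cov(X, rowvar=False)            # covariance matrix of the features
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvectors are the principal components

# Sort components by decreasing variance; each eigenvalue's share of the total
# is the fraction of variance that component explains
order = np.argsort(eigvals)[::-1]
print("Explained variance ratio:", eigvals[order] / eigvals.sum())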
Why Use PCA?
Dimensionality Reduction: PCA helps in reducing the number of features (dimensions) in a
dataset while preserving the most important patterns in the data. This is especially useful for
visualizing high-dimensional data (e.g., reducing from 30 features to 2 or 3 for visualization).
Noise Reduction: By discarding components with low variance, PCA can help in eliminating
noisy features that might not contribute to the underlying structure of the data.
Improving Model Performance: Reducing the number of features can lead to better
performance for machine learning algorithms, particularly for models that struggle with
high-dimensional data.
Program:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
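The listing above stops at the imports and is also missing the PCA import itself; below is a
minimal completion sketch continuing from those imports (n_components=2 is assumed, consistent
with the two-value explained variance ratio shown in the output):

from sklearn.decomposition import PCA

# Load and standardize the Breast Cancer dataset
cancer = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(cancer.data)

# Project the data onto the first two principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print("Explained Variance Ratio:", pca.explained_variance_ratio_)

# Scatter plot of the two components, colored by diagnosis
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=cancer.target, cmap='coolwarm', s=15)
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.show()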
Output:
Explained Variance Ratio: [0.44272026 0.18971182]
Output:
Cluster Centroids (after K-Means clustering):
[[-0.92607185 -0.39404154 -0.49451676 0.17060184 -0.49171185
-0.07598265 0.02081257 -0.03353357 0.0582655 -0.90191402
0.46180361 0.27076419 -0.75384618]
[ 0.16490746 0.87154706 0.18689833 0.52436746 -0.07547277
-0.97933029 -1.21524764 0.72606354 -0.77970639 0.94153874
-1.16478865 -1.29241163 -0.40708796]
[ 0.83523208 -0.30380968 0.36470604 -0.61019129 0.5775868
0.88523736 0.97781956 -0.56208965 0.58028658 0.17106348
0.47398365 0.77924711 1.12518529]]
Inertia (Sum of squared distances to centroids):
1277.928488844642
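The program that produced the clustering output above is missing from this copy of the manual.
The sketch below produces the same kind of result under stated assumptions: the Wine dataset
(matching the 3 x 13 centroid matrix shown), standardized features, and k = 3:

from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Standardize the 13 Wine features, then cluster them into 3 groups
X = StandardScaler().fit_transform(load_wine().data)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans.fit(X)

print("Cluster Centroids (after K-Means clustering):")
print(kmeans.cluster_centers_)
print("Inertia (Sum of squared distances to centroids):")
print(kmeans.inertia_)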
Practical No 10
Aim: Implement Gaussian Mixture Models
Problem Statement: Use Digit Dataset to implement Gaussian Mixture Models.
Software Required: Python, VS Code or Google Colab
Program:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import PCA

# Load and standardize the digits dataset
digits = load_digits()
X = StandardScaler().fit_transform(digits.data)

# Fit a Gaussian Mixture Model (10 components assumed, one per digit class)
gmm = GaussianMixture(n_components=10, random_state=42)
labels = gmm.fit_predict(X)

# Project the data to two dimensions with PCA for visualization
X_pca = PCA(n_components=2).fit_transform(X)

plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis', edgecolor='k', s=100)
plt.title('Gaussian Mixture Model Clustering (Digits Dataset)')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.colorbar(label='Cluster Label')
plt.show()
Output: A scatter plot titled 'Gaussian Mixture Model Clustering (Digits Dataset)' showing the PCA-projected digits colored by cluster label.