ML Lab Manual 4-8
5. Implementation of Multiple Linear Regression Using sklearn
Objective
To implement a Multiple Linear Regression model using the sklearn library to predict house prices
based on various features of a given housing dataset.
Theory
Multiple Linear Regression is a supervised learning algorithm that models the relationship between
a dependent variable y and multiple independent variables x1 , x2 , . . . , xn . The model’s goal is to find a
linear relationship in the form:
y = b0 + b1 x1 + b2 x2 + · · · + bn xn
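In sklearn, the intercept b0 and the weights b1, . . . , bn of this equation are exposed after fitting as intercept_ and coef_. A minimal sketch on a toy dataset (the numbers below are made up purely for illustration):
# Toy illustration: recovering b0 and b1, b2 from a fitted LinearRegression
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from y = 2 + 3*x1 - 1*x2 (no noise)
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]], dtype=float)
y = 2 + 3 * X[:, 0] - 1 * X[:, 1]

model = LinearRegression().fit(X, y)
print(model.intercept_)  # b0, approximately 2
print(model.coef_)       # [b1, b2], approximately [3, -1]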
Dataset
The California Housing Dataset contains information about California housing, including median
house prices and features like median income, housing median age, etc. The dataset is accessed using
fetch_california_housing() from sklearn.datasets.
Program Code
# Required Libraries
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
# Load Dataset
house_data = fetch_california_housing()
X = pd.DataFrame(house_data.data, columns=house_data.feature_names)
y = pd.Series(house_data.target)
# Split into Training and Testing Sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize and Train the Model
model = LinearRegression().fit(X_train, y_train)
# Make Predictions
y_pred = model.predict(X_test)
# Model Evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared Score: {r2:.2f}")
# Visualization: Actual vs. Predicted House Prices
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, color='blue', edgecolor='k', alpha=0.7)
plt.xlabel('Actual House Price')
plt.ylabel('Predicted House Price')
plt.title('Multiple Linear Regression: Actual vs. Predicted Prices')
plt.show()
Conclusion
This program implements multiple linear regression to predict house prices from the California Housing dataset imported from sklearn.
6. Implementation of Decision Tree using sklearn and Parameter
Tuning
Objective
To implement a Decision Tree classifier on the Iris dataset, understand its structure, and enhance the
model’s performance using parameter tuning techniques with sklearn.
Theory
Decision Tree Classifier: A decision tree is a supervised learning algorithm used for both classification
and regression tasks. It splits the dataset into branches based on feature values, creating a structure
resembling a tree to classify or predict the target variable. For classification tasks, the decision tree uses
different criteria, such as Gini impurity or Entropy, to determine the best split for data at each node.
Decision trees can easily overfit; therefore, parameters like max_depth and min_samples_split are tuned to prevent excessive branching.
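For reference, for a node whose samples have class proportions p1, . . . , pk, the two standard split criteria are Gini impurity G = 1 − (p1^2 + · · · + pk^2) and Entropy H = −(p1 log2 p1 + · · · + pk log2 pk); both equal 0 for a pure node and are largest when the classes are evenly mixed.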
Parameter Tuning: By adjusting parameters in the decision tree model, we can improve its accuracy
and avoid issues such as overfitting (where the model learns noise instead of patterns) or underfitting
(where the model is too simple). For a Decision Tree, the key parameters to tune include:
criterion: Measures the quality of the split, using either gini or entropy.
max_depth: Limits the maximum depth of the tree, controlling how complex it can grow.
min_samples_split: Sets the minimum number of samples required to split an internal node.
Dataset
Iris Dataset: The Iris dataset is a commonly used dataset for classification. It contains 150 instances,
with each instance described by four features (sepal length, sepal width, petal length, petal width) and
classified into one of three species (Setosa, Versicolor, Virginica).
Features: sepal length, sepal width, petal length, petal width (all measured in cm).
Evaluation Parameters
To assess the model’s performance, the following metrics are used:
Accuracy: Measures the proportion of correct predictions out of total predictions.
Cross-Validation Score: During parameter tuning, cross-validation helps verify how well the
model generalizes across different data splits.
Code
# Importing necessary libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import metrics
import matplotlib.pyplot as plt
from sklearn import tree
# Loading the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Splitting the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
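The excerpt ends at the train/test split; a minimal sketch of the training and tuning step discussed in the Analysis below, where the parameter grid values are illustrative assumptions rather than prescribed settings:
# Hypothetical parameter grid; the exact values are illustrative
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 5, 10],
}
# Grid search with 5-fold cross-validation over the decision tree parameters
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                           param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)
print("Best cross-validation accuracy:", grid_search.best_score_)
# Evaluate the tuned tree on the held-out test set
best_tree = grid_search.best_estimator_
y_pred = best_tree.predict(X_test)
print("Test accuracy:", metrics.accuracy_score(y_test, y_pred))
# Visualize the tuned tree
plt.figure(figsize=(12, 8))
tree.plot_tree(best_tree, feature_names=iris.feature_names,
               class_names=list(iris.target_names), filled=True)
plt.show()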
Analysis
Effect of Parameter Tuning: Parameter tuning has a significant effect on model performance.
By optimizing max depth and min samples split, we reduce overfitting and make the tree more
generalized.
Cross-Validation and Best Parameters: The GridSearchCV technique helps identify the best combination of parameters using cross-validation, which verifies the model's stability across different training data subsets.
Model Accuracy: The accuracy improved after tuning, demonstrating the effectiveness of selecting optimal parameters. This parameter tuning process is essential for achieving balanced model performance.
Conclusion
This program demonstrates how to implement a Decision Tree classifier using sklearn, visualize it, and enhance its performance through parameter tuning. Proper tuning can help achieve a better accuracy score and an optimized decision tree structure that generalizes well on test data.
7. Implementation of K-Nearest Neighbors (KNN) Using sklearn
Objective
To implement the K-Nearest Neighbors (KNN) algorithm using the sklearn library to classify the Iris
dataset, exploring the effect of different values of k (number of neighbors).
Theory
K-Nearest Neighbors (KNN) is a simple, instance-based learning algorithm used for classification
and regression tasks. For a given data point, it finds the k-closest points in the training dataset and
classifies the point based on the majority class of these neighbors.
Distance Metric: The algorithm typically uses Euclidean distance to measure the proximity of data
points. The choice of k (number of neighbors) influences the algorithm’s bias and variance.
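Since the objective is to explore different values of k, the following self-contained sketch (the k values chosen are illustrative) compares test accuracy across several settings of k:
# Compare KNN test accuracy for several values of k (illustrative range)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X = iris.data[:, :2]   # first two features, as in this experiment
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for k in [1, 3, 5, 7, 9, 15]:
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    print(f"k = {k:2d}  test accuracy = {clf.score(X_test, y_test):.3f}")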
Dataset
Iris Dataset: This dataset contains measurements of sepal length, sepal width, petal length, and petal
width for 150 iris flowers, divided into three classes: Iris Setosa, Iris Versicolour, and Iris Virginica.
Features Used: For simplicity, only the first two features (sepal length and sepal width) are used
in this implementation.
Evaluation Parameters
Accuracy: Measures the percentage of correctly classified samples in the test set.
Confusion Matrix: Displays the counts of actual vs. predicted classifications, helping to evaluate
the model’s accuracy per class.
Code
# Required Libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
# Load Dataset (first two features only: sepal length and sepal width)
iris = load_iris()
X = iris.data[:, :2]
y = iris.target
# Split into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize and Train the KNN Classifier (k=5, the sklearn default; adjust as needed)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
# Predict on test set
y_pred = knn.predict(X_test)
# Evaluation Criteria
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
print("Confusion Matrix:")
print(conf_matrix)
# Visualization: test points coloured by predicted class
plt.figure(figsize=(8, 6))
for label, name in enumerate(iris.target_names):
    mask = (y_pred == label)
    plt.scatter(X_test[mask, 0], X_test[mask, 1], label=name, edgecolor='k')
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.title('K-Nearest Neighbors Visualization (Iris Dataset)')
plt.legend()
plt.show()
Conclusion
The K-Nearest Neighbors algorithm is sensitive to the choice of k, which impacts classification accuracy.
This program implements the KNN algorithm on the Iris dataset, uses proximity to classify data points, and reports the accuracy metrics.
8. Implementation of Logistic Regression Using sklearn
Objective
To implement a Logistic Regression model using the sklearn library for binary classification using the
Iris dataset and evaluate its performance based on accuracy, confusion matrix, and classification report.
Theory
Logistic Regression is a supervised learning algorithm used for binary classification problems. It
models the probability of a binary outcome based on one or more predictor variables.
The model uses the sigmoid function to convert linear predictions into probabilities:
σ(z) = 1 / (1 + e^(−z)), where z = w^T x + b.
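A small sketch showing how the sigmoid maps a linear score z to a probability and how the usual 0.5 threshold yields a class label (the weights below are toy values, not fitted ones):
import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real z to the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Toy weights, bias, and input (illustrative values only)
w = np.array([1.5, -0.8])
b = 0.2
x = np.array([2.0, 1.0])

z = w @ x + b            # z = w^T x + b
p = sigmoid(z)           # predicted probability of class 1
label = int(p >= 0.5)    # apply the 0.5 decision threshold
print(f"z = {z:.2f}, p = {p:.3f}, predicted class = {label}")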
Binary Classification: For binary outcomes, logistic regression predicts either 0 or 1 based on a
threshold, typically 0.5.
Evaluation Metrics: Common metrics include accuracy, precision, recall, F1-score, and the
confusion matrix.
Dataset
Iris Dataset: This dataset contains measurements of various features for three types of Iris flowers.
For simplicity, only two of the three classes are used in this implementation (binary classification).
Features Used: Only the first two features (sepal length and sepal width) are used to simplify
visualization.
Evaluation Parameters
Accuracy: The proportion of correctly classified samples out of the total samples.
Confusion Matrix: A table layout that allows visualization of the performance of an algorithm,
showing actual vs. predicted classes.
Classification Report: Provides precision, recall, F1-score, and support for each class, which
helps understand model performance in a more detailed manner.
Code
# Required Libraries
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Load Dataset (binary classification: first two Iris classes, first two features)
iris = load_iris()
X = iris.data[iris.target != 2, :2]
y = iris.target[iris.target != 2]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Initializing and training the model
model = LogisticRegression().fit(X_train, y_train)
# Making predictions
y_pred = model.predict(X_test)
# Model Evaluation
print(f"Accuracy: {accuracy_score(y_test, y_pred) * 100:.2f}%")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred))
Conclusion
This program demonstrates the implementation of Logistic Regression for binary classification on the
Iris dataset:
Accuracy: The model’s accuracy shows the proportion of correctly classified instances in the test
data.
Confusion Matrix: The confusion matrix allows analysis of misclassifications and correctly classified instances for each class.
Classification Report: Shows detailed metrics (precision, recall, F1-score) for each class, helping
in evaluating the model’s effectiveness.
9. Implementation of K-Means Clustering Using sklearn
Objective
To implement the K-Means clustering algorithm using the sklearn library for unsupervised clustering
on the Iris dataset and to evaluate clustering performance using metrics such as inertia, adjusted Rand
index, and silhouette score.
Theory
K-Means Clustering is an unsupervised learning algorithm that partitions a dataset into K distinct,
non-overlapping clusters. Each cluster is defined by its centroid, which is the mean of the points in that
cluster.
Steps of K-Means:
1. Choose K initial centroids.
2. Assign each data point to the nearest centroid.
3. Recompute each centroid as the mean of the points assigned to it.
4. Repeat steps 2 and 3 until the assignments no longer change.
Distance Metric: The clusters are assigned using Euclidean distance, which measures the
straight-line distance between data points and centroids.
Hyperparameter K: The number of clusters K must be set manually.
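Because K must be chosen by hand, a common aid is the elbow method: plot the inertia for a range of candidate K values and look for the point where the decrease levels off. A minimal sketch, with the candidate range as an illustrative assumption:
# Elbow method: inertia for K = 1..8 (range is an illustrative choice)
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data[:, :2]   # first two features, as in this experiment

inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X)
    inertias.append(km.inertia_)

plt.plot(range(1, 9), inertias, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('Inertia')
plt.title('Elbow Method for Choosing K')
plt.show()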
Dataset
Iris Dataset: Contains measurements of features for three Iris flower species. Only the first two
features are used for 2D visualization.
Features Used: Sepal length and sepal width.
Evaluation Parameters
Cluster Centroids: Mean points of each cluster.
Inertia: Sum of squared distances between each point and its cluster centroid.
Adjusted Rand Index (ARI): Measures similarity between true labels and clustering labels.
ARI is 1 for perfect clustering and 0 for random.
Silhouette Score: Measures how similar each point is to its own cluster compared to other
clusters. Scores range from -1 (poor) to +1 (ideal clustering).
Code
# Required Libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score, silhouette_score
# Load Dataset (first two features for 2D visualization)
iris = load_iris()
X = iris.data[:, :2]
y = iris.target
plt.figure(figsize=(12, 5))
# Original Data
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', marker='o', edgecolor='k', s=100)
plt.title('Original Iris Dataset')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid()
# K-Means Clustering
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)
predictions = kmeans.predict(X)
# Evaluation Metrics
inertia = kmeans.inertia_
ari = adjusted_rand_score(y, predictions)
silhouette_avg = silhouette_score(X, predictions)
print(f"Inertia: {inertia:.2f}")
print(f"Adjusted Rand Index (ARI): {ari:.2f}")
print(f"Silhouette Score: {silhouette_avg:.2f}")
# Clustered Data
plt.subplot(1, 2, 2)
plt.scatter(X[:, 0], X[:, 1], c=predictions, cmap='viridis', marker='o', edgecolor='k', s=100)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='red', marker='X', s=200, label='Centroids')
plt.title('K-Means Clustering Results')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid()
plt.tight_layout()
plt.show()
Conclusion
This experiment demonstrates the implementation of K-Means clustering and includes evaluation metrics
for a deeper understanding:
Centroids: The centroids serve as the centers of each cluster.
Euclidean Distance for Assignment: Data points are assigned to clusters based on the minimum Euclidean distance to the cluster centroids.
Inertia: Measures compactness within clusters.
Adjusted Rand Index (ARI): Shows similarity between true labels and predicted clusters.
Silhouette Score: Measures the quality of clustering; higher values indicate well-defined clusters.
This experiment provides insights into unsupervised clustering and evaluates clustering quality through
multiple metrics.