ML Lab Manual 4-8

The document outlines the implementation of various machine learning algorithms using the sklearn library, including Multiple Linear Regression, Decision Trees, K-Nearest Neighbors (KNN), Logistic Regression, and K-Means Clustering, primarily using the Iris and California Housing datasets. Each section details the objective, theoretical background, dataset description, evaluation metrics, and code snippets for model training and evaluation. The document emphasizes the importance of model evaluation metrics such as accuracy, confusion matrix, and various clustering performance metrics.


5. Multiple Linear Regression for House Price Prediction

Objective
To implement a Multiple Linear Regression model using the sklearn library to predict house prices
based on various features of a given housing dataset.

Theory
Multiple Linear Regression is a supervised learning algorithm that models the relationship between
a dependent variable y and multiple independent variables x1, x2, ..., xn. The model's goal is to find a
linear relationship in the form:

y = b0 + b1 x1 + b2 x2 + · · · + bn xn

where b0 is the intercept, and b1, b2, ..., bn are the coefficients.
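
For completeness, sklearn's LinearRegression estimates these coefficients by ordinary least squares, i.e. by choosing b0, b1, ..., bn so as to minimize the sum of squared residuals over the training samples:

Σi ( yi − (b0 + b1 xi1 + b2 xi2 + · · · + bn xin) )²

where yi is the actual target of the i-th training sample and xij is its j-th feature value.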


Evaluation Metrics (formulas follow the list):

• Mean Squared Error (MSE): Measures the average squared difference between actual and predicted values. Lower values indicate a better fit.

• R-squared Score (R²): Represents the proportion of variance explained by the model. An R² closer to 1 indicates a better fit.
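
Stated explicitly (standard definitions, with yi the actual values, ŷi the predicted values, ȳ the mean of the actual values, and m the number of test samples):

MSE = (1/m) Σi (yi − ŷi)²

R² = 1 − Σi (yi − ŷi)² / Σi (yi − ȳ)²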

Dataset
The California Housing Dataset contains information about California housing, including median
house prices and features like median income, housing median age, etc. The dataset is accessed using
fetch_california_housing() from sklearn.datasets.

Program Code
# Required Libraries
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Load Dataset
house_data = fetch_california_housing()
x = pd.DataFrame(house_data.data, columns=house_data.feature_names)
y = pd.Series(house_data.target)

# Display Dataset Description
print(house_data.DESCR)

# Splitting the Dataset
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Initialize and Train the Model
model = LinearRegression().fit(X_train, y_train)

# Make Predictions
y_pred = model.predict(X_test)

# Model Evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Output Model Coefficients and Metrics
print("Model Coefficients:", model.coef_)
print("Model Intercept:", model.intercept_)
print("Mean Squared Error:", mse)
print("R-squared:", r2)

# Display Actual vs Predicted Values
comparison = pd.DataFrame({'Actual Price': y_test, 'Predicted Price': y_pred})
print(comparison.head())

# Visualization
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, color='blue', edgecolor='k', alpha=0.7)

plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)],
         color='red', linestyle='--', linewidth=2)

plt.title('Actual vs Predicted House Prices')
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.grid(True)
plt.show()

Conclusion
This program implements multiple linear regression to predict house prices using the California Housing
dataset imported from sklearn.

6. Implementation of Decision Tree using sklearn and Parameter Tuning
Objective
To implement a Decision Tree classifier on the Iris dataset, understand its structure, and enhance the
model’s performance using parameter tuning techniques with sklearn.

Theory
Decision Tree Classifier: A decision tree is a supervised learning algorithm used for both classification
and regression tasks. It splits the dataset into branches based on feature values, creating a structure
resembling a tree to classify or predict the target variable. For classification tasks, the decision tree uses
different criteria, such as Gini impurity or Entropy, to determine the best split for data at each node.
Decision trees can easily overfit; therefore, parameters like max_depth and min_samples_split are tuned
to prevent excessive branching.
Parameter Tuning: By adjusting parameters in the decision tree model, we can improve its accuracy
and avoid issues such as overfitting (where the model learns noise instead of patterns) or underfitting
(where the model is too simple). For a Decision Tree, the key parameters to tune include:
• criterion: Measures the quality of the split, using either gini or entropy (see the formulas after this list).

• max_depth: Limits the tree's depth to prevent overfitting.

• min_samples_split: Sets the minimum number of samples required to split an internal node.
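
For reference, the two split criteria named above are computed at a node from the class proportions pi (standard definitions):

Gini impurity: G = 1 − Σi pi²

Entropy: H = −Σi pi log2(pi)

A node containing samples of only one class has G = 0 and H = 0, which is why splits that produce purer child nodes are preferred.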

Dataset
Iris Dataset: The Iris dataset is a commonly used dataset for classification. It contains 150 instances,
with each instance described by four features (sepal length, sepal width, petal length, petal width) and
classified into one of three species (Setosa, Versicolor, Virginica).

• Features:

  – Sepal length (cm)
  – Sepal width (cm)
  – Petal length (cm)
  – Petal width (cm)

• Target: Species (Setosa, Versicolor, Virginica)

Evaluation Parameters
To assess the model’s performance, the following metrics are used:
• Accuracy: Measures the proportion of correct predictions out of total predictions.

• Cross-Validation Score: During parameter tuning, cross-validation helps verify how well the model generalizes across different data splits (a brief sketch follows this list).
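
As a brief standalone sketch of the cross-validation score (separate from the lab program below; the max_depth value here is arbitrary):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=42)

# cv=5 splits the data into 5 folds; each fold serves once as the validation set
scores = cross_val_score(clf, X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean CV accuracy:", scores.mean())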

Code
# Importing necessary libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import metrics
import matplotlib.pyplot as plt

from sklearn import tree

# Load the Iris dataset
iris = load_iris()
X = iris.data # Features
y = iris.target # Target

# Description of Iris Dataset
print(iris.DESCR)

# Splitting the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating the Decision Tree model with initial parameters
model = DecisionTreeClassifier(criterion='entropy', random_state=42)

# Training the model
model.fit(X_train, y_train)

# Predicting the target for the test data
y_pred = model.predict(X_test)

# Evaluating the model's accuracy
accuracy = metrics.accuracy_score(y_test, y_pred)
print("Accuracy before tuning:", accuracy)

# Parameter tuning using GridSearchCV
param_grid = {
'criterion': ['gini', 'entropy'],
'max_depth': [3, 5, 7, None],
'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Best parameters and accuracy after tuning
best_model = grid_search.best_estimator_
print("Best Parameters:", grid_search.best_params_)
print("Best Model Accuracy after tuning:", grid_search.best_score_)

# Final prediction using the best model
y_pred_best = best_model.predict(X_test)
final_accuracy = metrics.accuracy_score(y_test, y_pred_best)
print("Final Accuracy with tuned model:", final_accuracy)

# Visualizing the best Decision Tree
plt.figure(figsize=(12, 8))
tree.plot_tree(best_model, filled=True, feature_names=iris.feature_names, class_names=iris.target_names)
plt.title("Tuned Decision Tree")
plt.show()

Analysis

• Effect of Parameter Tuning: Parameter tuning has a significant effect on model performance. By optimizing max_depth and min_samples_split, we reduce overfitting and make the tree more generalized.

• Cross-Validation and Best Parameters: The GridSearchCV technique helps identify the best combination of parameters using cross-validation, which verifies the model's stability across different training data subsets.

• Model Accuracy: The accuracy improved after tuning, demonstrating the effectiveness of selecting optimal parameters. This parameter tuning process is essential for achieving balanced model performance.

Conclusion
This program demonstrates how to implement a Decision Tree classifier using sklearn, visualize it, and enhance its performance through parameter tuning. Proper tuning can help achieve a better accuracy score and an optimized decision tree structure that generalizes well on test data.

7. Implementation of K-Nearest Neighbors (KNN) Using sklearn
Objective
To implement the K-Nearest Neighbors (KNN) algorithm using the sklearn library to classify the Iris
dataset, exploring the effect of different values of k (number of neighbors).

Theory
K-Nearest Neighbors (KNN) is a simple, instance-based learning algorithm used for classification
and regression tasks. For a given data point, it finds the k-closest points in the training dataset and
classifies the point based on the majority class of these neighbors.
Distance Metric: The algorithm typically uses Euclidean distance to measure the proximity of data
points. The choice of k (number of neighbors) influences the algorithm’s bias and variance.

• Low k (e.g., 1) can lead to high variance (overfitting).

• High k can increase bias, making the model too simple (a short sketch comparing different values of k follows this list).
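
As a brief sketch of this trade-off (separate from the lab program below; the listed k values are arbitrary), test accuracy can simply be compared for several values of k:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data[:, :2]  # first two features, matching the program below
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Very small k tends to overfit; very large k tends to underfit
for k in [1, 3, 5, 11, 25]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k}: test accuracy = {accuracy_score(y_test, knn.predict(X_test)):.2f}")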

Dataset
Iris Dataset: This dataset contains measurements of sepal length, sepal width, petal length, and petal
width for 150 iris flowers, divided into three classes: Iris Setosa, Iris Versicolour, and Iris Virginica.
Features Used: For simplicity, only the first two features (sepal length and sepal width) are used
in this implementation.

Evaluation Parameters
• Accuracy: Measures the percentage of correctly classified samples in the test set.

• Confusion Matrix: Displays the counts of actual vs. predicted classifications, helping to evaluate the model's accuracy per class.

Code

# Required Libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix

# Load Iris dataset
iris = load_iris()
X = iris.data[:, :2] # Take only the first two features for simplicity
y = iris.target
print(iris.DESCR) # Display dataset description

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize KNN model with k=3
knn = KNeighborsClassifier(n_neighbors=3)

# Train the model
knn.fit(X_train, y_train)

# Predict on test set
y_pred = knn.predict(X_test)

# Evaluation Criteria
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
print("Confusion Matrix:")
print(conf_matrix)

# Visualization
plt.figure(figsize=(8, 6))

plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train,
            cmap='coolwarm', marker='*', s=100, label='Training Data')

plt.scatter(X_test[:, 0], X_test[:, 1], c='green', marker='x', s=200, label='Test Data')

# Draw lines to nearest neighbors for each test point
for test_point in X_test:
    distances, indices = knn.kneighbors([test_point])
    for index in indices[0]:
        neighbor = X_train[index]
        plt.plot([test_point[0], neighbor[0]], [test_point[1], neighbor[1]], 'k--', lw=1)

plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.title('K-Nearest Neighbors Visualization (Iris Dataset)')
plt.legend()
plt.show()

Conclusion
The K-Nearest Neighbors algorithm is sensitive to the choice of k, which impacts classification accuracy.
This program implements the KNN algorithm on the Iris dataset, uses proximity to classify data points,
and reports the resulting accuracy and confusion matrix.

8. Implementation of Logistic Regression Using sklearn
Objective
To implement a Logistic Regression model using the sklearn library for binary classification using the
Iris dataset and evaluate its performance based on accuracy, confusion matrix, and classification report.

Theory
Logistic Regression is a supervised learning algorithm used for binary classification problems. It
models the probability of a binary outcome based on one or more predictor variables.
• The model uses the sigmoid function to convert linear predictions into probabilities:

σ(z) = 1 / (1 + e^(−z)), where z = w^T x + b.

• Binary Classification: For binary outcomes, logistic regression predicts either 0 or 1 by comparing the predicted probability against a threshold, typically 0.5 (a short sketch follows this list).

• Evaluation Metrics: Common metrics include accuracy, precision, recall, F1-score, and the confusion matrix.
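
As a brief sketch of the sigmoid and the 0.5 threshold (it reuses the same two-class, two-feature subset of Iris as the program below, and is only an illustration):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

data = load_iris()
X = data.data[data.target != 2, :2]  # two classes, first two features
y = data.target[data.target != 2]
model = LogisticRegression().fit(X, y)

# Probability of class 1 for the first sample, computed manually as sigmoid(w^T x + b)
z = X[0] @ model.coef_[0] + model.intercept_[0]
print("sigmoid(z):", sigmoid(z))
print("predict_proba:", model.predict_proba(X[:1])[0, 1])  # should match the value above
print("predicted label (threshold 0.5):", int(sigmoid(z) >= 0.5))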

Dataset
• Iris Dataset: This dataset contains measurements of various features for three types of Iris flowers. For simplicity, only two of the three classes are used in this implementation (binary classification).

• Features Used: Only the first two features (sepal length and sepal width) are used to simplify visualization.

Evaluation Parameters
• Accuracy: The proportion of correctly classified samples out of the total samples.

• Confusion Matrix: A table layout that allows visualization of the performance of an algorithm, showing actual vs. predicted classes.

• Classification Report: Provides precision, recall, F1-score, and support for each class, which helps in understanding model performance in more detail.

Code
# Required Libraries
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Loading the dataset (using Iris dataset as an example)
data = load_iris()
X = data.data
y = data.target

# Selecting only two classes for binary classification
X = X[y != 2, :2]  # Using only the first two features for easy visualization
y = y[y != 2]

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Creating the Logistic Regression model
model = LogisticRegression()

# Training the model
model.fit(X_train, y_train)

# Making predictions
y_pred = model.predict(X_test)

# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

# Printing evaluation metrics
print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", class_report)

Conclusion
This Program demonstrates the implementation of Logistic Regression for binary classification on the
Iris dataset:
• Accuracy: The model's accuracy shows the proportion of correctly classified instances in the test data.

• Confusion Matrix: The confusion matrix allows analysis of misclassifications and correctly classified instances for each class.

• Classification Report: Shows detailed metrics (precision, recall, F1-score) for each class, helping in evaluating the model's effectiveness.

9. Implementation of K-Means Clustering Using sklearn
Objective
To implement the K-Means clustering algorithm using the sklearn library for unsupervised clustering
on the Iris dataset and to evaluate clustering performance using metrics such as inertia, adjusted Rand
index, and silhouette score.

Theory
K-Means Clustering is an unsupervised learning algorithm that partitions a dataset into K distinct,
non-overlapping clusters. Each cluster is defined by its centroid, which is the mean of the points in that
cluster.
• Steps of K-Means (a minimal NumPy sketch of this loop follows the list):

  1. Randomly initialize K centroids.
  2. Assign each data point to the closest centroid, forming clusters.
  3. Recalculate the centroid of each cluster based on the points assigned to it.
  4. Repeat steps 2 and 3 until the centroids stabilize.

• Distance Metric: The clusters are assigned using Euclidean distance, which measures the straight-line distance between data points and centroids.

• Hyperparameter K: The number of clusters K must be set manually.
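
As an illustration of these steps, here is a minimal NumPy sketch of the assign/update loop (toy random data, a fixed number of iterations, and clusters assumed to stay non-empty; sklearn's KMeans adds smarter initialization and convergence checks):

import numpy as np

rng = np.random.default_rng(42)
X = rng.random((150, 2))                              # toy 2-D data
K = 3
centroids = X[rng.choice(len(X), K, replace=False)]  # Step 1: random initialization

for _ in range(10):                                   # fixed iteration count for brevity
    # Step 2: assign each point to its nearest centroid (Euclidean distance)
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Step 3: move each centroid to the mean of its assigned points
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])

print("Final centroids:\n", centroids)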

Dataset
• Iris Dataset: Contains measurements of features for three Iris flower species. Only the first two features are used for 2D visualization.

• Features Used: Sepal length and sepal width.

Evaluation Parameters
• Cluster Centroids: Mean points of each cluster.

• Inertia: Sum of squared distances between each point and its cluster centroid.

• Adjusted Rand Index (ARI): Measures similarity between true labels and clustering labels. ARI is 1 for perfect clustering and close to 0 for random labelling.

• Silhouette Score: Measures how similar each point is to its own cluster compared to other clusters. Scores range from -1 (poor) to +1 (ideal clustering); the per-point formula is given below.
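
For reference, the silhouette value of a single point i is commonly defined as

s(i) = (b(i) − a(i)) / max(a(i), b(i))

where a(i) is the mean distance from point i to the other points in its own cluster and b(i) is the mean distance from point i to the points of the nearest other cluster; silhouette_score reports the average of s(i) over all points.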

Code
# Required Libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Load Iris Dataset
iris = load_iris()
X = iris.data[:, :2] # Use only the first two features for 2D visualization
y = iris.target # True labels for comparison (not used in clustering)

# Plotting the Original Data
plt.figure(figsize=(12, 5))

# Original Data
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', marker='o', edgecolor='k', s=100)
plt.title('Original Iris Dataset')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid()

# K-Means Clustering
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)
predictions = kmeans.predict(X)

# Evaluation Metrics
inertia = kmeans.inertia_
ari = adjusted_rand_score(y, predictions)
silhouette_avg = silhouette_score(X, predictions)

print(f"Inertia: {inertia:.2f}")
print(f"Adjusted Rand Index (ARI): {ari:.2f}")
print(f"Silhouette Score: {silhouette_avg:.2f}")

# Plotting the K-Means Result
plt.subplot(1, 2, 2)
plt.scatter(X[:, 0], X[:, 1], c=predictions, cmap='viridis', marker='o', edgecolor='k', s=100)

plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='red', marker='X', s=200, label='Centroids')

plt.title('K-Means Clustering Result')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid()

plt.tight_layout()
plt.show()

Conclusion
This experiment demonstrates the implementation of K-Means clustering and includes evaluation metrics
for a deeper understanding:
• Centroids: The centroids serve as the centers of each cluster.

• Euclidean Distance for Assignment: Data points are assigned to clusters based on the minimum Euclidean distance to the cluster centroids.

• Inertia: Measures compactness within clusters.

• Adjusted Rand Index (ARI): Shows similarity between true labels and predicted clusters.

• Silhouette Score: Measures the quality of clustering; higher values indicate well-defined clusters.

This experiment provides insights into unsupervised clustering and evaluates clustering quality through
multiple metrics.

