ML Lab34

The document describes a series of machine learning lab experiments: data preprocessing (imputation, anomaly detection, standardization, normalization, encoding) on Mumbai housing data, gradient descent for a two-feature regression, simple linear regression on height and weight data, logistic regression implemented from scratch, decision trees on the penguins dataset, an SVM and an ensemble of classifiers on the Iris dataset, and PCA via eigen decomposition.


Experiment 1

import pandas as pd
data = pd.read_csv('large_housing_data_mumbai.csv')
print("Original Data:")
print(data.head())

Original Data:
House_ID Bedrooms Size (sq ft) Price (INR) Location Year_Built
0 1 4.0 855.0 31356226.0 Juhu 2002.0
1 2 5.0 1847.0 27775439.0 Andheri 2004.0
2 3 NaN 2363.0 37325149.0 Bandra 2000.0
3 4 5.0 626.0 6147116.0 South Mumbai 2002.0
4 5 5.0 NaN 49899606.0 Worli NaN

#Imputation
#Handle missing values using the median for numerical columns and the most frequent value for categorical columns.
from sklearn.impute import SimpleImputer
num_features = ['Bedrooms', 'Size (sq ft)', 'Price (INR)', 'Year_Built']
cat_features = ['Location']
num_imputer = SimpleImputer(strategy='median')
data[num_features] = num_imputer.fit_transform(data[num_features])
cat_imputer = SimpleImputer(strategy='most_frequent')
data[cat_features] = cat_imputer.fit_transform(data[cat_features])
print("\nData After Imputation:")
print(data.head())

Data After Imputation:


House_ID Bedrooms Size (sq ft) Price (INR) Location Year_Built
0 1 4.0 855.0 31356226.0 Juhu 2002.0
1 2 5.0 1847.0 27775439.0 Andheri 2004.0
2 3 3.0 2363.0 37325149.0 Bandra 2000.0
3 4 5.0 626.0 6147116.0 South Mumbai 2002.0
4 5 5.0 1702.5 49899606.0 Worli 2012.0

#Anomaly Detection
#Detect anomalies in the dataset. Here, we use Z-scores and flag any row where a numerical column lies more than 3 standard deviations from its mean.
from scipy import stats
z_scores = stats.zscore(data[num_features])
data['Anomaly'] = (abs(z_scores) > 3).any(axis=1) # Mark anomalies
print("\nData After Anomaly Detection:")
print(data.head())
#Rule-Based Anomaly Detection
#simple rules where:
#A house with less than 1000 sq ft should have 1 to 2 bedrooms.
#A house with 1000-2000 sq ft should have 2 to 4 bedrooms.
#A house with more than 2000 sq ft should have 3 or more bedrooms.
def is_bedroom_size_reasonable(row):
    if row['Size (sq ft)'] < 1000:
        return 1 <= row['Bedrooms'] <= 2
    elif row['Size (sq ft)'] <= 2000:
        return 2 <= row['Bedrooms'] <= 4
    else:
        return row['Bedrooms'] >= 3
data['Bed_Size_Anomaly'] = ~data.apply(is_bedroom_size_reasonable, axis=1)
print("\nData After Rule-Based Anomaly Detection:")
print(data.head())

Data After Anomaly Detection:


House_ID Bedrooms Size (sq ft) Price (INR) Location Year_Built \
0 1 4.0 855.0 31356226.0 Juhu 2002.0
1 2 5.0 1847.0 27775439.0 Andheri 2004.0
2 3 3.0 2363.0 37325149.0 Bandra 2000.0
3 4 5.0 626.0 6147116.0 South Mumbai 2002.0
4 5 5.0 1702.5 49899606.0 Worli 2012.0

Anomaly
0 False
1 False
2 False
3 False
4 False

Data After Rule-Based Anomaly Detection:


House_ID Bedrooms Size (sq ft) Price (INR) Location Year_Built \
0 1 4.0 855.0 31356226.0 Juhu 2002.0
1 2 5.0 1847.0 27775439.0 Andheri 2004.0
2 3 3.0 2363.0 37325149.0 Bandra 2000.0
3 4 5.0 626.0 6147116.0 South Mumbai 2002.0
4 5 5.0 1702.5 49899606.0 Worli 2012.0

Anomaly Bed_Size_Anomaly
0 False True
1 False True
2 False False
3 False True
4 False True
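
For reference, the Z-score flag above can be reproduced without scipy, since a Z-score is just (x - mean) / std. A minimal cross-check, assuming data and num_features still hold the imputed (unscaled) values from the previous step:

# Manual Z-score cross-check (sketch); ddof=0 matches scipy.stats.zscore's default
z_manual = (data[num_features] - data[num_features].mean()) / data[num_features].std(ddof=0)
anomaly_manual = (z_manual.abs() > 3).any(axis=1)
print((anomaly_manual == data['Anomaly']).all())  # expected to print True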

#Standardization
#Standardize numerical features so they have a mean of 0 and a standard deviation of 1.
from sklearn.preprocessing import StandardScaler
# Standardize numericals
scaler = StandardScaler()
data[num_features] = scaler.fit_transform(data[num_features])
print("\nData After Standardization:")
print(data.head())

Data After Standardization:


House_ID Bedrooms Size (sq ft) Price (INR) Location Year_Built \
0 1 0.710719 -1.231366 -0.017529 Juhu -1.432248
1 2 1.432261 0.194650 -0.110953 Andheri -1.124866
2 3 -0.010823 0.936408 0.138203 Bandra -1.739631
3 4 1.432261 -1.560557 -0.675243 South Mumbai -1.432248
4 5 1.432261 -0.013071 0.466275 Worli 0.104664

Anomaly Bed_Size_Anomaly
0 False True
1 False True
2 False False
3 False True
4 False True
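
A quick sanity check on the result (a sketch, not part of the original run): after StandardScaler, each numerical column should have a mean of approximately 0 and a population standard deviation of approximately 1.

# Sanity check (sketch): StandardScaler uses the population std (ddof=0)
print(data[num_features].mean().round(6))
print(data[num_features].std(ddof=0).round(6))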

#Normalization
#Normalize numerical features to fit within the range [0, 1]
from sklearn.preprocessing import MinMaxScaler

normalizer = MinMaxScaler()
data[num_features] = normalizer.fit_transform(data[num_features])
print("\nData After Normalization:")
print(data.head())

Data After Normalization:


House_ID Bedrooms Size (sq ft) Price (INR) Location Year_Built \
0 1 0.75 0.142055 0.058226 Juhu 0.090909
1 2 1.00 0.540128 0.050309 Andheri 0.181818
2 3 0.50 0.747191 0.071422 Bandra 0.000000
3 4 1.00 0.050161 0.002493 South Mumbai 0.090909
4 5 1.00 0.482143 0.099222 Worli 0.545455

Anomaly Bed_Size_Anomaly
0 False True
1 False True
2 False False
3 False True
4 False True

#Encoding
#One-Hot Encode the categorical feature Location.
from sklearn.preprocessing import OneHotEncoder
# One-Hot Encoding for 'Location'
encoder = OneHotEncoder(sparse_output=False)  # `sparse` was renamed to `sparse_output` in scikit-learn 1.2
encoded_location = encoder.fit_transform(data[['Location']])
encoded_df = pd.DataFrame(encoded_location,
                          columns=encoder.get_feature_names_out(['Location']))

data_encoded = pd.concat([data, encoded_df], axis=1).drop('Location', axis=1)

print("\nData After Encoding:")


print(data_encoded.head())

Data After Encoding:


House_ID Bedrooms Size (sq ft) Price (INR) Year_Built Anomaly \
0 1 0.75 0.142055 0.058226 0.090909 False
1 2 1.00 0.540128 0.050309 0.181818 False
2 3 0.50 0.747191 0.071422 0.000000 False
3 4 1.00 0.050161 0.002493 0.090909 False
4 5 1.00 0.482143 0.099222 0.545455 False

Bed_Size_Anomaly Location_Andheri Location_Bandra Location_Borivali \


0 True 0.0 0.0 0.0
1 True 1.0 0.0 0.0
2 False 0.0 1.0 0.0
3 True 0.0 0.0 0.0
4 True 0.0 0.0 0.0

Location_Juhu Location_Malad Location_Pali Hill Location_South Mumbai


\
0 1.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 1.0
4 0.0 0.0 0.0 0.0

Location_Worli
0 0.0
1 0.0
2 0.0
3 0.0
4 1.0

Experiment 2
import numpy as np

class GradientDescentMSE:
    def __init__(self, lr=0.01, n_iters=1000):
        self.lr = lr
        self.n_iters = n_iters
        self.x1 = None
        self.x2 = None

    def fit(self, X, y):
        # Initialize parameters
        self.x1 = np.random.randn()  # Initialize x1
        self.x2 = np.random.randn()  # Initialize x2

        for _ in range(self.n_iters):
            # Compute predictions
            y_pred = self.x1 * X[:, 0] + self.x2 * X[:, 1]

            # Compute gradients for MSE
            grad_x1 = (2 / len(y)) * np.sum((y_pred - y) * X[:, 0])
            grad_x2 = (2 / len(y)) * np.sum((y_pred - y) * X[:, 1])

            # Update parameters
            self.x1 = self.x1 - self.lr * grad_x1
            self.x2 = self.x2 - self.lr * grad_x2

        return self.x1, self.x2

    def objective_function(self, X):
        return self.x1 * X[:, 0] + self.x2 * X[:, 1]

# Example dataset with two features


X = np.array([[0.5, 1.0], [1.0, 2.0], [1.5, 2.5], [2.0, 3.5]]) # Features
y = np.array([1.5, 2.5, 3.0, 4.0]) # True values

# Initialize and run gradient descent


gd_mse = GradientDescentMSE(lr=0.01, n_iters=1000)
final_x1, final_x2 = gd_mse.fit(X, y)
final_predictions = gd_mse.objective_function(X)

print(f"Final x1: {final_x1}, Final x2: {final_x2}")


print(f"Final Predictions: {final_predictions}")

Final x1: -1.3189740745133114, Final x2: 1.9351908822144432


Final Predictions: [1.27570384 2.55140769 2.85951609 4.13521994]
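
Since this model has no intercept term and minimizes squared error, the same coefficients can also be obtained in closed form with ordinary least squares; the gradient-descent estimates should be close to this solution once the iterations have converged. A minimal cross-check, reusing X and y from above:

# Closed-form least-squares solution for y ≈ x1*X[:, 0] + x2*X[:, 1] (sketch)
coef, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print("Closed-form x1, x2:", coef)
print("Closed-form predictions:", X @ coef)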
Experiment 3
import pandas as pd
import numpy as np
file_path = '/ml-linear-reg.csv'
data = pd.read_csv(file_path)
#display data
print(data)

Height Weight
0 151 63
1 174 81
2 138 56
3 186 91
4 128 47
5 136 57
6 179 76
7 163 72
8 152 62
9 131 48

#mean of X (Height) and Y (Weight)


x_mean = np.mean(data['Height'])
y_mean = np.mean(data['Weight'])

#Display
print(f"Mean of Height (x_mean): {x_mean}")
print(f"Mean of Weight (y_mean): {y_mean}")

Mean of Height (x_mean): 153.8


Mean of Weight (y_mean): 65.3

# Calculate xi - x_bar and yi - y_bar


data['xi-xbar'] = data['Height'] - x_mean
data['yi-ybar'] = data['Weight'] - y_mean

#Display xi - x_bar and yi - y_bar


print(data[['Height', 'Weight', 'xi-xbar', 'yi-ybar']])

Height Weight xi-xbar yi-ybar


0 151 63 -2.8 -2.3
1 174 81 20.2 15.7
2 138 56 -15.8 -9.3
3 186 91 32.2 25.7
4 128 47 -25.8 -18.3
5 136 57 -17.8 -8.3
6 179 76 25.2 10.7
7 163 72 9.2 6.7
8 152 62 -1.8 -3.3
9 131 48 -22.8 -17.3
# Calculate product of (xi - x_bar) and (yi - y_bar)
data['(xi-xbar)*(yi-ybar)'] = data['xi-xbar'] * data['yi-ybar']

# Display product of (xi - x_bar) and (yi - y_bar)


print(data[['Height', 'Weight', 'xi-xbar', 'yi-ybar', '(xi-xbar)*(yi-ybar)']])

Height Weight xi-xbar yi-ybar (xi-xbar)*(yi-ybar)


0 151 63 -2.8 -2.3 6.44
1 174 81 20.2 15.7 317.14
2 138 56 -15.8 -9.3 146.94
3 186 91 32.2 25.7 827.54
4 128 47 -25.8 -18.3 472.14
5 136 57 -17.8 -8.3 147.74
6 179 76 25.2 10.7 269.64
7 163 72 9.2 6.7 61.64
8 152 62 -1.8 -3.3 5.94
9 131 48 -22.8 -17.3 394.44

# Calculate square of (xi - x_bar)


data['sq(xi-xbar)'] = data['xi-xbar'] ** 2

# Display square of (xi - x_bar)


print(data[['Height', 'Weight', 'xi-xbar', 'yi-ybar', '(xi-xbar)*(yi-ybar)', 'sq(xi-xbar)']])

Height Weight xi-xbar yi-ybar (xi-xbar)*(yi-ybar) sq(xi-xbar)


0 151 63 -2.8 -2.3 6.44 7.84
1 174 81 20.2 15.7 317.14 408.04
2 138 56 -15.8 -9.3 146.94 249.64
3 186 91 32.2 25.7 827.54 1036.84
4 128 47 -25.8 -18.3 472.14 665.64
5 136 57 -17.8 -8.3 147.74 316.84
6 179 76 25.2 10.7 269.64 635.04
7 163 72 9.2 6.7 61.64 84.64
8 152 62 -1.8 -3.3 5.94 3.24
9 131 48 -22.8 -17.3 394.44 519.84

# Calculate sum of square of (xi - x_bar)


sum_sq_xi_xbar = np.sum(data['sq(xi-xbar)'])

# Display sum of square of (xi - x_bar)


print(f"Sum of square of (xi - x_bar): {sum_sq_xi_xbar}")

Sum of square of (xi - x_bar): 3927.6000000000004

# Calculate sum of (xi - x_bar) * (yi - y_bar)


sum_xiyi_xbar_ybar = np.sum(data['(xi-xbar)*(yi-ybar)'])

# Display sum of (xi - x_bar) * (yi - y_bar)


print(f"Sum of (xi - x_bar) * (yi - y_bar): {sum_xiyi_xbar_ybar}")
Sum of (xi - x_bar) * (yi - y_bar): 2649.6

# Calculate b1 (slope)
b1 = sum_xiyi_xbar_ybar / sum_sq_xi_xbar

# Display b1 (slope)
print(f"Slope (b1): {b1}")

Slope (b1): 0.6746104491292392

# Calculate b0 (intercept)
b0 = y_mean - b1 * x_mean

# Display b0 (intercept)
print(f"Intercept (b0): {b0}")

Intercept (b0): -38.45508707607699

# Define a function to predict weight from height


def predict(height):
    return b1 * height + b0

# Example prediction
height_new = 160
weight_prediction = predict(height_new)
print(f'Predicted weight for height {height_new} cm is {weight_prediction:.2f} kg')

Predicted weight for height 160 cm is 69.48 kg
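
The hand-computed slope and intercept can be cross-checked against scipy's built-in least-squares fit; a minimal sketch reusing the Height and Weight columns:

# Cross-check b1 and b0 with scipy.stats.linregress (sketch)
from scipy import stats
lr_result = stats.linregress(data['Height'], data['Weight'])
print(f"scipy slope: {lr_result.slope:.6f}, intercept: {lr_result.intercept:.6f}")
# Expected to match b1 ≈ 0.6746 and b0 ≈ -38.4551 computed above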


Experiment 4
import numpy as np
The sigmoid function is defined as $\phi(z) = \frac{1}{1 + e^{-z}}$

$\therefore \phi(\hat{y}) = \frac{1}{1 + e^{-\hat{y}}}$

$\therefore \hat{y} = \frac{1}{1 + e^{-(wx + b)}}$
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

class LogisticRegression():

    def __init__(self, lr=0.01, n_iters=1000):
        self.lr = lr
        self.n_iters = n_iters
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0

        for _ in range(self.n_iters):
            linear_pred = np.dot(X, self.weights) + self.bias
            y_pred = sigmoid(linear_pred)  # Logistic addition. Rest all is Linear Regression.

            dw = (1/n_samples) * np.dot(X.T, (y_pred - y))
            db = (1/n_samples) * np.sum(y_pred - y)

            self.weights = self.weights - self.lr*dw
            self.bias = self.bias - self.lr*db

    def predict(self, X):
        linear_pred = np.dot(X, self.weights) + self.bias
        y_pred = sigmoid(linear_pred)
        class_pred = [0 if y <= 0.5 else 1 for y in y_pred]
        return class_pred

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
import matplotlib.pyplot as plt

bc = datasets.load_breast_cancer()
X, y = bc.data, bc.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

model = LogisticRegression(lr=0.01)
model.fit(X_train,y_train)
y_pred = model.predict(X_test)

def accuracy(y_pred, y_test):
    return np.sum(y_pred == y_test) / len(y_test)

acc = accuracy(y_pred, y_test)


print(acc)

0.9210526315789473

C:\Users\rohra\AppData\Local\Temp\ipykernel_19392\4033946986.py:2:
RuntimeWarning: overflow encountered in exp
return 1/(1+np.exp(-x))
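
The overflow warning comes from np.exp(-x) for large-magnitude inputs; the result still saturates to 0 or 1, so the accuracy is unaffected, but a numerically safer sigmoid can clip its argument (or use scipy.special.expit). A sketch of a drop-in replacement:

# Numerically safer sigmoid (sketch): clip the argument so np.exp cannot overflow
def sigmoid(x):
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))
# Equivalent library implementation: from scipy.special import expit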

import pandas as pd

results = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})

print(results)

Actual Predicted
0 1 1
1 1 1
2 1 1
3 1 1
4 1 1
.. ... ...
109 1 1
110 0 0
111 1 0
112 0 0
113 0 0

[114 rows x 2 columns]


Experiment 5
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("penguins_size.csv")
df.head()

# EDA
# Missing Data
df.info()

df.isna().sum()

# What percentage are we dropping?


100*(10/344)
df = df.dropna()
df.info()

df.head()

df['sex'].unique()

df['island'].unique()

df = df[df['sex']!='.']
# Feature Engineering
pd.get_dummies(df)
pd.get_dummies(df.drop('species',axis=1),drop_first=True)

# Train and Test split


X = pd.get_dummies(df.drop('species',axis=1),drop_first=True)
y = df['species']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(X_train,y_train)
base_pred = model.predict(X_test)
from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Generate confusion matrix


y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
# Display the confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()

plt.show()

print(classification_report(y_test,base_pred))

model.feature_importances_
pd.DataFrame(index=X.columns, data=model.feature_importances_, columns=['Feature Importance'])

# Visualize the tree


from sklearn.tree import plot_tree
plt.figure(figsize=(12,8))
plot_tree(model);

from sklearn.tree import plot_tree


import matplotlib.pyplot as plt
# Convert X.columns to a list
plt.figure(figsize=(12, 8), dpi=150)
plot_tree(model, filled=True, feature_names=X.columns.tolist())

plt.show()

def report_model(model):
    model_preds = model.predict(X_test)
    print(classification_report(y_test, model_preds))
    print('\n')
    plt.figure(figsize=(12, 8), dpi=150)
    plot_tree(model, filled=True, feature_names=X.columns);
pruned_tree = DecisionTreeClassifier(max_depth=2)
pruned_tree.fit(X_train,y_train)
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

def report_model(model):
    # Print classification report if needed (e.g., precision, recall, etc.)
    print('\n')
    # Convert X.columns to a list before passing to plot_tree
    plt.figure(figsize=(12, 8), dpi=150)
    plot_tree(model, filled=True, feature_names=X.columns.tolist())
    plt.show()

pruned_tree = DecisionTreeClassifier(max_leaf_nodes=3)
pruned_tree.fit(X_train,y_train)
report_model(pruned_tree)

entropy_tree = DecisionTreeClassifier(criterion='entropy')
entropy_tree.fit(X_train,y_train)
report_model(entropy_tree)
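
Besides max_depth and max_leaf_nodes, DecisionTreeClassifier also supports cost-complexity pruning through the ccp_alpha parameter; a brief sketch of how it could be applied here (the alpha value is illustrative and would normally be chosen by cross-validation):

# Cost-complexity pruning (sketch): larger ccp_alpha prunes more aggressively
path = model.cost_complexity_pruning_path(X_train, y_train)
print(path.ccp_alphas)  # candidate alpha values computed from the training data
ccp_tree = DecisionTreeClassifier(ccp_alpha=0.01)  # illustrative value, not tuned
ccp_tree.fit(X_train, y_train)
report_model(ccp_tree)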
Experiment 6
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder

# Load dataset from CSV


data = pd.read_csv('iris.csv')

print(data)

# Use only the first two features for training and visualization
X = data.iloc[:, :2].values # First two features
y = data.iloc[:, -1].values # Target variable (last column)

# Encode target labels (species) to numeric values


label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Split the data into training and test sets


X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.3, random_state=42)

# Create and train the SVM model with RBF kernel


svm_rbf = SVC(kernel='rbf', gamma='auto')
svm_rbf.fit(X_train, y_train)

# Make predictions
y_pred = svm_rbf.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred) * 100
print(f"Accuracy: {accuracy:.4f}\n")

# Classification report (formatted as a DataFrame)


report_dict = classification_report(y_test, y_pred, target_names=label_encoder.classes_, output_dict=True)
report_df = pd.DataFrame(report_dict).transpose()
print("Classification Report:\n", report_df)

# Confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(conf_matrix)
print()

# Visualize the Confusion Matrix as a heatmap


plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=label_encoder.classes_, yticklabels=label_encoder.classes_)
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.title('Confusion Matrix Heatmap')
plt.show()

print()

# Visualize the decision boundary (for 2D data)


def plot_decision_boundary(X, y, model):
    h = .02  # step size in the mesh
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.contourf(xx, yy, Z, alpha=0.8, cmap=plt.cm.coolwarm)
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k', marker='o', cmap=plt.cm.coolwarm)
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title('SVM Decision Boundary with RBF Kernel')
    plt.show()

# Plot decision boundary using the test set


plot_decision_boundary(X_test, y_test, svm_rbf)
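
The RBF kernel's behaviour depends strongly on C and gamma; if tuning is wanted, a small grid search is a natural next step. A minimal sketch (the grid values are illustrative, not from the original experiment):

# Grid search over C and gamma for the RBF SVM (sketch)
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': ['scale', 'auto', 0.1, 1]}
grid = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)
print("Test accuracy with best model:", grid.best_estimator_.score(X_test, y_test))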
Experiment 7
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load dataset from CSV


data = pd.read_csv('iris.csv')

# Use only the first two features for training and visualization
X = data.iloc[:, :2].values # First two features
y = data.iloc[:, -1].values # Target variable (last column)

# Encode target labels (species) to numeric values


label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Split the data into training and test sets


X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.3, random_state=42)

# 1. SVM Model
svm_rbf = SVC(kernel='rbf', gamma='auto', probability=True)
svm_rbf.fit(X_train, y_train)
y_pred_svm = svm_rbf.predict(X_test)

# 2. Random Forest Model (Bagging)


rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

# 3. AdaBoost Model (Boosting)


ada = AdaBoostClassifier(base_estimator=SVC(kernel='linear', probability=True),
                         n_estimators=50, random_state=42)
ada.fit(X_train, y_train)
y_pred_ada = ada.predict(X_test)

# Combine predictions using majority voting


y_pred_ensemble = np.array([y_pred_svm, y_pred_rf, y_pred_ada])
y_pred_final = np.array([np.bincount(x).argmax() for x in y_pred_ensemble.T])

# Accuracy for each model


for model_name, y_pred in zip(['SVM', 'Random Forest', 'AdaBoost', 'Ensemble'],
[y_pred_svm, y_pred_rf, y_pred_ada, y_pred_final]):
accuracy = accuracy_score(y_test, y_pred)
print(f"{model_name} Accuracy: {accuracy:.4f}\n")

# Classification report for the ensemble model


report_dict = classification_report(y_test, y_pred_final, target_names=label_encoder.classes_, output_dict=True)
report_df = pd.DataFrame(report_dict).transpose()
print("Ensemble Classification Report:\n", report_df)

# Confusion matrix for the ensemble model


conf_matrix = confusion_matrix(y_test, y_pred_final)
print("\nEnsemble Confusion Matrix:")
print(conf_matrix)

# Visualize the Confusion Matrix as a heatmap for the ensemble model


plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=label_encoder.classes_, yticklabels=label_encoder.classes_)
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.title('Ensemble Confusion Matrix Heatmap')
plt.show()

# Visualize the decision boundary for the ensemble model


def plot_decision_boundary(X, y, model):
    h = .02  # step size in the mesh
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.contourf(xx, yy, Z, alpha=0.8, cmap=plt.cm.coolwarm)
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k', marker='o', cmap=plt.cm.coolwarm)
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title('Ensemble Decision Boundary with Bagging and Boosting')
    plt.show()

# Since we can't train an ensemble model directly, we just plot the decision boundary using the SVM model
plot_decision_boundary(X_test, y_test, svm_rbf)
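
If a single fitted ensemble model is wanted (for example, to plot its own decision boundary), scikit-learn's VotingClassifier can wrap the same three estimators and perform the majority vote internally. A sketch of the equivalent setup, reusing the imports and data above:

# VotingClassifier as an equivalent to the manual majority vote (sketch)
from sklearn.ensemble import VotingClassifier
voting = VotingClassifier(
    estimators=[('svm', SVC(kernel='rbf', gamma='auto', probability=True)),
                ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
                ('ada', AdaBoostClassifier(base_estimator=SVC(kernel='linear', probability=True),
                                           n_estimators=50, random_state=42))],
    voting='hard')  # 'hard' = majority vote on predicted class labels
voting.fit(X_train, y_train)
print("VotingClassifier Accuracy:", accuracy_score(y_test, voting.predict(X_test)))
plot_decision_boundary(X_test, y_test, voting)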
Experiment 8
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Step 1: Load the Iris dataset from a CSV file


data = pd.read_csv('iris.csv')

# Assuming the last column is the target label and the rest are features
X = data.iloc[:, :-1].values # Features (all rows, all columns except the last)
y = data.iloc[:, -1].values # Target (all rows, last column)

# Map string labels to integers


label_mapping = {'setosa': 0, 'versicolor': 1, 'virginica': 2}
y_numeric = np.array([label_mapping[label] for label in y])

# Step 2: Standardize the data


scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# Step 3: Calculate the covariance matrix


cov_matrix = np.cov(X_std.T)

# Step 4: Calculate the eigenvalues and eigenvectors


eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)

# Step 5: Sort the eigenvalues and eigenvectors


sorted_indices = np.argsort(eigenvalues)[::-1]
eigenvalues_sorted = eigenvalues[sorted_indices]
eigenvectors_sorted = eigenvectors[:, sorted_indices]

# Step 6: Select the number of principal components


n_components = 2
eigenvectors_subset = eigenvectors_sorted[:, :n_components]

# Step 7: Transform the data


X_pca = X_std.dot(eigenvectors_subset)

# Step 8: Visualize the results


plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_numeric, cmap='viridis', edgecolor='k', s=100)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA of Iris Dataset')
plt.grid()
plt.show()
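
The manual eigen decomposition can be cross-checked against scikit-learn's PCA; the projected coordinates should agree up to a sign flip of each component, since eigenvector signs are arbitrary. A minimal sketch:

# Cross-check with scikit-learn's PCA (sketch)
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca_sklearn = pca.fit_transform(X_std)
print("Explained variance ratio:", pca.explained_variance_ratio_)
# Compare the first few projected rows (each column may differ only by sign)
print(X_pca[:3])
print(X_pca_sklearn[:3])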
