Machine Learning (BCSL606) Lab Manual
Develop a program to create histograms for all numerical features and analyze the
distribution of each feature. Generate box plots for all numerical features and identify
any outliers. Use California Housing dataset.
Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
def load_and_prepare_data():
    housing = fetch_california_housing()
    df = pd.DataFrame(housing.data, columns=housing.feature_names)
    df['PRICE'] = housing.target
    return df
def create_distribution_plots(df, save_plots=False):
    numerical_features = df.columns
    n_features = len(numerical_features)
    n_rows = (n_features + 1) // 2
    # Create histograms
    plt.figure(figsize=(15, 5*n_rows))
    for idx, feature in enumerate(numerical_features, 1):
        plt.subplot(n_rows, 2, idx)
        sns.histplot(df[feature], kde=True)
        plt.title(f'Distribution of {feature}')
        plt.xlabel(feature)
        plt.ylabel('Count')
    plt.tight_layout()
    if save_plots:
        plt.savefig('histograms.png')
    plt.show()
    # Create box plots
    plt.figure(figsize=(15, 5*n_rows))
    for idx, feature in enumerate(numerical_features, 1):
        plt.subplot(n_rows, 2, idx)
        sns.boxplot(data=df[feature])
        plt.title(f'Box Plot of {feature}')
    plt.tight_layout()
    if save_plots:
        plt.savefig('boxplots.png')
    plt.show()
def analyze_distributions(df):
    stats_summary = df.describe()
    outlier_summary = {}
    for column in df.columns:
        Q1 = df[column].quantile(0.25)
        Q3 = df[column].quantile(0.75)
        IQR = Q3 - Q1
        # Points beyond 1.5*IQR from the quartiles are treated as outliers
        outliers = df[(df[column] < Q1 - 1.5 * IQR) | (df[column] > Q3 + 1.5 * IQR)][column]
        outlier_summary[column] = {'number_of_outliers': len(outliers)}
    return stats_summary, outlier_summary
def main():
    df = load_and_prepare_data()
    create_distribution_plots(df)
    stats_summary, outlier_summary = analyze_distributions(df)
    print(stats_summary)
    print("\nOutlier Analysis:")
    for feature, info in outlier_summary.items():
        print(f"\n{feature}:")
        print(f"  Number of outliers: {info['number_of_outliers']}")

if __name__ == "__main__":
    main()
Output
Introduction
The code performs an exploratory data analysis (EDA) on California housing data. EDA is a
crucial first step in understanding your dataset before performing any advanced analysis or
modeling. This analysis focuses on understanding the distribution of housing features and
prices across California.
The California Housing dataset is a standard dataset in scikit-learn containing housing prices
and related features. The data preparation step converts this into a pandas DataFrame, which is
a table-like structure where:
Each column represents a different feature (like house price, income, population)
Each row represents a single observation (one California census block group)
Distribution Analysis
1. Visual Analysis: The distribution plots help understand how values are spread across
each feature:
o Histograms show the overall shape of each feature's distribution (skew, spread, modality)
o Box plots reveal the median, quartiles, and potential outliers in the data
o Outlier detection uses the 1.5 × IQR rule: any point beyond 1.5 times the IQR
from the quartiles is considered an outlier (see the sketch below)
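As a minimal, self-contained sketch of this 1.5 × IQR check (assuming the standard scikit-learn fetch_california_housing loader):

import pandas as pd
from sklearn.datasets import fetch_california_housing

# Load the California Housing data into a DataFrame
housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['PRICE'] = housing.target

# 1.5 * IQR rule: flag points beyond the box-plot whiskers
for column in df.columns:
    Q1, Q3 = df[column].quantile(0.25), df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
    outliers = df[(df[column] < lower) | (df[column] > upper)]
    print(f"{column}: {len(outliers)} outliers")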
Visualization System
1. Distribution Plots
2. Statistical Summary (descriptive statistics)
3. Outlier Analysis
Expected Insights
Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
def load_and_prepare_data():
    housing = fetch_california_housing()
    df = pd.DataFrame(housing.data, columns=housing.feature_names)
    df['PRICE'] = housing.target
    return df
def compute_correlation_matrix(df):
    correlation_matrix = df.corr()
    return correlation_matrix
def plot_correlation_heatmap(correlation_matrix):
    plt.figure(figsize=(12, 10))
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, fmt='.2f')
    plt.title('Correlation Heatmap')
    plt.tight_layout()
    plt.show()
def create_pair_plot(df):
    sns.pairplot(df, diag_kind='kde')
    plt.tight_layout()
    plt.show()
def analyze_correlations(correlation_matrix, threshold=0.5):
    # Keep only the upper triangle to avoid reporting each pair twice
    upper_tri = correlation_matrix.where(
        np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool))
    strong_correlations = []
    for col in upper_tri.columns:
        for row in upper_tri.index:
            value = upper_tri.loc[row, col]
            if pd.notna(value) and abs(value) > threshold:
                strong_correlations.append({
                    'feature_1': row,
                    'feature_2': col,
                    'correlation': value
                })
    return strong_correlations
def main():
    df = load_and_prepare_data()
    correlation_matrix = compute_correlation_matrix(df)
    plot_correlation_heatmap(correlation_matrix)
    create_pair_plot(df)
    strong_correlations = analyze_correlations(correlation_matrix)
    # Print results
    print("\nStrong correlations (|r| > 0.5):")
    for corr in strong_correlations:
        correlation = corr['correlation']
        print(f"{corr['feature_1']} vs {corr['feature_2']}: {correlation:.3f}")

if __name__ == "__main__":
    main()
Output
This code analyzes the California Housing dataset to understand how different
features in houses are related to each other.
+1 means perfect positive correlation (when one goes up, the other goes up)
-1 means perfect negative correlation (when one goes up, the other goes down)
0 means no correlation
The program produces two visual outputs:
1. A Correlation Heatmap
2. A Pair Plot
Function breakdown:
1. Function: load_and_prepare_data()
o Steps: fetch the California Housing data and build a DataFrame that includes the target price column
2. Function: compute_correlation_matrix(df)
o Settings: pairwise Pearson correlation, range -1 to 1
3. Function: plot_correlation_heatmap(correlation_matrix)
o Draws an annotated heatmap of the correlation matrix
4. Function: create_pair_plot(df)
o Uses seaborn's pairplot
o Settings: scatter plots off the diagonal, KDE on the diagonal
5. Function: analyze_correlations(correlation_matrix)
o Steps: keep only the upper triangle and collect feature pairs whose |correlation| exceeds the threshold
6. Function: main()
o Process: load the data, compute the correlation matrix, draw both plots, and print the strong correlations found
7. Output Format
o Visual outputs: correlation heatmap and pair plot
o Text output: list of strongly correlated feature pairs
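A condensed, runnable sketch of the same correlation workflow (simplified from the listing above; the 0.5 cut-off for a "strong" correlation and the 500-row sample for the pair plot are illustrative choices, and the frame's target column is named MedHouseVal rather than PRICE):

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing(as_frame=True)
df = housing.frame  # features plus the MedHouseVal target

# Correlation matrix and heatmap
corr = df.corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0, fmt='.2f')
plt.title('Correlation Heatmap')
plt.tight_layout()
plt.show()

# Pair plot on a sample of rows to keep rendering fast
sns.pairplot(df.sample(500, random_state=42), diag_kind='kde')
plt.show()

# Report strongly correlated pairs from the upper triangle only
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
for col in upper.columns:
    for row in upper.index:
        v = upper.loc[row, col]
        if pd.notna(v) and abs(v) > 0.5:
            print(f"{row} vs {col}: {v:.3f}")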
Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
def load_and_prepare_data():
    iris = load_iris()
    df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
    df['target'] = iris.target
    df['target_names'] = iris.target_names[iris.target]
    return df, iris.feature_names
def perform_pca(data, feature_names):
    # Separate features
    X = data[feature_names]
    # Standardize so each feature contributes equally
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    # Apply PCA and keep the first two principal components
    pca = PCA(n_components=2)
    X_pca = pca.fit_transform(X_scaled)
    explained_variance_ratio = pca.explained_variance_ratio_
    loadings = pca.components_
    return X_pca, pca, explained_variance_ratio, loadings
def plot_pca_results(X_pca, data):
    # Create figure: scatter of the data projected onto the first two components
    plt.figure(figsize=(10, 8))
    targets = sorted(data['target'].unique())
    target_names = sorted(data['target_names'].unique())
    for target, name in zip(targets, target_names):
        mask = data['target'] == target
        plt.scatter(X_pca[mask, 0], X_pca[mask, 1], label=name, alpha=0.7)
    plt.xlabel('First Principal Component')
    plt.ylabel('Second Principal Component')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()
def plot_explained_variance(pca):
    plt.figure(figsize=(10, 6))
    cumsum = np.cumsum(pca.explained_variance_ratio_)
    plt.plot(range(1, len(cumsum) + 1), cumsum, marker='o')
    plt.xlabel('Number of Components')
    plt.ylabel('Cumulative Explained Variance')
    plt.show()
def visualize_feature_importance(loadings, feature_names):
    # Bar charts of how strongly each original feature loads on PC1 and PC2
    plt.figure(figsize=(12, 6))
    plt.subplot(1, 2, 1)
    plt.bar(feature_names, loadings[0])
    plt.title('PC1 Loadings')
    plt.xticks(rotation=45)
    plt.subplot(1, 2, 2)
    plt.bar(feature_names, loadings[1])
    plt.title('PC2 Loadings')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
def main():
    data, feature_names = load_and_prepare_data()
    print("\nPerforming PCA...")
    X_pca, pca, explained_variance_ratio, loadings = perform_pca(data, feature_names)
    print(f"PC1: {explained_variance_ratio[0]:.2%}")
    print(f"PC2: {explained_variance_ratio[1]:.2%}")
    print(f"Total: {sum(explained_variance_ratio):.2%}")
    # Plot results
    print("\nCreating visualizations...")
    plot_pca_results(X_pca, data)
    plot_explained_variance(pca)
    visualize_feature_importance(loadings, feature_names)
    # Report each feature's weight on the first principal component
    for fname, weight in zip(feature_names, loadings[0]):
        print(f"{fname}: {weight:.3f}")

if __name__ == "__main__":
    main()
Output
Basic Theory:
Code Functions:
1. load_and_prepare_data()
o Each row represents one flower with its features and species type
2. perform_pca()
o Returns: the transformed 2-D data, the fitted PCA object, the explained variance ratios, and the component loadings
3. plot_pca_results()
4. plot_explained_variance()
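For reference, a compact self-contained sketch of the same PCA pipeline (standard scikit-learn API; keeping two components as in the listing above):

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)

# Project the 4-dimensional Iris data onto 2 principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print("Explained variance:", pca.explained_variance_ratio_)

# Scatter plot coloured by species
for target, name in enumerate(iris.target_names):
    mask = iris.target == target
    plt.scatter(X_pca[mask, 0], X_pca[mask, 1], label=name, alpha=0.7)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()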
For a given set of training data examples stored in a .CSV file, implement and demonstrate
the Find-S algorithm to output a description of the set of all hypotheses consistent with
the training examples.
Code:
import numpy as np
import pandas as pd
class FindS:
    def __init__(self):
        self.hypothesis = None
        self.features = None

    def initialize_hypothesis(self, n_features):
        """Start with the most specific hypothesis: every attribute set to 'ϕ'."""
        return ['ϕ'] * n_features

    def is_positive_example(self, label):
        """Treat 'Yes' (in any letter case) as a positive example."""
        return str(label).strip().lower() == 'yes'

    def generalize_hypothesis(self, example, hypothesis):
        """Generalize the hypothesis just enough to cover a positive example."""
        new_hypothesis = []
        for ex_val, hyp_val in zip(example, hypothesis):
            if hyp_val == 'ϕ':
                new_hypothesis.append(ex_val)
            elif ex_val == hyp_val:
                new_hypothesis.append(hyp_val)
            else:
                new_hypothesis.append('?')
        return new_hypothesis
"""
Find the most specific hypothesis consistent with the training examples
Parameters:
"""
X = data.drop(columns=[target_column])
y = data[target_column]
self.features = X.columns.tolist()
# Initialize hypothesis
self.hypothesis = self.initialize_hypothesis(len(self.features))
if self.is_positive_example(y[index]):
self.hypothesis = self.generalize_hypothesis(
row.values.tolist(),
self.hypothesis
return self.hypothesis
    def print_hypothesis(self):
        if self.hypothesis is None:
            print("\nModel has not been fitted yet.")
        else:
            print("\nFinal Hypothesis:")
            print("〈", end='')
            print(', '.join(str(v) for v in self.hypothesis), end='')
            print("〉")


def load_data(filename):
    """Load training examples from a CSV file; return None if it cannot be read."""
    try:
        return pd.read_csv(filename)
    except FileNotFoundError:
        print(f"File not found: {filename}")
        return None
    except Exception as e:
        print(f"Error reading {filename}: {e}")
        return None
def main():
    # Illustrative training data (the original sample table is not shown in full)
    sample_data = {
        'Outlook': ['Sunny', 'Sunny', 'Rainy', 'Sunny'],
        'Temperature': ['Warm', 'Warm', 'Cold', 'Warm'],
        'Humidity': ['Normal', 'High', 'High', 'High'],
        'Wind': ['Strong', 'Strong', 'Strong', 'Strong'],
        'PlayTennis': ['Yes', 'Yes', 'No', 'Yes']
    }
    df = pd.DataFrame(sample_data)
    print("\nTraining Data:")
    print(df)
    find_s = FindS()
    find_s.fit(df, target_column='PlayTennis')
    # Print results
    find_s.print_hypothesis()
    print("\nHypothesis Interpretation:")
    print("'?' matches any attribute value; a specific value must match exactly.")

if __name__ == "__main__":
    main()
Output
Explanation
1. Purpose
2. Hypothesis Space
3. Working Principle
4. Generalization Rules
5. Advantages
Computationally efficient
6. Limitations
7. Applications
Pattern recognition
8. Example Scenario
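To make the generalization rules above concrete, a minimal functional sketch (assuming 'Yes' marks a positive example; the tiny table is illustrative):

import pandas as pd

def find_s(df, target='PlayTennis'):
    """Return the most specific hypothesis consistent with the positive examples."""
    X = df.drop(columns=[target]).values
    y = df[target].values
    hypothesis = ['ϕ'] * X.shape[1]
    for example, label in zip(X, y):
        if str(label).lower() != 'yes':
            continue  # Find-S ignores negative examples
        hypothesis = [
            ex if h == 'ϕ' else (h if h == ex else '?')
            for ex, h in zip(example, hypothesis)
        ]
    return hypothesis

# Example usage with a tiny illustrative table
data = pd.DataFrame({
    'Outlook': ['Sunny', 'Sunny', 'Rainy'],
    'Wind': ['Strong', 'Weak', 'Strong'],
    'PlayTennis': ['Yes', 'Yes', 'No'],
})
print(find_s(data))   # ['Sunny', '?']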
Develop a program to implement the k-Nearest Neighbour algorithm to classify 100 randomly
generated values of x in the range [0, 1]:
a. Label the first 50 points {x1, ..., x50} as follows: if (xi ≤ 0.5), then xi ∊ Class1,
else xi ∊ Class2
b. Classify the remaining points, x51, ..., x100 using KNN. Perform this for
k = 1, 2, 3, 4, 5, 20, 30
Code:
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
class KNN:
    def __init__(self, k=3):
        self.k = k
        self.X_train = None
        self.y_train = None

    def fit(self, X, y):
        self.X_train = X
        self.y_train = y

    def predict(self, X):
        predictions = []
        for x in X:
            # Distance of the test point to every training point (1-D data)
            distances = np.abs(self.X_train - x)
            k_nearest_indices = np.argsort(distances)[:self.k]
            k_nearest_labels = self.y_train[k_nearest_indices]
            # Majority vote among the k nearest neighbours
            most_common = Counter(k_nearest_labels).most_common(1)
            predictions.append(most_common[0][0])
        return np.array(predictions)
def generate_data():
    X = np.random.rand(100)
    y = np.zeros(100)
    # Label points: Class 1 if x <= 0.5, else Class 2
    y[X <= 0.5] = 1
    y[X > 0.5] = 2
    return X, y
def plot_results(X_train, y_train, X_test, y_pred, k):
    plt.figure(figsize=(12, 4))
    # Training points (first 50) and predicted classes for the test points
    plt.scatter(X_train, np.zeros_like(X_train), c=y_train, cmap='coolwarm',
                marker='o', label='Training data')
    plt.scatter(X_test, np.ones_like(X_test) * 0.1, c=y_pred, cmap='coolwarm',
                marker='x', label=f'Predictions (k={k})')
    plt.xlabel('x')
    plt.yticks([])
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()
def analyze_boundary_points(X_test, y_pred, k):
    # A boundary lies where the predicted class changes between adjacent sorted points
    boundary_points = []
    for i in range(1, len(y_pred)):
        if y_pred[i] != y_pred[i-1]:
            boundary_points.append(X_test[i])
    print(f"\nDecision boundaries for k={k}:")
    if boundary_points:
        for point in boundary_points:
            print(f"x = {point:.3f}")
    else:
        print("No boundary points found")
def main():
    np.random.seed(42)
    # Generate data
    print("Generating dataset...")
    X, y = generate_data()
    # First 50 points are labelled training data; the rest are classified by KNN
    X_train, y_train = X[:50], y[:50]
    X_test = X[50:]
    sort_idx = np.argsort(X_test)
    X_test = X_test[sort_idx]
    k_values = [1, 2, 3, 4, 5, 20, 30]
    for k in k_values:
        knn = KNN(k=k)
        knn.fit(X_train, y_train)
        # Make predictions
        y_pred = knn.predict(X_test)
        # Plot results
        plot_results(X_train, y_train, X_test, y_pred, k)
        analyze_boundary_points(X_test, y_pred, k)
        class1_pred = np.sum(y_pred == 1)
        print(f"k={k}: {class1_pred} points predicted as Class 1, "
              f"{len(y_pred) - class1_pred} as Class 2")

if __name__ == "__main__":
    main()
1. KNN Classification: for each test point, the classifier finds the k closest training points and takes a
majority vote of their labels (see the sketch after this list).
2. Data Generation: 100 uniform random values of x in [0, 1]; the first 50 are labelled Class 1 (x ≤ 0.5) or Class 2 (x > 0.5), and the remaining 50 are classified by KNN.
3. Visualization Components
analyze_boundary_points function:
o Reports where the predicted class changes along the sorted test points
6. Key Features
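For comparison, a compact sketch of the same experiment (the random seed and the accuracy report are illustrative additions):

import numpy as np
from collections import Counter

rng = np.random.default_rng(42)
X = rng.random(100)                      # 100 points in [0, 1]
y = np.where(X <= 0.5, 1, 2)             # Class 1 if x <= 0.5, else Class 2

X_train, y_train = X[:50], y[:50]        # first 50 labelled points
X_test, y_test = X[50:], y[50:]          # remaining 50 to classify

def knn_predict(x, k):
    """Majority vote among the k nearest training points (1-D distance)."""
    nearest = np.argsort(np.abs(X_train - x))[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

for k in [1, 2, 3, 4, 5, 20, 30]:
    preds = np.array([knn_predict(x, k) for x in X_test])
    acc = np.mean(preds == y_test)
    print(f"k={k:2d}: accuracy on x51..x100 = {acc:.2%}")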
Implement the non-parametric Locally Weighted Regression algorithm in order to fit data
points. Select an appropriate data set for your experiment and draw graphs.
Code:
import numpy as np
import matplotlib.pyplot as plt

def generate_sample_data(n_samples=100, noise=10):
    # Illustrative nonlinear data (the original generating function is not shown)
    X = np.linspace(0, 10, n_samples)
    y = 20 * np.sin(X) + X ** 2 + np.random.normal(0, noise, n_samples)
    return X, y
"""
Parameters:
X : array-like
y : array-like
Target values
x_pred : array-like
tau : float
Returns:
array-like
"""
X = np.ravel(X)
x_pred = np.ravel(x_pred)
y_pred = []
for x in x_pred:
W = np.diag(weights)
# Make prediction
y_pred.append(float(x_aug @ theta))
np.random.seed(42)
X, y = generate_sample_data(n_samples=100, noise=10)
x_pred = np.linspace(X.min(), X.max(), 200)
y_pred = locally_weighted_regression(X, y, x_pred, tau=1.0)
# Plotting
plt.figure(figsize=(12, 6))
plt.scatter(X, y, alpha=0.5, label='Data')
plt.plot(x_pred, y_pred, color='red', linewidth=2, label='LOWESS fit (tau=1.0)')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.grid(True)
plt.show()
Output
Explanation
1. Data Generation: create noisy sample data points for the experiment.
2. Kernel Function: a Gaussian kernel assigns higher weight to training points near the query point and lower weight to distant ones.
3. LOWESS Implementation
Key steps: for each prediction point, compute the kernel weights, solve the weighted least-squares problem for a local line, and evaluate that line at the query point (see the sketch below).
Disadvantages:
Computationally expensive
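The core of the method is the per-query weighted least-squares solve; a minimal sketch (the noisy quadratic test data and the two bandwidth values are illustrative choices):

import numpy as np

def lowess_point(X, y, x0, tau):
    """Predict y at x0 by fitting a line weighted toward points near x0."""
    w = np.exp(-(X - x0) ** 2 / (2 * tau ** 2))     # Gaussian kernel weights
    A = np.column_stack([np.ones_like(X), X])       # bias + slope design matrix
    W = np.diag(w)
    theta = np.linalg.pinv(A.T @ W @ A) @ A.T @ W @ y
    return theta[0] + theta[1] * x0

# Tiny illustration: noisy quadratic, predictions at a few query points
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 60)
y = X ** 2 + rng.normal(0, 0.3, X.size)
for x0 in [-2.0, 0.0, 2.0]:
    print(f"x0={x0:+.1f}: small tau -> {lowess_point(X, y, x0, 0.3):.2f}, "
          f"large tau -> {lowess_point(X, y, x0, 3.0):.2f}")

A small tau tracks the local curvature closely, while a large tau over-smooths toward a single global line.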
Code:
import pandas as pd

url = "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"
boston_df = pd.read_csv(url)
print(boston_df.columns.tolist())
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

warnings.filterwarnings('ignore')

print("Linear Regression - Boston Housing Dataset")
print("-" * 50)
url = "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"
boston_df = pd.read_csv(url)
# 'medv' (median home value) is the regression target
X_boston = boston_df.drop(columns=['medv'])
y_boston = boston_df['medv']
print(f"Number of samples: {len(X_boston)}")
print("\nFeatures:")
for name in X_boston.columns:
    print(f"- {name}")
# Train/test split and feature scaling
X_train_boston, X_test_boston, y_train_boston, y_test_boston = train_test_split(
    X_boston, y_boston, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_boston_scaled = scaler.fit_transform(X_train_boston)
X_test_boston_scaled = scaler.transform(X_test_boston)
lr_model = LinearRegression()
lr_model.fit(X_train_boston_scaled, y_train_boston)
# Make predictions
y_pred_boston = lr_model.predict(X_test_boston_scaled)
# Calculate metrics
mse_boston = mean_squared_error(y_test_boston, y_pred_boston)
rmse_boston = np.sqrt(mse_boston)
r2_boston = r2_score(y_test_boston, y_pred_boston)
print(f"\nRMSE: {rmse_boston:.3f}")
print(f"R^2: {r2_boston:.3f}")
feature_importance = pd.DataFrame({
    'Feature': X_boston.columns,
    'Coefficient': lr_model.coef_
})
# Sort by the magnitude of each coefficient
feature_importance['Abs_Coefficient'] = feature_importance['Coefficient'].abs()
feature_importance = feature_importance.sort_values('Abs_Coefficient', ascending=False)
print("\nFeature Importance:")
print(feature_importance[['Feature', 'Coefficient']].to_string(index=False))
plt.figure(figsize=(12, 6))
plt.bar(feature_importance['Feature'], feature_importance['Coefficient'])
plt.xticks(rotation=45)
plt.xlabel('Features')
plt.ylabel('Coefficient Value')
plt.tight_layout()
plt.show()
plt.figure(figsize=(10, 6))
# Actual vs. predicted prices
plt.scatter(y_test_boston, y_pred_boston, alpha=0.6)
plt.plot([y_test_boston.min(), y_test_boston.max()],
         [y_test_boston.min(), y_test_boston.max()], 'r--')
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.tight_layout()
plt.show()
print("-" * 50)
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-
mpg.data'
df = df.replace('?', np.nan)
df['Horsepower'] = df['Horsepower'].astype(float)
regression X_mpg =
df[['Horsepower']].values y_mpg =
df['MPG'].values
scaler_mpg = StandardScaler()
X_mpg_scaled = scaler_mpg.fit_transform(X_mpg)
X_train_mpg, X_test_mpg, y_train_mpg, y_test_mpg = train_test_split(
    X_mpg_scaled, y_mpg, test_size=0.2, random_state=42)

degrees = [1, 2, 3]
plt.figure(figsize=(15, 5))
for i, degree in enumerate(degrees, 1):
    # Expand horsepower into polynomial terms of the given degree
    poly_features = PolynomialFeatures(degree=degree)
    X_train_poly = poly_features.fit_transform(X_train_mpg)
    X_test_poly = poly_features.transform(X_test_mpg)
    # Train model
    poly_model = LinearRegression()
    poly_model.fit(X_train_poly, y_train_mpg)
    # Make predictions
    y_pred_poly = poly_model.predict(X_test_poly)
    # Calculate metrics
    mse_poly = mean_squared_error(y_test_mpg, y_pred_poly)
    rmse_poly = np.sqrt(mse_poly)
    print(f"Degree {degree}: RMSE = {rmse_poly:.3f}")
    # Plot results
    plt.subplot(1, 3, i)
    plt.scatter(X_test_mpg, y_test_mpg, alpha=0.5, label='Actual')
    X_sort = np.sort(X_test_mpg, axis=0)
    X_sort_poly = poly_features.transform(X_sort)
    y_sort_pred = poly_model.predict(X_sort_poly)
    plt.plot(X_sort, y_sort_pred, color='red', label=f'Degree {degree} fit')
    plt.title(f'Polynomial Degree {degree}')
    plt.xlabel('Horsepower (scaled)')
    plt.ylabel('MPG')
    plt.legend()
plt.tight_layout()
plt.show()
Key Components:
1. Data Preparation: load the Boston Housing and Auto MPG datasets, handle missing values ('?' entries), and standardize the features.
2. Model Training: fit a LinearRegression model on the scaled Boston features, and polynomial models of degree 1-3 on horsepower for the MPG data.
3. Evaluation: compare models using RMSE, inspect coefficients, and plot the fitted curves.
Key Visualizations: coefficient bar chart, actual-vs-predicted scatter, and polynomial fits of increasing degree (a standalone sketch follows below).
Key Insights:
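A minimal sketch of the polynomial-regression idea on its own (synthetic horsepower/MPG-style data; the coefficients and noise level are illustrative):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic curve: an MPG-like response that falls off nonlinearly with horsepower
rng = np.random.default_rng(0)
hp = rng.uniform(50, 230, 200).reshape(-1, 1)
mpg = 50 - 0.2 * hp[:, 0] + 0.0004 * hp[:, 0] ** 2 + rng.normal(0, 2, 200)

for degree in [1, 2, 3]:
    X_poly = PolynomialFeatures(degree=degree).fit_transform(hp)
    model = LinearRegression().fit(X_poly, mpg)
    pred = model.predict(X_poly)
    rmse = np.sqrt(mean_squared_error(mpg, pred))
    print(f"degree {degree}: training RMSE = {rmse:.2f}")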
Develop a program to demonstrate the working of the decision tree algorithm. Use the Breast
Cancer Data set for building the decision tree and apply this knowledge to classify a new
sample.
Code:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import classification_report, confusion_matrix

data = load_breast_cancer()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train, y_train)
y_pred = dt_classifier.predict(X_test)
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))
# Confusion matrix heatmap
plt.figure(figsize=(10, 8))
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()
# Visualize the fitted tree
plt.figure(figsize=(20, 10))
plot_tree(dt_classifier, feature_names=data.feature_names,
          class_names=data.target_names, filled=True, max_depth=3)
plt.show()
"""
Parameters:
sample (list or array): List of feature values in the same order as the training
data
Returns:
prediction = dt_classifier.predict(sample)
probability = dt_classifier.predict_proba(sample)
print("\nClassification Results:")
print(f"{feature}: {importance:.4f}")
nExample Classification:")
classify_new_sample(example_sample)
Output
2. Model Configuration
o random_state=42 (reproducibility)
4. Visualization Elements:
Provides:
6. Key Features:
Probability Estimates
7. Use Cases:
Risk Assessment
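A condensed end-to-end sketch of the same workflow (the "new" sample here reuses a row from the test split, since a genuinely new patient record is not available):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

# Fit the tree and check hold-out accuracy
clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))

# Classify a "new" sample (here: the first test row) with class probabilities
sample = X_test[:1]
print("Predicted class:", data.target_names[clf.predict(sample)[0]])
print("Probabilities:", clf.predict_proba(sample)[0])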
Develop a program to implement the Naive Bayesian classifier considering Olivetti Face
Data set for training.
Compute the accuracy of the classifier, considering a few test data sets.
Code:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_olivetti_faces
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

faces = fetch_olivetti_faces()
X = faces.data
y = faces.target
def display_sample_faces(X, y, n_faces=10):
    # Show the first few face images with their person IDs
    fig, axes = plt.subplots(1, n_faces, figsize=(15, 3))
    for i, ax in enumerate(axes):
        ax.imshow(X[i].reshape(64, 64), cmap='gray')
        ax.set_title(f'Person {y[i]}')
        ax.axis('off')
    plt.tight_layout()
    plt.show()
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
nb_classifier = GaussianNB()
nb_classifier.fit(X_train, y_train)
# Make predictions
y_pred = nb_classifier.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
# Perform cross-validation
cv_scores = cross_val_score(nb_classifier, X, y, cv=5)
print("\nCross-validation scores:")
print(cv_scores, f"mean = {cv_scores.mean():.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
plt.figure(figsize=(12, 8))
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, cmap='Blues')
plt.title('Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()
def display_predictions(classifier, X_test, y_test, num_samples=5):
    # Pick a few random test faces and compare predicted vs. true person IDs
    indices = np.random.choice(len(X_test), num_samples, replace=False)
    X_samples = X_test[indices]
    y_true = y_test[indices]
    # Make predictions
    y_pred = classifier.predict(X_samples)
    probabilities = classifier.predict_proba(X_samples)
    # Display results
    fig, axes = plt.subplots(2, num_samples, figsize=(3 * num_samples, 6))
    for i in range(num_samples):
        axes[0, i].imshow(X_samples[i].reshape(64, 64), cmap='gray')
        axes[0, i].set_title(f'True: {y_true[i]}')
        axes[0, i].axis('off')
        axes[1, i].axis('off')
        axes[1, i].text(0.5, 0.5,
                        f'Pred: {y_pred[i]}\nConf: {probabilities[i].max():.2f}',
                        ha='center', va='center')
        if y_true[i] == y_pred[i]:
            axes[1, i].set_title('Correct', color='green')
        else:
            axes[1, i].set_title('Wrong', color='red')
    plt.tight_layout()
    plt.show()
def analyze_misclassifications(classifier, X_test, y_test, max_display=5):
    # Show up to max_display faces the classifier got wrong
    y_pred = classifier.predict(X_test)
    wrong = np.where(y_pred != y_test)[0][:max_display]
    num_display = len(wrong)
    print("\nAnalyzing misclassifications:")
    if num_display > 0:
        fig, axes = plt.subplots(1, num_display, figsize=(3 * num_display, 3))
        for i, idx in enumerate(wrong):
            ax = axes if num_display == 1 else axes[i]
            ax.imshow(X_test[idx].reshape(64, 64), cmap='gray')
            ax.set_title(f'True {y_test[idx]} / Pred {y_pred[idx]}')
            ax.axis('off')
        plt.tight_layout()
        plt.show()

# Run the full workflow
display_sample_faces(X, y)
display_predictions(nb_classifier, X_test, y_test)
analyze_misclassifications(nb_classifier, X_test, y_test)
Output
2. Key Functions:
c) Analyze Misclassifications:
3. Model Implementation
4. Performance Evaluation:
Misclassification analysis
5. Visualization Components:
Misclassified examples
6. Key Features:
Probability estimation
Error analysis
Cross-validation performance
7. Notable Aspects:
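A minimal sketch of the core train/evaluate loop (GaussianNB on the flattened 64×64 face vectors; the 30% stratified test split is an illustrative choice):

from sklearn.datasets import fetch_olivetti_faces
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

faces = fetch_olivetti_faces()
X_train, X_test, y_train, y_test = train_test_split(
    faces.data, faces.target, test_size=0.3, random_state=42, stratify=faces.target)

# Gaussian Naive Bayes treats each of the 4096 pixel intensities as an independent feature
model = GaussianNB().fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("5-fold CV accuracy:", cross_val_score(model, faces.data, faces.target, cv=5).mean())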
Develop a program to implement k-means clustering using the Wisconsin Breast Cancer data
set and visualize the clustering result.
Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

data = load_breast_cancer()
X = data.data
y = data.target
df = pd.DataFrame(X, columns=data.feature_names)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
def plot_elbow_curve(X, max_k=10):
    k_values = range(2, max_k + 1)
    inertias = []
    silhouette_scores = []
    for k in k_values:
        kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
        kmeans.fit(X)
        inertias.append(kmeans.inertia_)
        silhouette_scores.append(silhouette_score(X, kmeans.labels_))
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
    ax1.plot(list(k_values), inertias, marker='o')
    ax1.set_xlabel('Number of clusters (k)')
    ax1.set_ylabel('Inertia')
    ax1.set_title('Elbow Method')
    ax2.plot(list(k_values), silhouette_scores, marker='o')
    ax2.set_xlabel('Number of clusters (k)')
    ax2.set_ylabel('Silhouette Score')
    ax2.set_title('Silhouette Analysis')
    plt.tight_layout()
    plt.show()
    # Choose the k with the highest silhouette score
    return list(k_values)[np.argmax(silhouette_scores)]
# Find optimal k
optimal_k = plot_elbow_curve(X_scaled)
print(f"Optimal number of clusters: {optimal_k}")
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(X_scaled)
# Project to 2-D with PCA for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
plt.figure(figsize=(12, 8))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=cluster_labels, cmap='viridis', alpha=0.7)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.colorbar(scatter, label='Cluster')
plt.show()
# Compare cluster assignments with the actual diagnosis labels
comparison_df = pd.DataFrame({
    'Cluster': cluster_labels,
    'Actual_Diagnosis': y
})
print(pd.crosstab(comparison_df['Cluster'], comparison_df['Actual_Diagnosis']))
def analyze_clusters(df, labels):
    # Mean feature values per cluster, shown as a heatmap
    df_analysis = df.copy()
    df_analysis['Cluster'] = labels
    cluster_means = df_analysis.groupby('Cluster').mean()
    plt.figure(figsize=(15, 8))
    sns.heatmap(cluster_means, cmap='coolwarm',
                xticklabels=True, yticklabels=True)
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
    return cluster_means

cluster_means = analyze_clusters(df, cluster_labels)
def get_feature_importance(kmeans, feature_names):
    # Features whose centroids differ most between clusters matter most for the split
    centroid_variance = np.var(kmeans.cluster_centers_, axis=0)
    feature_importance = pd.DataFrame({
        'Feature': feature_names,
        'Importance': centroid_variance
    }).sort_values('Importance', ascending=False)
    plt.figure(figsize=(12, 6))
    plt.bar(feature_importance['Feature'], feature_importance['Importance'])
    plt.xticks(rotation=90)
    plt.tight_layout()
    plt.show()
    return feature_importance

feature_importance = get_feature_importance(kmeans, data.feature_names)
def predict_cluster(sample):
    # Scale a new sample and assign it to the nearest k-means cluster
    if isinstance(sample, list):
        sample = np.array(sample).reshape(1, -1)
    sample_scaled = scaler.transform(sample)
    # Predict cluster
    cluster = kmeans.predict(sample_scaled)[0]
    distances = kmeans.transform(sample_scaled)[0]
    print(f"Assigned cluster: {cluster}, distances to centres: {distances}")
    return cluster
Output
1. Data Preparation:
2. Key Functions:
b) Cluster Analysis:
c) Feature Importance:
3. Visualization Components:
4. Model Implementation:
5. Cluster Prediction:
6. Key Features:
7. Analysis Components:
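A minimal sketch of the clustering core (fixing k=2 to mirror the two diagnosis classes; the agreement check below is an illustrative addition, since k-means cluster numbers may be swapped relative to the true labels):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

data = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data.data)

# k=2 matches the two diagnosis classes (malignant / benign)
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
labels = kmeans.fit_predict(X_scaled)

print("Silhouette score:", silhouette_score(X_scaled, labels))
# Agreement with the true diagnosis, taking the better of the two label mappings
agreement = max(np.mean(labels == data.target), np.mean(labels != data.target))
print(f"Cluster/diagnosis agreement: {agreement:.2%}")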