
Lab Manual

BCSL606 - Machine Learning Laboratory
2022 Scheme

Even Semester 2024-2025

Dr. G. Prema Arokia Mary


Associate Professor
CSE



Laboratory Components

1. Histograms and Boxplots Analysis (California Housing)

2. Correlation Matrix and Pair Plot (California Housing)

3. PCA Dimensionality Reduction (Iris Dataset)

4. Find-S Algorithm for Hypothesis Generation

5. k-Nearest Neighbors Classification (Generated Data)

6. Locally Weighted Regression Algorithm

7. Linear and Polynomial Regression (Boston Housing & Auto MPG)

8. Decision Tree Classifier (Breast Cancer Dataset)

9. Naive Bayes Classifier (Olivetti Face Dataset)

10. K-Means Clustering (Breast Cancer Dataset)



Experiment-01

Develop a program to create histograms for all numerical features and analyze the
distribution of each feature. Generate box plots for all numerical features and identify
any outliers. Use California Housing dataset.

Code:

!pip install pandas numpy matplotlib seaborn scikit-learn

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing

# Set the style for better visualization
plt.style.use('tableau-colorblind10')  # Using a built-in matplotlib style

def load_and_prepare_data():
    """Load California Housing dataset and convert to pandas DataFrame"""
    housing = fetch_california_housing()
    df = pd.DataFrame(housing.data, columns=housing.feature_names)
    df['PRICE'] = housing.target
    return df

def create_distribution_plots(df, save_plots=False):
    """Create histograms and box plots for all numerical features"""
    numerical_features = df.columns

    # Calculate number of rows needed for subplot grid
    n_features = len(numerical_features)
    n_rows = (n_features + 1) // 2  # 2 plots per row

    # Create histograms
    plt.figure(figsize=(15, 5 * n_rows))
    for idx, feature in enumerate(numerical_features, 1):
        plt.subplot(n_rows, 2, idx)
        sns.histplot(data=df, x=feature, kde=True)
        plt.title(f'Distribution of {feature}')
        plt.xlabel(feature)
        plt.ylabel('Count')
    plt.tight_layout()
    if save_plots:
        plt.savefig('histograms.png')
    plt.show()

    # Create box plots
    plt.figure(figsize=(15, 5 * n_rows))
    for idx, feature in enumerate(numerical_features, 1):
        plt.subplot(n_rows, 2, idx)
        sns.boxplot(data=df[feature])
        plt.title(f'Box Plot of {feature}')
    plt.tight_layout()
    if save_plots:
        plt.savefig('boxplots.png')
    plt.show()

def analyze_distributions(df):
    """Generate statistical summary and identify outliers"""
    stats_summary = df.describe()

    # Calculate IQR and identify outliers for each feature
    outlier_summary = {}
    for column in df.columns:
        Q1 = df[column].quantile(0.25)
        Q3 = df[column].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)][column]
        outlier_summary[column] = {
            'number_of_outliers': len(outliers),
            'percentage_of_outliers': (len(outliers) / len(df)) * 100,
            'outlier_range': f"< {lower_bound:.2f} or > {upper_bound:.2f}"
        }
    return stats_summary, outlier_summary

def main():
    # Load the data
    df = load_and_prepare_data()

    # Create visualization plots
    create_distribution_plots(df)

    # Analyze distributions and outliers
    stats_summary, outlier_summary = analyze_distributions(df)

    # Print statistical summary
    print("\nStatistical Summary:")
    print(stats_summary)

    # Print outlier analysis
    print("\nOutlier Analysis:")
    for feature, summary in outlier_summary.items():
        print(f"\n{feature}:")
        print(f"Number of outliers: {summary['number_of_outliers']}")
        print(f"Percentage of outliers: {summary['percentage_of_outliers']:.2f}%")
        print(f"Outlier range: {summary['outlier_range']}")

if __name__ == "__main__":
    main()

Output

Explanation

Understanding California Housing Data Analysis

Introduction

The code performs an exploratory data analysis (EDA) on California housing data. EDA is a
crucial first step in understanding your dataset before performing any advanced analysis or
modeling. This analysis focuses on understanding the distribution of housing features and
prices across California.

Theory Behind Each Component

Data Loading and Preparation

The California Housing dataset is a standard dataset in scikit-learn containing housing prices and related features. The data preparation step converts this into a pandas DataFrame, which is a table-like structure where:

 Each row represents a different location in California

 Each column represents a different feature (like house price, income, population)

 The target variable (house price) is added as an additional column
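
As a quick, minimal sketch of that structure (using only calls already shown in the code above), the DataFrame can be inspected directly after loading:

from sklearn.datasets import fetch_california_housing
import pandas as pd

housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['PRICE'] = housing.target

print(df.shape)                # (20640, 9): 20640 locations, 8 features plus PRICE
print(df.columns.tolist())     # MedInc, HouseAge, AveRooms, ..., PRICE
print(df.head())               # first few rows, one California block group per row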

Distribution Analysis

The code analyzes distributions through two main approaches:

1. Visual Analysis The distribution plots help understand how values are spread across
each feature:

o Histograms show the frequency distribution of values, revealing if data is


normally distributed, skewed, or has multiple peaks



o Kernel Density Estimation (KDE) smooths the histogram to show the
continuous probability distribution

o Box plots reveal the median, quartiles, and potential outliers in the data

2. Statistical Analysis The code calculates key statistical measures:

o Descriptive statistics (mean, median, standard deviation) summarize central


tendency and spread

o Interquartile Range (IQR) measures variability by finding the range between


the 25th and 75th percentiles

o Outlier detection uses the 1.5 × IQR rule: any point beyond 1.5 times the IQR
from the quartiles is considered an outlier

Visualization System

The visualization system uses matplotlib and seaborn libraries because:

 Matplotlib provides the foundation for creating plots

 Seaborn adds statistical plotting functions and improves plot aesthetics

 The tableau-colorblind10 style ensures accessibility and professional appearance

Statistical Methods Used

1. Descriptive Statistics

o Mean: Average value of each feature

o Standard deviation: Measure of data spread

o Quartiles: Values that divide data into four equal parts

o Min/Max: Range of values for each feature

2. Outlier Detection The IQR method is used because:

o It's resistant to extreme values

o Doesn't assume normal distribution



o Identifies values that are unusually high or low

o Formula: [Q1 - 1.5×IQR, Q3 + 1.5×IQR] defines the normal range
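
As a worked example of the rule: if a feature has Q1 = 2.0 and Q3 = 4.0, then IQR = 2.0, so the normal range is [2.0 - 3.0, 4.0 + 3.0] = [-1.0, 7.0], and any value outside it is flagged. A minimal pandas sketch of the same calculation for one column (assuming the df built by load_and_prepare_data() and using MedInc purely as an illustration):

Q1 = df['MedInc'].quantile(0.25)
Q3 = df['MedInc'].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

# Values outside [lower, upper] are treated as outliers
outliers = df[(df['MedInc'] < lower) | (df['MedInc'] > upper)]['MedInc']
print(len(outliers), "outliers outside", (round(lower, 2), round(upper, 2)))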

Significance of Each Feature

The dataset includes these meaningful features:

 Median Income: Indicates area's economic status

 House Age: Represents property age

 Average Rooms/Bedrooms: Indicates house size

 Population and Occupancy: Shows area density

 Location (Latitude/Longitude): Captures geographical factors

 Price: Target variable showing house values

Purpose of Analysis Components

1. Distribution Plots

o Help identify patterns in data

o Show if variables are normally distributed

o Reveal potential data quality issues

o Highlight relationships between features

2. Statistical Summary

o Provides numerical understanding of data

o Helps identify unusual patterns

o Supports data-driven decisions

o Validates visual observations

3. Outlier Analysis



o Identifies unusual cases

o Helps understand extreme values

o Supports data cleaning decisions

o Reveals potential data errors

Expected Insights

This analysis helps understand:

 Typical housing prices in California

 How features vary across locations

 Unusual patterns or anomalies

 Relationships between features

 Data quality and reliability

The combination of visual and statistical analysis provides a comprehensive understanding of


California's housing market characteristics, essential for further modeling or decision-making
processes.



Experiment-02

Develop a program to compute the correlation matrix to understand the relationships between pairs of features. Visualize the correlation matrix using a heatmap to know which variables have strong positive/negative correlations. Create a pair plot to visualize pairwise relationships between features. Use California Housing dataset.

Code:

!pip install pandas numpy matplotlib seaborn scikit-learn

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing

def load_and_prepare_data():
    """Load California Housing dataset and convert to pandas DataFrame"""
    housing = fetch_california_housing()
    df = pd.DataFrame(housing.data, columns=housing.feature_names)
    df['PRICE'] = housing.target
    return df

def compute_correlation_matrix(df):
    """Compute and return the correlation matrix"""
    correlation_matrix = df.corr()
    return correlation_matrix

def plot_correlation_heatmap(correlation_matrix):
    """Create a heatmap visualization of the correlation matrix"""
    plt.figure(figsize=(12, 10))
    # Create heatmap with correlation values
    sns.heatmap(correlation_matrix,
                annot=True,        # Show correlation values
                cmap='coolwarm',   # Red for positive, blue for negative correlations
                vmin=-1, vmax=1,   # Fix the range of correlation values
                center=0,          # Center the colormap at 0
                square=True,       # Make the plot square-shaped
                fmt='.2f')         # Round correlation values to 2 decimal places
    plt.title('Correlation Matrix Heatmap')
    plt.tight_layout()
    plt.show()

def create_pair_plot(df):
    """Create a pair plot to show relationships between all features"""
    sns.pairplot(df, diag_kind='kde', plot_kws={'alpha': 0.6})
    plt.tight_layout()
    plt.show()

def analyze_correlations(correlation_matrix):
    """Analyze and print notable correlations"""
    # Get upper triangle of the correlation matrix
    upper_tri = correlation_matrix.where(
        np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool))

    # Find strong correlations (absolute value > 0.5)
    strong_correlations = []
    for col in upper_tri.columns:
        for idx, value in upper_tri[col].items():
            if value is not None and abs(value) > 0.5:
                strong_correlations.append({
                    'features': (idx, col),
                    'correlation': value
                })

    # Sort by absolute correlation value
    strong_correlations.sort(key=lambda x: abs(x['correlation']), reverse=True)
    return strong_correlations

def main():
    # Load the data
    print("Loading California Housing dataset...")
    df = load_and_prepare_data()

    # Compute correlation matrix
    print("\nComputing correlation matrix...")
    correlation_matrix = compute_correlation_matrix(df)

    # Plot correlation heatmap
    print("\nCreating correlation heatmap...")
    plot_correlation_heatmap(correlation_matrix)

    # Create pair plot
    print("\nCreating pair plot (this may take a moment)...")
    create_pair_plot(df)

    # Analyze and print notable correlations
    print("\nAnalyzing strong correlations...")
    strong_correlations = analyze_correlations(correlation_matrix)

    # Print results
    print("\nStrong correlations found (|correlation| > 0.5):")
    for corr in strong_correlations:
        feature1, feature2 = corr['features']
        correlation = corr['correlation']
        correlation_type = "positive" if correlation > 0 else "negative"
        print(f"{feature1} and {feature2}: {correlation:.3f} ({correlation_type} correlation)")

if __name__ == "__main__":
    main()

Output



Explanation

This code analyzes the California Housing dataset to understand how different
features in houses are related to each other.



The main purpose is to find correlations between different housing features. A
correlation shows how strongly two features are related. For example, it can tell
us if house prices tend to go up when the number of rooms increases.

Correlation values range from -1 to +1:

 +1 means perfect positive correlation (when one goes up, the other goes
up)

 0 means no correlation (no relationship)

 -1 means perfect negative correlation (when one goes up, the other goes
down)
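
As a minimal illustration (assuming the df returned by load_and_prepare_data() above), a single Pearson correlation can be computed directly with pandas before building the full matrix:

# Correlation between median income and house price (a strongly positive pair)
r = df['MedInc'].corr(df['PRICE'])
print(f"corr(MedInc, PRICE) = {r:.2f}")   # roughly +0.69 for this dataset

# The full matrix that the heatmap visualizes
print(df.corr().round(2))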

The code creates two main visualizations:

1. A Correlation Heatmap:

 Shows all correlations in a color-coded matrix

 Red colors show positive correlations

 Blue colors show negative correlations

 Darker colors mean stronger relationships

 Numbers in each cell show the exact correlation value

2. A Pair Plot:

 Shows scatter plots for every pair of features

 Helps visualize relationships between variables

 Shows distribution of each feature on the diagonal



The code also automatically finds strong correlations (values above 0.5 or below
-0.5) and prints them, telling you which features are strongly related and whether
the relationship is positive or negative.

This analysis helps understand patterns in the housing market, like:

 Which features most strongly affect house prices

 Which features tend to occur together

 Whether features have expected or surprising relationships

1. Function: load_and_prepare_data()

o Purpose: Loads California Housing dataset

o Steps:

 Fetches data using sklearn's fetch_california_housing()

 Converts to pandas DataFrame

 Adds house prices as a target column

 Returns complete dataset

2. Function: compute_correlation_matrix(df)

o Purpose: Calculates correlations between all features

o Uses pandas' df.corr() to compute Pearson correlation coefficients

o Returns a matrix where values range from -1 to 1

 1: Perfect positive correlation

 0: No correlation

 -1: Perfect negative correlation



3. Function: plot_correlation_heatmap(correlation_matrix)

o Purpose: Creates visual heatmap of correlations

o Settings:

 Figure size: 12x10

 Shows actual correlation values (annot=True)

 Uses coolwarm color scheme (red=positive, blue=negative)

 Range: -1 to 1

 Formats numbers to 2 decimal places

4. Function: create_pair_plot(df)

o Purpose: Shows relationships between all pairs of features

o Uses seaborn's pairplot

o Settings:

 Diagonal: Kernel Density Estimation (kde)

 Alpha: 0.6 for transparency

 Shows scatter plots for all feature combinations

5. Function: analyze_correlations(correlation_matrix)

o Purpose: Identifies strong correlations

o Steps:

 Gets upper triangle of correlation matrix

 Finds correlations stronger than ±0.5



 Sorts results by correlation strength

 Returns list of strong correlations
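
The upper-triangle step is needed because a correlation matrix is symmetric, so every pair would otherwise be reported twice. A minimal sketch of the masking idea (again assuming df from load_and_prepare_data()):

import numpy as np

corr = df.corr()
# Keep only values strictly above the diagonal; everything else becomes NaN
mask = np.triu(np.ones(corr.shape), k=1).astype(bool)
upper = corr.where(mask)

# Any remaining value with |r| > 0.5 counts as a strong correlation
strong = upper.stack()
strong = strong[strong.abs() > 0.5]
print(strong.sort_values(key=abs, ascending=False))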

6. Function: main()

o Purpose: Orchestrates the analysis workflow

o Process:

1. Loads housing data

2. Computes correlation matrix

3. Creates heatmap visualization

4. Generates pair plot

5. Analyzes strong correlations

6. Prints findings

7. Output Format

o Visual outputs:

 Correlation heatmap

 Pair plot matrix

o Text output:

 Lists strong correlations

 Shows correlation strength

 Indicates if correlation is positive/negative



Experiment-03

Develop a program to implement Principal Component Analysis (PCA) for reducing the dimensionality of the Iris dataset from 4 features to 2.

Code:

!pip install pandas numpy matplotlib scikit-learn

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

def load_and_prepare_data():
    """Load Iris dataset and prepare it for PCA"""
    # Load the iris dataset
    iris = load_iris()

    # Create a DataFrame with feature names
    df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

    # Add target variable
    df['target'] = iris.target
    df['target_names'] = pd.Categorical.from_codes(iris.target, iris.target_names)
    return df, iris.feature_names

def perform_pca(data, feature_names):
    """Perform PCA on the dataset"""
    # Separate features
    X = data[feature_names]

    # Standardize the features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Apply PCA
    pca = PCA(n_components=2)
    X_pca = pca.fit_transform(X_scaled)

    # Calculate explained variance ratio
    explained_variance_ratio = pca.explained_variance_ratio_

    # Get component loadings
    loadings = pca.components_

    return X_pca, explained_variance_ratio, loadings, pca

def plot_pca_results(X_pca, data, explained_variance_ratio):
    """Plot the PCA results"""
    plt.figure(figsize=(10, 8))

    # Create scatter plot for each class
    targets = sorted(data['target'].unique())
    target_names = sorted(data['target_names'].unique())
    for target, target_name in zip(targets, target_names):
        mask = data['target'] == target
        plt.scatter(X_pca[mask, 0], X_pca[mask, 1], label=target_name, alpha=0.8)

    # Add labels and title
    plt.xlabel(f'First Principal Component (Explains {explained_variance_ratio[0]:.2%} of variance)')
    plt.ylabel(f'Second Principal Component (Explains {explained_variance_ratio[1]:.2%} of variance)')
    plt.title('PCA of Iris Dataset')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

def plot_explained_variance(pca):
    """Plot cumulative explained variance ratio"""
    plt.figure(figsize=(10, 6))
    cumsum = np.cumsum(pca.explained_variance_ratio_)
    plt.plot(range(1, len(cumsum) + 1), cumsum, 'bo-')
    plt.xlabel('Number of Components')
    plt.ylabel('Cumulative Explained Variance Ratio')
    plt.title('Explained Variance vs. Number of Components')
    plt.grid(True, alpha=0.3)
    plt.show()

def visualize_feature_importance(loadings, feature_names):
    """Visualize feature importance in each principal component"""
    plt.figure(figsize=(12, 6))

    # Plot for PC1
    plt.subplot(1, 2, 1)
    plt.bar(feature_names, loadings[0])
    plt.title('Feature Weights in First Principal Component')
    plt.xticks(rotation=45)

    # Plot for PC2
    plt.subplot(1, 2, 2)
    plt.bar(feature_names, loadings[1])
    plt.title('Feature Weights in Second Principal Component')
    plt.xticks(rotation=45)

    plt.tight_layout()
    plt.show()

def main():
    # Load and prepare data
    print("Loading Iris dataset...")
    data, feature_names = load_and_prepare_data()

    # Perform PCA
    print("\nPerforming PCA...")
    X_pca, explained_variance_ratio, loadings, pca = perform_pca(data, feature_names)

    # Print explained variance
    print("\nExplained Variance Ratio:")
    print(f"PC1: {explained_variance_ratio[0]:.2%}")
    print(f"PC2: {explained_variance_ratio[1]:.2%}")
    print(f"Total: {sum(explained_variance_ratio):.2%}")

    # Plot results
    print("\nCreating visualizations...")
    plot_pca_results(X_pca, data, explained_variance_ratio)
    plot_explained_variance(pca)
    visualize_feature_importance(loadings, feature_names)

    # Print feature importance
    print("\nFeature Weights in Principal Components:")
    for i, component in enumerate(loadings):
        print(f"\nPrincipal Component {i+1}:")
        for fname, weight in zip(feature_names, component):
            print(f"{fname}: {weight:.3f}")

if __name__ == "__main__":
    main()

Output

Explanation

Basic Theory:

PCA is a technique that reduces the dimensionality of data while preserving as


much important information as possible. It transforms high-dimensional data
into a new set of features called principal components.
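
In scikit-learn terms this is a single fit/transform step; a minimal standalone sketch (separate from the full program above) looks like this:

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = load_iris().data                      # 150 samples x 4 features
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)        # 150 samples x 2 components

print(X_2d.shape)                         # (150, 2)
print(pca.explained_variance_ratio_)      # roughly [0.73, 0.23] for the scaled Iris data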

Code Functions:

1. load_and_prepare_data()



o Loads the famous Iris dataset (contains measurements of
different iris flowers)

o Creates a DataFrame with flower measurements and their


species names

o Each row represents one flower with its features and species type

2. perform_pca()

o Standardizes the data (makes all features have same scale)

o Applies PCA to reduce data to 2 dimensions

o Returns:

 Transformed data

 How much information each component preserves

 Feature weights in each component

3. plot_pca_results()

o Creates a scatter plot showing flowers in the new 2D space

o Different colors for different iris species

o Shows how well species are separated after PCA

o Labels show how much variance each component explains

4. plot_explained_variance()

o Shows how much total information is preserved as we add


components

o Helps decide how many components to keep



5. visualize_feature_importance()

o Creates bar plots showing which original features contribute


most to each principal component

o Helps understand what each new component means

What the Code Does:

1. Takes 4-dimensional iris flower measurements

2. Reduces them to 2 dimensions while keeping most important patterns

3. Shows how well different iris species can be distinguished

4. Tells us which original measurements are most important

Why This is Useful:

 Helps visualize high-dimensional data

 Finds most important patterns in the data

 Shows which original features matter most

 Can help classify different types of iris flowers using


fewer measurements



Experiment-04

For a given set of training data examples stored in a .CSV file, implement and demonstrate the Find-S algorithm to output a description of the set of all hypotheses consistent with the training examples.

Code:

import pandas as pd
import numpy as np

class FindS:
    def __init__(self):
        self.hypothesis = None
        self.features = None

    def initialize_hypothesis(self, num_features):
        """Initialize the most specific hypothesis"""
        return ['ϕ'] * num_features

    def is_positive_example(self, target):
        """Check if the example is positive"""
        return target == 'Yes'

    def generalize_hypothesis(self, example, current_hypothesis):
        """
        Generalize the hypothesis to be consistent with the positive example
        """
        new_hypothesis = []
        for ex_val, hyp_val in zip(example, current_hypothesis):
            # If hypothesis value is 'ϕ' (null), use the example value
            if hyp_val == 'ϕ':
                new_hypothesis.append(ex_val)
            # If values match, keep the value
            elif ex_val == hyp_val:
                new_hypothesis.append(hyp_val)
            # If values don't match, generalize to '?'
            else:
                new_hypothesis.append('?')
        return new_hypothesis

    def fit(self, data, target_column):
        """
        Find the most specific hypothesis consistent with the training examples

        Parameters:
        data: pandas DataFrame containing the training examples
        target_column: name of the target column
        """
        # Separate features and target
        X = data.drop(columns=[target_column])
        y = data[target_column]

        # Store feature names
        self.features = X.columns.tolist()

        # Initialize hypothesis
        self.hypothesis = self.initialize_hypothesis(len(self.features))

        # Process each training example
        for index, row in X.iterrows():
            # Only consider positive examples
            if self.is_positive_example(y[index]):
                self.hypothesis = self.generalize_hypothesis(
                    row.values.tolist(),
                    self.hypothesis
                )
        return self.hypothesis

    def print_hypothesis(self):
        """Print the current hypothesis in a readable format"""
        if self.hypothesis and self.features:
            print("\nFinal Hypothesis:")
            print("〈", end='')
            for feature, value in zip(self.features, self.hypothesis):
                print(f"{feature} = {value}, ", end='')
            print("〉")
        else:
            print("No hypothesis found. Please run fit() first.")

def load_data(filename):
    """Load data from CSV file"""
    try:
        return pd.read_csv(filename)
    except FileNotFoundError:
        print(f"Error: File '{filename}' not found.")
        return None
    except Exception as e:
        print(f"Error loading data: {str(e)}")
        return None

def main():
    # Example usage with sample data
    print("Creating sample training data...")

    # Create sample data if no file is provided
    sample_data = {
        'Sky': ['Sunny', 'Sunny', 'Rainy', 'Sunny'],
        'Temperature': ['Warm', 'Warm', 'Cold', 'Warm'],
        'Humidity': ['High', 'High', 'High', 'High'],
        'Wind': ['Weak', 'Strong', 'Weak', 'Weak'],
        'PlayTennis': ['Yes', 'Yes', 'No', 'Yes']
    }
    df = pd.DataFrame(sample_data)

    print("\nTraining Data:")
    print(df)

    # Initialize and run Find-S algorithm
    print("\nRunning Find-S algorithm...")
    find_s = FindS()
    find_s.fit(df, target_column='PlayTennis')

    # Print results
    find_s.print_hypothesis()

    print("\nHypothesis Interpretation:")
    print("- '?' means any value is acceptable for that attribute")
    print("- 'ϕ' means no value has been observed (null)")
    print("- Specific values indicate required values for that attribute")

if __name__ == "__main__":
    main()

Output

Explanation

Key Concepts of Find-S Algorithm:

1. Purpose

 Find-S aims to find the most specific hypothesis that is consistent


with training examples

 It particularly focuses on positive training examples while


ignoring negative ones



 The algorithm tries to identify essential patterns in features that lead to
positive outcomes

2. Hypothesis Space

 Starts with the most specific hypothesis possible (null values)

 Gradually generalizes this hypothesis as it processes positive examples

 Uses three types of values in hypothesis:

o Specific values (required conditions)

o '?' (any value allowed)

o 'ϕ' (null/initial state)

3. Working Principle

 Only processes positive examples in the training data

 When a positive example is encountered, compares each attribute


with current hypothesis

 Generalizes hypothesis only when necessary to accommodate new


positive examples

 Never becomes more specific once generalized

4. Generalization Rules

 If attribute matches current hypothesis: Keep current value

 If current hypothesis is null (ϕ): Use the example's value

 If mismatch occurs: Generalize to '?' (any value acceptable)

5. Advantages



 Simple to understand and implement

 Computationally efficient

 Works well with consistent data

 Provides clear, interpretable results

6. Limitations

 Ignores negative examples completely

 Cannot handle inconsistent training data

 May not find the most general hypothesis

 Assumes noise-free training data

7. Applications

 Concept learning problems

 Pattern recognition

 Simple classification tasks

 Educational purposes to understand basic machine learning concepts

8. Example Scenario

 Consider learning when to play tennis based on weather conditions

 Features might include sky condition, temperature, humidity, wind

 Algorithm learns which conditions must be present for playing tennis

 Gradually generalizes conditions that aren't strictly necessary
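
Tracing the algorithm on the sample data used in the code above makes this concrete; only the three positive (PlayTennis = Yes) rows are processed:

Initial hypothesis:                 ⟨ϕ, ϕ, ϕ, ϕ⟩
After (Sunny, Warm, High, Weak):    ⟨Sunny, Warm, High, Weak⟩
After (Sunny, Warm, High, Strong):  ⟨Sunny, Warm, High, ?⟩   (Wind differs, so it generalizes)
After (Sunny, Warm, High, Weak):    ⟨Sunny, Warm, High, ?⟩   (already consistent, no change)

The single negative example (Rainy, Cold, High, Weak) is skipped entirely, which is exactly why Find-S cannot detect inconsistencies involving negative data.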



Experiment-05

Develop a program to implement the k-Nearest Neighbour algorithm to classify 100 randomly generated values of x in the range [0,1]. Perform the following based on the dataset generated.

a. Label the first 50 points {x1,……,x50} as follows: if (xi ≤ 0.5), then xi ∊ Class1, else xi ∊ Class2
b. Classify the remaining points, x51,……,x100 using KNN. Perform this for k=1,2,3,4,5,20,30

Code:

import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

class KNN:
    def __init__(self, k):
        self.k = k
        self.X_train = None
        self.y_train = None

    def fit(self, X, y):
        """Store training data"""
        self.X_train = X
        self.y_train = y

    def predict(self, X):
        """Predict class for each input value"""
        predictions = []
        for x in X:
            # Calculate distances to all training points
            distances = np.abs(self.X_train - x)

            # Get indices of k nearest neighbors
            k_nearest_indices = np.argsort(distances)[:self.k]

            # Get classes of k nearest neighbors
            k_nearest_labels = self.y_train[k_nearest_indices]

            # Perform majority voting
            most_common = Counter(k_nearest_labels).most_common(1)
            predictions.append(most_common[0][0])
        return np.array(predictions)

def generate_data():
    """Generate and label the dataset"""
    # Generate 100 random points in [0,1]
    np.random.seed(42)  # For reproducibility
    X = np.random.rand(100)

    # Label first 50 points
    y = np.zeros(100)
    y[:50] = np.where(X[:50] <= 0.5, 1, 2)
    return X, y

def plot_results(X_train, y_train, X_test, y_pred, k):
    """Plot the results for a given k value"""
    plt.figure(figsize=(12, 4))

    # Plot training data
    plt.scatter(X_train[y_train == 1], np.zeros_like(X_train[y_train == 1]),
                c='blue', label='Class 1 (Training)', marker='o')
    plt.scatter(X_train[y_train == 2], np.zeros_like(X_train[y_train == 2]),
                c='red', label='Class 2 (Training)', marker='o')

    # Plot test data predictions
    plt.scatter(X_test[y_pred == 1], np.ones_like(X_test[y_pred == 1]) * 0.1,
                c='lightblue', label='Class 1 (Predicted)', marker='^')
    plt.scatter(X_test[y_pred == 2], np.ones_like(X_test[y_pred == 2]) * 0.1,
                c='lightcoral', label='Class 2 (Predicted)', marker='^')

    plt.title(f'KNN Classification Results (k={k})')
    plt.xlabel('x')
    plt.yticks([])
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

def analyze_boundary_points(X_test, y_pred, k):
    """Analyze and print details about boundary points"""
    boundary_points = []

    # Find points where predictions change
    for i in range(1, len(y_pred)):
        if y_pred[i] != y_pred[i-1]:
            boundary_points.append(X_test[i])

    if boundary_points:
        print(f"\nDecision boundaries for k={k}:")
        for point in sorted(boundary_points):
            print(f"x = {point:.3f}")
    else:
        print(f"\nNo clear decision boundaries found for k={k}")

def main():
    # Generate data
    print("Generating dataset...")
    X, y = generate_data()

    # Split into training and test sets
    X_train, y_train = X[:50], y[:50]
    X_test, y_test = X[50:], y[50:]

    # Sort test data for better visualization
    sort_idx = np.argsort(X_test)
    X_test = X_test[sort_idx]

    # Try different k values
    k_values = [1, 2, 3, 4, 5, 20, 30]
    for k in k_values:
        print(f"\nPerforming classification with k={k}")

        # Create and train KNN classifier
        knn = KNN(k=k)
        knn.fit(X_train, y_train)

        # Make predictions
        y_pred = knn.predict(X_test)

        # Plot results
        plot_results(X_train, y_train, X_test, y_pred, k)

        # Analyze decision boundaries
        analyze_boundary_points(X_test, y_pred, k)

        # Calculate and print summary statistics
        class1_pred = np.sum(y_pred == 1)
        class2_pred = np.sum(y_pred == 2)
        print(f"\nPrediction Summary for k={k}:")
        print(f"Class 1: {class1_pred} points ({class1_pred/len(y_pred)*100:.1f}%)")
        print(f"Class 2: {class2_pred} points ({class2_pred/len(y_pred)*100:.1f}%)")

if __name__ == "__main__":
    main()


Output

Explanation

1. Core KNN Implementation

 The KNN class implements the K-Nearest Neighbors algorithm with


two main methods:

o fit: Stores training data and labels

o predict: Makes predictions by finding k nearest neighbors


and using majority voting

 The algorithm uses absolute distance (np.abs) to measure proximity


between points

 For each test point, it finds k closest training points and takes a
majority vote
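
A minimal numeric sketch of a single prediction (hypothetical training values, k = 3):

import numpy as np
from collections import Counter

X_train = np.array([0.10, 0.25, 0.40, 0.60, 0.90])   # training points
y_train = np.array([1, 1, 1, 2, 2])                  # their class labels
x = 0.48                                             # test point to classify

distances = np.abs(X_train - x)             # [0.38, 0.23, 0.08, 0.12, 0.42]
nearest = np.argsort(distances)[:3]         # indices of the 3 closest: [2, 3, 1]
votes = y_train[nearest]                    # [1, 2, 1]
print(Counter(votes).most_common(1)[0][0])  # majority vote -> class 1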

2. Data Generation

 Creates a synthetic dataset with 100 random points in range [0,1]



 First 50 points are labeled based on a simple rule:

o Points ≤ 0.5 get label 1

o Points > 0.5 get label 2

 Data is split into training (first 50 points) and testing (remaining


50 points)

3. Visualization Components

 plot_results function creates visual representation showing:

o Training data points (blue for class 1, red for class 2)

o Predicted classifications (light blue/coral triangles)

o Clear legend and grid for better readability

o Uses different markers for training (circles) vs


predictions (triangles)

4. Decision Boundary Analysis

 analyze_boundary_points function:

o Identifies points where predictions change from one class


to another

o Prints the x-coordinates of these boundary points

o Helps understand where the algorithm switches between classes

5. Main Execution Flow

 Tests multiple k values: [1, 2, 3, 4, 5, 20, 30]

 For each k value:



o Creates and trains KNN classifier

o Makes predictions on test data

o Visualizes results

o Analyzes decision boundaries

o Prints summary statistics (percentage of each class)

6. Key Features

 Uses numpy for efficient numerical computations

 Implements Counter for majority voting

 Includes comprehensive visualization

 Provides detailed analysis of classification boundaries

 Shows impact of different k values on predictions

7. Insights from Implementation

 Smaller k values lead to more complex decision boundaries

 Larger k values create smoother, more generalized boundaries

 The choice of k significantly impacts classification results

 Visualization helps understand algorithm behavior



Experiment-06

Implement the non-parametric Locally Weighted Regression algorithm in order to fit data points. Select an appropriate data set for your experiment and draw graphs.

Code:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression

def generate_sample_data(n_samples=100, noise=10):
    """Generate sample data with non-linear pattern"""
    X = np.linspace(0, 10, n_samples)
    y = 2 * np.sin(X) + X/2 + np.random.normal(0, noise/10, n_samples)
    return X, y

def kernel(x, x_i, tau=0.5):
    """Gaussian kernel function for weight calculation"""
    return np.exp(-(x - x_i)**2 / (2 * tau**2))

def lowess(X, y, x_pred, tau=0.5):
    """
    Locally Weighted Regression implementation

    Parameters:
    X : array-like
        Training input features
    y : array-like
        Target values
    x_pred : array-like
        Points at which to make predictions
    tau : float
        Bandwidth parameter controlling smoothness

    Returns:
    array-like
        Predicted values at x_pred points
    """
    # Ensure arrays are 1D
    X = np.ravel(X)
    y = np.ravel(y)
    x_pred = np.ravel(x_pred)

    y_pred = []
    for x in x_pred:
        # Calculate weights for all points
        weights = kernel(x, X, tau)

        # Weighted least squares matrices
        W = np.diag(weights)
        X_aug = np.column_stack([np.ones_like(X), X])  # Add bias term

        # Calculate weighted least squares parameters
        theta = np.linalg.inv(X_aug.T @ W @ X_aug) @ X_aug.T @ W @ y

        # Make prediction
        x_aug = np.array([1, x])
        y_pred.append(float(x_aug @ theta))

    return np.array(y_pred)

# Generate sample data
np.random.seed(42)
X, y = generate_sample_data(n_samples=100, noise=10)

# Generate points for prediction
X_pred = np.linspace(0, 10, 200)

# Fit LOWESS with different bandwidth parameters
y_pred_smooth = lowess(X, y, X_pred, tau=0.3)   # More local fitting
y_pred_medium = lowess(X, y, X_pred, tau=0.8)   # Medium smoothing
y_pred_rough = lowess(X, y, X_pred, tau=2.0)    # More global fitting

# Plotting
plt.figure(figsize=(12, 6))
plt.scatter(X, y, color='blue', alpha=0.5, label='Data points')
plt.plot(X_pred, y_pred_smooth, 'r-', label='τ = 0.3 (More local)', linewidth=2)
plt.plot(X_pred, y_pred_medium, 'g-', label='τ = 0.8 (Medium)', linewidth=2)
plt.plot(X_pred, y_pred_rough, 'y-', label='τ = 2.0 (More global)', linewidth=2)
plt.xlabel('X')
plt.ylabel('y')
plt.title('Locally Weighted Regression with Different Bandwidth Parameters')
plt.legend()
plt.grid(True)
plt.show()

Output

Explanation

1. Data Generation

def generate_sample_data(n_samples=100, noise=10):

 Creates non-linear sample data using sine function



 Adds random noise to make it more realistic

 Pattern is: y = 2 * sin(x) + x/2 + noise

2. Kernel Function

def kernel(x, x_i, tau=0.5):
    return np.exp(-(x - x_i)**2 / (2 * tau**2))

 Implements Gaussian kernel for weight calculation

 Gives higher weights to nearby points

 tau (bandwidth) controls how quickly weight decreases with distance

 Smaller tau = more local fitting

 Larger tau = more global smoothing

3. LOWESS Implementation

def lowess(X, y, x_pred, tau=0.5):

Key steps:

 For each prediction point:

o Calculate weights for all training points using kernel function

o Create weight matrix (W) and augmented feature matrix (X_aug)

o Solve weighted least squares: θ = (X^T W X)^(-1) X^T W y

o Make prediction using calculated parameters
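
A minimal sketch of that weighted least squares solve for one query point x, in the same notation as the code (here using np.linalg.pinv instead of inv purely for numerical safety):

import numpy as np

def lowess_point(x, X, y, tau=0.5):
    # Predict y at a single query point x with locally weighted linear regression
    weights = np.exp(-(X - x) ** 2 / (2 * tau ** 2))   # Gaussian kernel weights
    W = np.diag(weights)
    X_aug = np.column_stack([np.ones_like(X), X])      # add bias column
    theta = np.linalg.pinv(X_aug.T @ W @ X_aug) @ X_aug.T @ W @ y
    return np.array([1.0, x]) @ theta

# Tiny usage example with made-up, nearly linear points
X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.9, 2.1, 2.9])
print(lowess_point(1.5, X, y))   # close to 1.5 for this data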



4. Visualization Setup

 Generates 100 sample points with noise

 Creates 200 evenly spaced points for prediction curve

 Tests three different bandwidth (tau) values:

o τ = 0.3: More local fitting (follows data closely)

o τ = 0.8: Medium smoothing

o τ = 2.0: More global fitting (smoother curve)

5. Key Characteristics of LOWESS

 Non-parametric regression technique

 Adapts to local structure of data

 Bandwidth parameter controls smoothness:

o Small tau: More flexible, might overfit

o Large tau: Smoother, might underfit

 Computationally intensive (calculates weights for each prediction)

6. Main Differences in Results

 Red line (τ = 0.3): Follows local variations closely

 Green line (τ = 0.8): Balanced between local and global

 Yellow line (τ = 2.0): Shows general trend, ignores local variations



7. Advantages and Disadvantages

Advantages:

 No assumption about global function shape

 Handles non-linear relationships well

 Flexible local fitting

Disadvantages:

 Computationally expensive

 Sensitive to bandwidth parameter

 Can perform poorly at boundaries



Experiment-07

Develop a program to demonstrate the working of Linear Regression and Polynomial Regression. Use Boston Housing Dataset for Linear Regression and Auto MPG Dataset (for vehicle fuel efficiency prediction) for Polynomial Regression.

Code:

import pandas as pd

# Load Boston Housing dataset
url = "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"
boston_df = pd.read_csv(url)

# Print column names
print("Available columns in the dataset:")
print(boston_df.columns.tolist())

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
import warnings

warnings.filterwarnings('ignore')

# Part 1: Linear Regression with Boston Housing Dataset
print("Part 1: Linear Regression - Boston Housing Dataset")
print("-" * 50)

# Load Boston Housing dataset
url = "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"
boston_df = pd.read_csv(url)

# Features and target (using correct column names)
X_boston = boston_df.drop('medv', axis=1)  # All columns except target
y_boston = boston_df['medv']               # median house value

# Print dataset info
print("\nDataset Information:")
print(f"Number of samples: {len(X_boston)}")
print(f"Number of features: {len(X_boston.columns)}")
print("\nFeatures:")
for name in X_boston.columns:
    print(f"- {name}")

# Split the data
X_train_boston, X_test_boston, y_train_boston, y_test_boston = train_test_split(
    X_boston, y_boston, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_boston_scaled = scaler.fit_transform(X_train_boston)
X_test_boston_scaled = scaler.transform(X_test_boston)

# Train Linear Regression model
lr_model = LinearRegression()
lr_model.fit(X_train_boston_scaled, y_train_boston)

# Make predictions
y_pred_boston = lr_model.predict(X_test_boston_scaled)

# Calculate metrics
mse_boston = mean_squared_error(y_test_boston, y_pred_boston)
rmse_boston = np.sqrt(mse_boston)
r2_boston = r2_score(y_test_boston, y_pred_boston)

print("\nLinear Regression Results:")
print(f"Mean Squared Error: {mse_boston:.2f}")
print(f"Root Mean Squared Error: {rmse_boston:.2f}")
print(f"R² Score: {r2_boston:.2f}")

# Feature importance analysis
feature_importance = pd.DataFrame({
    'Feature': X_boston.columns,
    'Coefficient': lr_model.coef_
})
feature_importance['Abs_Coefficient'] = abs(feature_importance['Coefficient'])
feature_importance = feature_importance.sort_values('Abs_Coefficient', ascending=False)

print("\nFeature Importance:")
print(feature_importance[['Feature', 'Coefficient']].to_string(index=False))

# Visualize feature importance
plt.figure(figsize=(12, 6))
plt.bar(feature_importance['Feature'], feature_importance['Coefficient'])
plt.xticks(rotation=45)
plt.title('Feature Importance in Boston Housing Price Prediction')
plt.xlabel('Features')
plt.ylabel('Coefficient Value')
plt.tight_layout()
plt.show()

# Plot actual vs predicted values
plt.figure(figsize=(10, 6))
plt.scatter(y_test_boston, y_pred_boston, alpha=0.5)
plt.plot([y_test_boston.min(), y_test_boston.max()],
         [y_test_boston.min(), y_test_boston.max()], 'r--', lw=2)
plt.xlabel('Actual Prices ($1000s)')
plt.ylabel('Predicted Prices ($1000s)')
plt.title('Actual vs Predicted Housing Prices')
plt.tight_layout()
plt.show()

# Part 2: Polynomial Regression with Auto MPG Dataset
print("\nPart 2: Polynomial Regression - Auto MPG Dataset")
print("-" * 50)

# Load Auto MPG dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin', 'Car Name']
df = pd.read_csv(url, names=column_names, delim_whitespace=True)

# Clean the data
df = df.replace('?', np.nan)
df = df.dropna()
df['Horsepower'] = df['Horsepower'].astype(float)

# Select features for polynomial regression
X_mpg = df[['Horsepower']].values
y_mpg = df['MPG'].values

# Scale features for polynomial regression
scaler_mpg = StandardScaler()
X_mpg_scaled = scaler_mpg.fit_transform(X_mpg)

# Split the data
X_train_mpg, X_test_mpg, y_train_mpg, y_test_mpg = train_test_split(
    X_mpg_scaled, y_mpg, test_size=0.2, random_state=42)

# Create and train models with different polynomial degrees
degrees = [1, 2, 3]
plt.figure(figsize=(15, 5))

for i, degree in enumerate(degrees, 1):
    # Create polynomial features
    poly_features = PolynomialFeatures(degree=degree)
    X_train_poly = poly_features.fit_transform(X_train_mpg)
    X_test_poly = poly_features.transform(X_test_mpg)

    # Train model
    poly_model = LinearRegression()
    poly_model.fit(X_train_poly, y_train_mpg)

    # Make predictions
    y_pred_poly = poly_model.predict(X_test_poly)

    # Calculate metrics
    mse_poly = mean_squared_error(y_test_mpg, y_pred_poly)
    rmse_poly = np.sqrt(mse_poly)
    r2_poly = r2_score(y_test_mpg, y_pred_poly)

    print(f"\nPolynomial Regression (degree {degree}) Results:")
    print(f"Mean Squared Error: {mse_poly:.2f}")
    print(f"Root Mean Squared Error: {rmse_poly:.2f}")
    print(f"R² Score: {r2_poly:.2f}")

    # Plot results
    plt.subplot(1, 3, i)
    plt.scatter(X_test_mpg, y_test_mpg, color='blue', alpha=0.5, label='Actual')

    # Sort points for smooth curve
    X_sort = np.sort(X_test_mpg, axis=0)
    X_sort_poly = poly_features.transform(X_sort)
    y_sort_pred = poly_model.predict(X_sort_poly)

    plt.plot(X_sort, y_sort_pred, color='red', label='Predicted')
    plt.xlabel('Horsepower (scaled)')
    plt.ylabel('MPG')
    plt.title(f'Polynomial Regression (degree {degree})')
    plt.legend()

plt.tight_layout()
plt.show()


Output

Explanation

1. Part 1: Linear Regression with Boston Housing Dataset

Key Components:

 Uses the Boston Housing dataset to predict house prices

 Features include various neighborhood characteristics

 Target variable is 'medv' (median house value)



Implementation Steps:

# Data Preparation

- Loads dataset from URL

- Splits features (X) and target (y)

- Uses train_test_split for data division

- Applies StandardScaler for feature normalization

# Model Training

- Creates LinearRegression model

- Fits model on scaled training data

- Makes predictions on test set

# Evaluation

- Calculates MSE, RMSE, and R² metrics

- Analyzes feature importance through coefficients

- Visualizes feature importance with bar plot

- Creates actual vs predicted scatter plot
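
The three reported metrics follow directly from the predicted and actual values; a minimal sketch with made-up numbers:

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([24.0, 21.5, 34.0, 19.9])   # hypothetical actual prices ($1000s)
y_pred = np.array([25.1, 20.0, 30.5, 21.0])   # hypothetical model predictions

mse = mean_squared_error(y_true, y_pred)      # average squared error
rmse = np.sqrt(mse)                           # back in the units of the target
r2 = r2_score(y_true, y_pred)                 # 1.0 is perfect, 0 matches a mean-only model
print(mse, rmse, r2)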

2. Part 2: Polynomial Regression with Auto MPG Dataset

Key Components:



 Uses Auto MPG dataset to predict fuel efficiency

 Focuses on Horsepower as main feature

 Tests three polynomial degrees (1, 2, 3)

Implementation Steps:

# Data Preparation

- Loads and cleans MPG dataset

- Handles missing values ('?')

- Scales features using StandardScaler

# Model Training

- Creates polynomial features for each degree

- Trains separate models for each degree

- Makes predictions using each model

# Evaluation

- Calculates metrics for each polynomial degree

- Creates subplots showing fit for each degree

- Compares performance across degrees
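
To see what the degree parameter actually does, here is a minimal sketch of PolynomialFeatures expanding a single feature column (values are made up):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[2.0], [3.0]])              # one feature, two samples

poly2 = PolynomialFeatures(degree=2)
print(poly2.fit_transform(x))
# [[1. 2. 4.]
#  [1. 3. 9.]]   -> columns are [1, x, x^2]

poly3 = PolynomialFeatures(degree=3)
print(poly3.fit_transform(x))             # columns are [1, x, x^2, x^3]

The linear model is then fit on these expanded columns, which is why higher degrees can bend the curve but also overfit.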

3. Key Visualizations:

 Feature importance bar chart for Boston Housing



 Actual vs predicted scatter plot for house prices

 Three subplots showing polynomial fits of different degrees

4. Important Metrics Tracked:

 Mean Squared Error (MSE)

 Root Mean Squared Error (RMSE)

 R² Score (coefficient of determination)

5. Key Insights:

 Shows how feature scaling improves model performance

 Demonstrates overfitting risk with higher polynomial degrees

 Illustrates importance of different features in housing prices



Experiment-08

Develop a program to demonstrate the working of the decision tree algorithm. Use Breast Cancer Data set for building the decision tree and apply this knowledge to classify a new sample.

Code:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Load the breast cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the decision tree classifier
dt_classifier = DecisionTreeClassifier(max_depth=4, random_state=42)
dt_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = dt_classifier.predict(X_test)

# Print model performance metrics
print("Model Performance Metrics:")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Malignant', 'Benign']))

# Create confusion matrix visualization
plt.figure(figsize=(10, 8))
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')

# Visualize the decision tree
plt.figure(figsize=(20, 10))
plot_tree(dt_classifier, feature_names=data.feature_names,
          class_names=['Malignant', 'Benign'], filled=True, rounded=True)
plt.title('Decision Tree Visualization')

# Function to classify a new sample
def classify_new_sample(sample, feature_names=data.feature_names):
    """
    Classify a new sample using the trained decision tree model.

    Parameters:
    sample (list or array): List of feature values in the same order as the training data
    feature_names (list): List of feature names for reference

    Returns:
    tuple: (prediction, probability)
    """
    sample = np.array(sample).reshape(1, -1)
    prediction = dt_classifier.predict(sample)
    probability = dt_classifier.predict_proba(sample)

    print("\nClassification Results:")
    print(f"Prediction: {'Benign' if prediction[0] == 1 else 'Malignant'}")
    print(f"Probability: Malignant: {probability[0][0]:.2f}, Benign: {probability[0][1]:.2f}")

    # Print feature importance for this prediction
    print("\nTop 5 Most Important Features:")
    importances = dict(zip(feature_names, dt_classifier.feature_importances_))
    sorted_importances = sorted(importances.items(), key=lambda x: x[1], reverse=True)[:5]
    for feature, importance in sorted_importances:
        print(f"{feature}: {importance:.4f}")

    return prediction[0], probability[0]

# Example of using the classifier with a new sample
# Using mean values from the dataset as an example
example_sample = X_train.mean(axis=0)
print("\nExample Classification:")
classify_new_sample(example_sample)

Output

Explanation

1. Data Preparation and Model Setup

# Loads breast cancer dataset from sklearn

# Features: Various cell nucleus measurements

# Target: Binary (Malignant/Benign)

# Splits data: 80% training, 20% testing

2. Model Configuration

 Uses DecisionTreeClassifier with:

o max_depth=4 (prevents overfitting)

o random_state=42 (reproducibility)

 Fits model using training data

3. Performance Evaluation Components:

 Classification Report shows:

o Precision: Accuracy of positive predictions

o Recall: Ability to find all positive cases

o F1-score: Balance between precision and recall

o Support: Number of samples per class
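
As a reminder of how those numbers relate to the confusion matrix, a minimal sketch with hypothetical counts (not taken from the actual run):

# Hypothetical 2x2 confusion matrix: rows = true class, columns = predicted class
#                 pred Malignant   pred Benign
# true Malignant        40              3
# true Benign            2             69

tp, fn, fp, tn = 69, 2, 3, 40            # treating "Benign" as the positive class

precision = tp / (tp + fp)               # 69 / 72, about 0.96
recall = tp / (tp + fn)                  # 69 / 71, about 0.97
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")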

4. Visualization Elements:

 Confusion Matrix Heatmap:



o Shows true vs predicted labels

o Blue intensity indicates number of cases

o Numbers show exact count of predictions

 Decision Tree Visualization:

o Shows complete tree structure

o max_depth=4 keeps it interpretable

o Color-coded nodes show class distribution

o Shows feature splits and thresholds

5. Sample Classification Function

def classify_new_sample(sample, feature_names):

Provides:

 Binary prediction (Malignant/Benign)

 Probability scores for each class

 Top 5 most influential features

 Feature importance scores

6. Key Features:

 Binary Classification Task

 Interpretable Model Structure

 Feature Importance Analysis

 Probability Estimates



 Visual Decision Path

7. Use Cases:

 Medical Diagnosis Support

 Feature Importance Understanding

 Risk Assessment

 Decision Process Visualization



Experiment-09

Develop a program to implement the Naive Bayesian classifier considering Olivetti Face
Data set for training.

Compute the accuracy of the classifier, considering a few test data sets.

Code:

import numpy as np

from sklearn.datasets import fetch_olivetti_faces

from sklearn.model_selection import train_test_split, cross_val_score from

sklearn.naive_bayes import GaussianNB

from sklearn.metrics import classification_report, confusion_matrix import

matplotlib.pyplot as plt

import seaborn as sns

# Load the Olivetti faces dataset faces

= fetch_olivetti_faces()

X = faces.data y =

faces.target

# Function to display sample faces

def display_sample_faces(X, y, num_samples=5):

"""Display sample faces from the dataset"""

BCSL606 | Machine Learning Lab Page 92


fig, axes = plt.subplots(1, num_samples, figsize=(12, 3))

for i, ax in enumerate(axes):

ax.imshow(X[i].reshape(64, 64), cmap='gray')

ax.set_title(f'Person {y[i]}')

ax.axis('off')

plt.tight_layout()

plt.show()

# Split the dataset into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,


random_state=42)

# Initialize and train the Naive Bayes classifier

nb_classifier = GaussianNB()

nb_classifier.fit(X_train, y_train)

# Make predictions

y_pred = nb_classifier.predict(X_test)

# Calculate accuracy

BCSL606 | Machine Learning Lab Page 93


accuracy = nb_classifier.score(X_test, y_test)

# Perform cross-validation

cv_scores = cross_val_score(nb_classifier, X, y, cv=5)

# Print performance metrics

print("Performance Metrics:") print(f"\

nAccuracy on test set: {accuracy:.4f}") print("\

nCross-validation scores:")

for i, score in enumerate(cv_scores, 1):

print(f"Fold {i}: {score:.4f}")

print(f"Mean CV Score: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")

print("\nClassification Report:")

print(classification_report(y_test, y_pred))

# Create confusion matrix visualization

plt.figure(figsize=(12, 8))

cm = confusion_matrix(y_test, y_pred)

sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')

BCSL606 | Machine Learning Lab Page 94


plt.title('Confusion Matrix')

plt.ylabel('True Label')

plt.xlabel('Predicted Label')

# Function to test the classifier on specific samples

def test_specific_samples(classifier, X_test, y_test, num_samples=5):

"""Test the classifier on specific samples and display results"""

# Randomly select samples

indices = np.random.choice(len(X_test), num_samples, replace=False)

X_samples = X_test[indices]

y_true = y_test[indices]

# Make predictions

y_pred = classifier.predict(X_samples)

probabilities = classifier.predict_proba(X_samples)

# Display results

fig, axes = plt.subplots(2, num_samples, figsize=(15, 6))

for i in range(num_samples):

# Display the face

BCSL606 | Machine Learning Lab Page 95


axes[0, i].imshow(X_samples[i].reshape(64, 64), cmap='gray')

axes[0, i].axis('off')

# Display prediction information

axes[1, i].axis('off')

prediction_text = f'True: {y_true[i]}\nPred: {y_pred[i]}\n'

prediction_text += f'Prob: {probabilities[i][y_pred[i]]:.2f}'

axes[1, i].text(0.5, 0.5, prediction_text,

ha='center', va='center')

# Add color coding for correct/incorrect predictions

if y_true[i] == y_pred[i]:

axes[0, i].set_title('Correct', color='green')

else:

axes[0, i].set_title('Incorrect', color='red')

plt.tight_layout()

plt.show()

# Display sample faces from the dataset

print("\nDisplaying sample faces from the dataset:")

display_sample_faces(X, y)

# Test the classifier on specific samples
print("\nTesting classifier on specific samples:")

test_specific_samples(nb_classifier, X_test, y_test)

# Function to analyze misclassifications

def analyze_misclassifications(X_test, y_test, y_pred):

"""Analyze and display misclassified samples"""

misclassified = X_test[y_test != y_pred]

true_labels = y_test[y_test != y_pred]

pred_labels = y_pred[y_test != y_pred]

print(f"\nTotal misclassifications: {len(misclassified)}")

# Display some misclassified examples

num_display = min(5, len(misclassified))

if num_display > 0:

fig, axes = plt.subplots(1, num_display, figsize=(12, 3))

for i in range(num_display):

if num_display == 1:

ax = axes

else:

ax = axes[i]

ax.imshow(misclassified[i].reshape(64, 64), cmap='gray')

ax.set_title(f'True: {true_labels[i]}\nPred: {pred_labels[i]}')

ax.axis('off')

plt.tight_layout()

plt.show()

# Analyze misclassifications
print("\nAnalyzing misclassifications:")

analyze_misclassifications(X_test, y_test, y_pred)

Output

(Running the program displays sample faces from the dataset, the test-set accuracy and cross-validation scores, the classification report, the confusion-matrix heatmap, and sample predictions with their probabilities.)
Explanation

1. Dataset and Setup

• Uses Olivetti faces dataset (400 images of 40 people)

• Each image is 64x64 pixels in grayscale

• Features are flattened pixel values

• Target is person identifier (0-39)
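
As a quick check on these numbers, the dataset object reports them directly (a small illustrative snippet, separate from the program above):

from sklearn.datasets import fetch_olivetti_faces

faces = fetch_olivetti_faces()
print(faces.data.shape)     # (400, 4096): 400 images, 64*64 flattened pixel features
print(faces.images.shape)   # (400, 64, 64): the same images kept in 2-D form
print(faces.target.min(), faces.target.max())  # 0 39: person identifiers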

2. Key Functions:

a) Display Sample Faces:

def display_sample_faces(X, y, num_samples=5):

• Shows sample faces from dataset

• Displays grayscale images with person ID

• Helps visualize input data

b) Test Specific Samples:

def test_specific_samples(classifier, X_test, y_test, num_samples=5):

• Tests classifier on random samples

• Shows both image and predictions

• Color codes correct (green) vs incorrect (red) predictions

• Displays prediction probabilities

c) Analyze Misclassifications:

def analyze_misclassifications(X_test, y_test, y_pred):

• Identifies misclassified faces

• Shows true vs predicted labels

• Helps understand where model fails

3. Model Implementation

• Uses GaussianNB (Gaussian Naive Bayes); the scoring rule it applies is sketched after this list

• Performs 80-20 train-test split

• Includes cross-validation (5 folds)
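
For intuition, Gaussian Naive Bayes fits one mean and one variance per pixel for each person and scores a test face by summing per-pixel Gaussian log-likelihoods plus the class prior. The sketch below re-implements that scoring in NumPy for a single sample; it is illustrative only and assumes X_train, y_train and a test face x laid out as in the program:

import numpy as np

def gaussian_nb_predict(X_train, y_train, x, var_smoothing=1e-9):
    """Predict the class of one sample x with hand-rolled Gaussian Naive Bayes."""
    classes = np.unique(y_train)
    scores = []
    for c in classes:
        Xc = X_train[y_train == c]
        mean = Xc.mean(axis=0)
        var = Xc.var(axis=0) + var_smoothing   # smoothing avoids zero variance
        log_prior = np.log(len(Xc) / len(X_train))
        # Sum of independent per-pixel Gaussian log-densities
        log_likelihood = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
        scores.append(log_prior + log_likelihood)
    return classes[int(np.argmax(scores))]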

4. Performance Evaluation:

• Accuracy on test set

• Cross-validation scores

• Detailed classification report

• Confusion matrix visualization

• Misclassification analysis

5. Visualization Components:

• Sample face display

• Confusion matrix heatmap

• Test results with probability scores

• Misclassified examples

6. Key Features:

• Face recognition capability

• Probability estimation

• Error analysis

• Visual result presentation

• Cross-validation performance

7. Notable Aspects:

• Handles high-dimensional data (4096 pixels); an optional dimensionality-reduction variant is sketched below

• Provides probability estimates

• Visual feedback for predictions

• Comprehensive error analysis
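
With only a few training images per person and 4096 raw pixel features, the per-pixel Gaussian assumption is stretched thin. One common variant (an optional extension, not part of the lab listing) projects the pixels onto principal components before the Naive Bayes step:

from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

faces = fetch_olivetti_faces()
# Reduce 4096 pixels to 100 principal components, then fit Gaussian Naive Bayes
model = make_pipeline(PCA(n_components=100, whiten=True, random_state=42), GaussianNB())
scores = cross_val_score(model, faces.data, faces.target, cv=5)
print(f"Mean CV accuracy (PCA + GaussianNB): {scores.mean():.4f}")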

Experiment-10

Develop a program to implement k-means clustering using the Wisconsin Breast Cancer data set and visualize the clustering result.

Code:

import numpy as np

import pandas as pd

from sklearn.datasets import load_breast_cancer

from sklearn.preprocessing import StandardScaler

from sklearn.cluster import KMeans

from sklearn.decomposition import PCA

from sklearn.metrics import silhouette_score

import matplotlib.pyplot as plt

import seaborn as sns

# Load the Wisconsin Breast Cancer dataset
data = load_breast_cancer()

X = data.data

y = data.target # We'll use this only for evaluation

# Create a DataFrame with feature names

df = pd.DataFrame(X, columns=data.feature_names)

# Standardize the features

scaler = StandardScaler()

X_scaled = scaler.fit_transform(X)

# Function to determine optimal k using elbow method

def plot_elbow_curve(X, max_k=10):

inertias = []

silhouette_scores = []

k_values = range(2, max_k + 1)

for k in k_values:

kmeans = KMeans(n_clusters=k, random_state=42)

kmeans.fit(X)

inertias.append(kmeans.inertia_)

silhouette_scores.append(silhouette_score(X, kmeans.labels_))

# Plot elbow curve

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Inertia plot

ax1.plot(k_values, inertias, 'bo-')

ax1.set_xlabel('Number of Clusters (k)')

ax1.set_ylabel('Inertia')

ax1.set_title('Elbow Method')

# Silhouette score plot

ax2.plot(k_values, silhouette_scores, 'ro-')

ax2.set_xlabel('Number of Clusters (k)')

ax2.set_ylabel('Silhouette Score')

ax2.set_title('Silhouette Analysis')

plt.tight_layout()

plt.show()

return k_values[np.argmax(silhouette_scores)]

# Find optimal k

optimal_k = plot_elbow_curve(X_scaled)

print(f"\nOptimal number of clusters based on silhouette score: {optimal_k}")

# Perform k-means clustering with optimal k

kmeans = KMeans(n_clusters=optimal_k, random_state=42)

cluster_labels = kmeans.fit_predict(X_scaled)

# Perform PCA for visualization

pca = PCA(n_components=2)

X_pca = pca.fit_transform(X_scaled)

# Create visualization of clusters

plt.figure(figsize=(12, 8))

scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=cluster_labels, cmap='viridis')

plt.title('K-means Clustering Results (PCA-reduced data)')

plt.xlabel('First Principal Component')

plt.ylabel('Second Principal Component')

plt.colorbar(scatter, label='Cluster')

plt.show()

# Compare clustering results with actual diagnosis

comparison_df = pd.DataFrame({

'Cluster': cluster_labels,

'Actual_Diagnosis': y

})

print("\nCluster vs Actual Diagnosis Distribution:")

print(pd.crosstab(comparison_df['Cluster'], comparison_df['Actual_Diagnosis'],

values=np.zeros_like(cluster_labels), aggfunc='count'))

# Analyze cluster characteristics

def analyze_clusters(X, labels, feature_names):

"""Analyze and visualize characteristics of each cluster"""

# Create DataFrame with features and cluster labels

df_analysis = pd.DataFrame(X, columns=feature_names)

df_analysis['Cluster'] = labels

# Calculate mean values for each feature in each cluster

cluster_means = df_analysis.groupby('Cluster').mean()

# Create heatmap of cluster characteristics

plt.figure(figsize=(15, 8))

sns.heatmap(cluster_means, cmap='coolwarm', center=0, annot=True, fmt='.2f',
            xticklabels=True, yticklabels=True)

plt.title('Cluster Characteristics (Feature Means)')

plt.xticks(rotation=45, ha='right')

plt.tight_layout()

plt.show()

return cluster_means

# Analyze cluster characteristics
print("\nAnalyzing cluster characteristics:")

cluster_means = analyze_clusters(X_scaled, cluster_labels, data.feature_names)

# Visualize feature importance for clustering

def plot_feature_importance(kmeans, feature_names):

"""Plot feature importance based on cluster centroids"""

# Calculate the variance of centroids for each feature

centroid_variance = np.var(kmeans.cluster_centers_, axis=0)

# Create DataFrame for feature importance

feature_importance = pd.DataFrame({

'Feature': feature_names,

'Importance': centroid_variance

}).sort_values('Importance', ascending=False)

# Plot feature importance

plt.figure(figsize=(12, 6))

sns.barplot(x='Importance', y='Feature', data=feature_importance.head(10))

plt.title('Top 10 Most Important Features for Clustering')

plt.tight_layout()

plt.show()

return feature_importance

# Plot feature importance
print("\nAnalyzing feature importance:")

feature_importance = plot_feature_importance(kmeans, data.feature_names)

# Function to predict cluster for new samples

def predict_cluster(sample, scaler, kmeans, feature_names):

"""Predict cluster for a new sample"""

# Ensure sample is in correct format

if isinstance(sample, list):

sample = np.array(sample).reshape(1, -1)

# Scale the sample

sample_scaled = scaler.transform(sample)

# Predict cluster

cluster = kmeans.predict(sample_scaled)[0]

# Get distances to all cluster centers

distances = kmeans.transform(sample_scaled)[0]

print(f"\nPredicted Cluster: {cluster}")

print("\nDistances to cluster centers:")

for i, dist in enumerate(distances):

print(f"Cluster {i}: {dist:.2f}")

return cluster, distances

# Example of using the prediction function
print("\nExample prediction for a new sample:")

example_sample = X[0:1]  # Using first sample as example

predicted_cluster, distances = predict_cluster(example_sample, scaler, kmeans, data.feature_names)

Output

(Running the program displays the elbow-method and silhouette-analysis plots, the PCA-reduced cluster scatter plot, the cluster-vs-diagnosis cross-tabulation, the cluster-characteristics heatmap, the top-10 feature-importance bar plot, and an example cluster prediction with distances to each center.)
Explanation

1. Data Preparation:

# Loads breast cancer dataset

# Standardizes features using StandardScaler (see the sketch after this list)

# Creates DataFrame with feature names
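
Standardization matters here because k-means relies on Euclidean distances, so features on large scales (such as mean area) would otherwise dominate features on small scales (such as smoothness). Each feature is rescaled to zero mean and unit variance; a small check, illustrative only and using the X array from the program:

import numpy as np
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)
manual = (X - X.mean(axis=0)) / X.std(axis=0)   # z = (x - mean) / std, per feature
print(np.allclose(X_scaled, manual))            # True: identical z-score transform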

2. Key Functions:

a) Elbow Method Analysis:

def plot_elbow_curve(X, max_k=10):

• Determines optimal number of clusters

• Plots inertia (within-cluster sum of squares); see the sketch after this list

• Calculates silhouette scores

• Returns optimal k based on silhouette analysis
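
Inertia, the quantity plotted on the elbow curve, is simply the sum of squared distances from each sample to its assigned centroid. The short sketch below recomputes kmeans.inertia_ by hand; it is illustrative only and assumes the X_scaled array from the program:

import numpy as np
from sklearn.cluster import KMeans

km = KMeans(n_clusters=2, random_state=42).fit(X_scaled)
assigned_centers = km.cluster_centers_[km.labels_]        # centroid of each sample
manual_inertia = float(np.sum((X_scaled - assigned_centers) ** 2))
print(km.inertia_, manual_inertia)                        # the two values agree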

b) Cluster Analysis:

def analyze_clusters(X, labels, feature_names):

• Calculates mean values for each feature per cluster

• Creates heatmap of cluster characteristics

• Shows feature patterns in each cluster

c) Feature Importance:

def plot_feature_importance(kmeans, feature_names):

• Calculates feature importance based on centroid variance

• Visualizes top 10 most important features

• Helps understand which features drive clustering

3. Visualization Components:

• Elbow curve and silhouette score plots

• PCA-reduced cluster visualization

• Cluster characteristics heatmap

• Feature importance bar plot

4. Model Implementation:

• Uses optimal k from silhouette analysis

• Performs clustering on standardized data

• Reduces dimensionality with PCA for visualization

• Compares clusters with actual diagnosis (a label-matching sketch follows this list)
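
K-means labels are arbitrary (cluster 0 is not automatically "malignant"), so comparing them with the diagnosis requires mapping each cluster to its majority class first. A minimal sketch of that comparison, illustrative only and assuming the cluster_labels and y arrays from the program:

import numpy as np

def clustering_agreement(cluster_labels, y_true):
    """Map each cluster to its majority diagnosis and return the agreement rate."""
    mapped = np.empty_like(y_true)
    for c in np.unique(cluster_labels):
        members = cluster_labels == c
        mapped[members] = np.bincount(y_true[members]).argmax()  # majority vote
    return float(np.mean(mapped == y_true))

print(f"Agreement with actual diagnosis: {clustering_agreement(cluster_labels, y):.4f}")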

5. Cluster Prediction:

def predict_cluster(sample, scaler, kmeans, feature_names):

• Predicts cluster for new samples

• Shows distances to all cluster centers

• Provides confidence measure through distances
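
The distances returned by predict_cluster can be turned into a rough, uncalibrated confidence score, where nearer centers get larger weights. This is a heuristic sketch, not part of the lab listing, assuming the distances array produced by predict_cluster:

import numpy as np

def cluster_confidence(distances):
    """Heuristic confidence per cluster: inverse distance, normalised to sum to 1."""
    inv = 1.0 / (np.asarray(distances) + 1e-12)   # small epsilon avoids division by zero
    return inv / inv.sum()

print(np.round(cluster_confidence(distances), 3))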

6. Key Features:

• Automatic optimal cluster selection

• Dimensionality reduction for visualization

• Comprehensive cluster analysis

• Feature importance ranking

• New sample prediction capability

7. Analysis Components:

• Cluster vs actual diagnosis comparison

• Cluster characteristic analysis

• Feature importance visualization

• Distance-based prediction confidence
