Machine Learning (BCSL606) Lab Manual

Visvesvaraya Technological University (VTU)

Subject Code: BCSL606

Subject: Machine Learning Laboratory

Laboratory Components

1. Histograms and Boxplots Analysis (California Housing)

2. Correlation Matrix and Pair Plot (California Housing)

3. PCA Dimensionality Reduction (Iris Dataset)

4. Find-S Algorithm for Hypothesis Generation

5. k-Nearest Neighbors Classification (Generated Data)

6. Locally Weighted Regression Algorithm

7. Linear and Polynomial Regression (Boston Housing & Auto MPG)

8. Decision Tree Classifier (Breast Cancer Dataset)


9. Naive Bayes Classifier (Olivetti Face Dataset)

10. K-Means Clustering (Breast Cancer Dataset)

Experiment-01

Develop a program to create histograms for all numerical features and analyze the
distribution of each feature. Generate box plots for all numerical features and identify
any outliers. Use California Housing dataset.

Code:

!pip install pandas numpy matplotlib seaborn scikit-learn

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn.datasets import fetch_california_housing


# Set the style for better visualization

plt.style.use('tableau-colorblind10') # Using a built-in matplotlib style

def load_and_prepare_data():

"""Load California Housing dataset and convert to pandas DataFrame"""

housing = fetch_california_housing()

df = pd.DataFrame(housing.data, columns=housing.feature_names)

df['PRICE'] = housing.target

return df

def create_distribution_plots(df, save_plots=False):

"""Create histograms and box plots for all numerical features"""

numerical_features = df.columns

# Calculate number of rows needed for subplot grid

n_features = len(numerical_features)

n_rows = (n_features + 1) // 2 # 2 plots per row

# Create histograms

plt.figure(figsize=(15, 5*n_rows))

for idx, feature in enumerate(numerical_features, 1):

plt.subplot(n_rows, 2, idx)

sns.histplot(data=df, x=feature, kde=True)


plt.title(f'Distribution of {feature}')

plt.xlabel(feature)

plt.ylabel('Count')

plt.tight_layout()

if save_plots:

plt.savefig('histograms.png')

plt.show()

# Create box plots

plt.figure(figsize=(15, 5*n_rows))

for idx, feature in enumerate(numerical_features, 1):

plt.subplot(n_rows, 2, idx)

sns.boxplot(data=df[feature])

plt.title(f'Box Plot of {feature}')

plt.tight_layout()

if save_plots:

plt.savefig('boxplots.png')

plt.show()

def analyze_distributions(df):

"""Generate statistical summary and identify outliers"""

stats_summary = df.describe()


# Calculate IQR and identify outliers for each feature

outlier_summary = {}

for column in df.columns:

Q1 = df[column].quantile(0.25)

Q3 = df[column].quantile(0.75)

IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR

upper_bound = Q3 + 1.5 * IQR

outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)][column]

outlier_summary[column] = {
    'number_of_outliers': len(outliers),
    'percentage_of_outliers': (len(outliers) / len(df)) * 100,
    'outlier_range': f"< {lower_bound:.2f} or > {upper_bound:.2f}"
}

return stats_summary, outlier_summary

def main():

# Load the data

df = load_and_prepare_data()

# Create visualization plots


create_distribution_plots(df)

# Analyze distributions and outliers

stats_summary, outlier_summary = analyze_distributions(df)

# Print statistical summary

print("\nStatistical Summary:")

print(stats_summary)

# Print outlier analysis

print("\nOutlier Analysis:")

for feature, summary in outlier_summary.items():

print(f"\n{feature}:")

print(f"Number of outliers: {summary['number_of_outliers']}")

print(f"Percentage of outliers: {summary['percentage_of_outliers']:.2f}%")

print(f"Outlier range: {summary['outlier_range']}")

if __name__ == "__main__":

main()

Output


Explanation

Understanding California Housing Data Analysis

Introduction

The code performs an exploratory data analysis (EDA) on California housing data. EDA is a
crucial first step in understanding your dataset before performing any advanced analysis or
modeling. This analysis focuses on understanding the distribution of housing features and
prices across California.

Theory Behind Each Component

Data Loading and Preparation

The California Housing dataset is a standard dataset in scikit-learn containing housing prices and related features. The data preparation step converts this into a pandas DataFrame, which is a table-like structure where:

 Each row represents a different location in California

 Each column represents a different feature (like house price, income, population)

 The target variable (house price) is added as an additional column

Distribution Analysis

The code analyzes distributions through two main approaches:

1. Visual Analysis: The distribution plots help understand how values are spread across each feature:

o Histograms show the frequency distribution of values, revealing if data is normally distributed, skewed, or has multiple peaks

o Kernel Density Estimation (KDE) smooths the histogram to show the continuous probability distribution

o Box plots reveal the median, quartiles, and potential outliers in the data

2. Statistical Analysis: The code calculates key statistical measures:

o Descriptive statistics (mean, median, standard deviation) summarize central tendency and spread

o Interquartile Range (IQR) measures variability by finding the range between the 25th and 75th percentiles

o Outlier detection uses the 1.5 × IQR rule: any point beyond 1.5 times the IQR from the quartiles is considered an outlier

Visualization System

The visualization system uses matplotlib and seaborn libraries because:

 Matplotlib provides the foundation for creating plots

 Seaborn adds statistical plotting functions and improves plot aesthetics

 The tableau-colorblind10 style ensures accessibility and professional appearance

Statistical Methods Used

1. Descriptive Statistics

o Mean: Average value of each feature

o Standard deviation: Measure of data spread

o Quartiles: Values that divide data into four equal parts

o Min/Max: Range of values for each feature

2. Outlier Detection The IQR method is used because:

o It's resistant to extreme values

o Doesn't assume normal distribution

o Identifies values that are unusually high or low

o Formula: [Q1 - 1.5×IQR, Q3 + 1.5×IQR] defines the normal range
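A minimal sketch of this rule in NumPy (the numbers below are made up for illustration and are not taken from the housing data):

import numpy as np

values = np.array([1.2, 1.5, 1.7, 1.9, 2.0, 2.1, 2.3, 9.8])  # one obvious outlier
Q1, Q3 = np.percentile(values, [25, 75])
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
print(values[(values < lower) | (values > upper)])  # -> [9.8], outside the normal range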

Significance of Each Feature

The dataset includes these meaningful features:


 Median Income: Indicates area's economic status

 House Age: Represents property age

 Average Rooms/Bedrooms: Indicates house size

 Population and Occupancy: Shows area density

 Location (Latitude/Longitude): Captures geographical factors

 Price: Target variable showing house values

Purpose of Analysis Components

1. Distribution Plots

o Help identify patterns in data

o Show if variables are normally distributed

o Reveal potential data quality issues

o Highlight relationships between features

2. Statistical Summary

o Provides numerical understanding of data

o Helps identify unusual patterns

o Supports data-driven decisions

o Validates visual observations

3. Outlier Analysis

o Identifies unusual cases

o Helps understand extreme values

o Supports data cleaning decisions

o Reveals potential data errors

Expected Insights


This analysis helps understand:

 Typical housing prices in California

 How features vary across locations

 Unusual patterns or anomalies

 Relationships between features

 Data quality and reliability

The combination of visual and statistical analysis provides a comprehensive understanding of California's housing market characteristics, essential for further modeling or decision-making processes.


Experiment-02

Develop a program to compute the correlation matrix to understand the relationships between pairs of features. Visualize the correlation matrix using a heatmap to know which variables have strong positive/negative correlations. Create a pair plot to visualize pairwise relationships between features. Use California Housing dataset.

Code:

!pip install pandas numpy matplotlib seaborn scikit-learn

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn.datasets import fetch_california_housing

def load_and_prepare_data():

"""Load California Housing dataset and convert to pandas DataFrame"""

housing = fetch_california_housing()

df = pd.DataFrame(housing.data, columns=housing.feature_names)

df['PRICE'] = housing.target

return df

def compute_correlation_matrix(df):


"""Compute and return the correlation matrix"""

correlation_matrix = df.corr()

return correlation_matrix

def plot_correlation_heatmap(correlation_matrix):

"""Create a heatmap visualization of the correlation matrix"""

plt.figure(figsize=(12, 10))

# Create heatmap with correlation values

sns.heatmap(correlation_matrix,

annot=True, # Show correlation values

cmap='coolwarm', # Red for positive, blue for negative correlations

vmin=-1, vmax=1, # Fix the range of correlation values

center=0, # Center the colormap at 0

square=True, # Make the plot square-shaped

fmt='.2f') # Round correlation values to 2 decimal places

plt.title('Correlation Matrix Heatmap')

plt.tight_layout()

plt.show()


def create_pair_plot(df):

"""Create a pair plot to show relationships between all features"""

# Create pair plot

sns.pairplot(df, diag_kind='kde', plot_kws={'alpha': 0.6})

plt.tight_layout()

plt.show()

def analyze_correlations(correlation_matrix):

"""Analyze and print notable correlations"""

# Get upper triangle of the correlation matrix

upper_tri = correlation_matrix.where(np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool))

# Find strong correlations (absolute value > 0.5)

strong_correlations = []

for col in upper_tri.columns:

for idx, value in upper_tri[col].items():

if pd.notna(value) and abs(value) > 0.5:

strong_correlations.append({


'features': (idx, col),

'correlation': value

})

# Sort by absolute correlation value

strong_correlations.sort(key=lambda x: abs(x['correlation']), reverse=True)

return strong_correlations

def main():

# Load the data

print("Loading California Housing dataset...")

df = load_and_prepare_data()

# Compute correlation matrix

print("\nComputing correlation matrix...")

correlation_matrix = compute_correlation_matrix(df)

# Plot correlation heatmap

print("\nCreating correlation heatmap...")


plot_correlation_heatmap(correlation_matrix)

# Create pair plot

print("\nCreating pair plot (this may take a moment)...")

create_pair_plot(df)

# Analyze and print notable correlations

print("\nAnalyzing strong correlations...")

strong_correlations = analyze_correlations(correlation_matrix)

# Print results

print("\nStrong correlations found (|correlation| > 0.5):")

for corr in strong_correlations:

feature1, feature2 = corr['features']

correlation = corr['correlation']

correlation_type = "positive" if correlation > 0 else "negative"

print(f"{feature1} and {feature2}: {correlation:.3f} ({correlation_type}


correlation)")

if __name__ == "__main__":

main()


Output


Explanation

This code analyzes the California Housing dataset to understand how different
features in houses are related to each other.

The main purpose is to find correlations between different housing features. A correlation shows how strongly two features are related. For example, it can tell us if house prices tend to go up when the number of rooms increases.

Correlation values range from -1 to +1:

 +1 means perfect positive correlation (when one goes up, the other goes up)

 0 means no correlation (no relationship)

 -1 means perfect negative correlation (when one goes up, the other goes down)
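A tiny sketch of what these values look like in code (toy vectors, not dataset columns; np.corrcoef computes the same Pearson coefficient that df.corr() uses by default):

import numpy as np

a = np.array([1, 2, 3, 4, 5])
print(np.corrcoef(a, 2 * a)[0, 1])                      # +1.0: perfect positive correlation
print(np.corrcoef(a, -a)[0, 1])                         # -1.0: perfect negative correlation
print(np.corrcoef(a, np.array([4, 1, 5, 1, 4]))[0, 1])  #  0.0: no linear relationship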

The code creates two main visualizations:

1. A Correlation Heatmap:

 Shows all correlations in a color-coded matrix

 Red colors show positive correlations

 Blue colors show negative correlations

 Darker colors mean stronger relationships

 Numbers in each cell show the exact correlation value

2. A Pair Plot:

 Shows scatter plots for every pair of features

 Helps visualize relationships between variables


 Shows distribution of each feature on the diagonal

The code also automatically finds strong correlations (values above 0.5 or
below -0.5) and prints them, telling you which features are strongly related and
whether the relationship is positive or negative.

This analysis helps understand patterns in the housing market, like:

 Which features most strongly affect house prices

 Which features tend to occur together

 Whether features have expected or surprising relationships

1. Function: load_and_prepare_data()

o Purpose: Loads California Housing dataset

o Steps:

 Fetches data using sklearn's fetch_california_housing()

 Converts to pandas DataFrame

 Adds house prices as a target column

 Returns complete dataset

2. Function: compute_correlation_matrix(df)

o Purpose: Calculates correlations between all features

o Uses pandas' df.corr() to compute Pearson correlation coefficients

o Returns a matrix where values range from -1 to 1

 1: Perfect positive correlation

 0: No correlation


 -1: Perfect negative correlation

3. Function: plot_correlation_heatmap(correlation_matrix)

o Purpose: Creates visual heatmap of correlations

o Settings:

 Figure size: 12x10

 Shows actual correlation values (annot=True)

 Uses coolwarm color scheme (red=positive, blue=negative)

 Range: -1 to 1

 Formats numbers to 2 decimal places

4. Function: create_pair_plot(df)

o Purpose: Shows relationships between all pairs of features

o Uses seaborn's pairplot

o Settings:

 Diagonal: Kernel Density Estimation (kde)

 Alpha: 0.6 for transparency

 Shows scatter plots for all feature combinations

5. Function: analyze_correlations(correlation_matrix)

o Purpose: Identifies strong correlations

o Steps:

 Gets upper triangle of correlation matrix


 Finds correlations stronger than ±0.5

 Sorts results by correlation strength

 Returns list of strong correlations

6. Function: main()

o Purpose: Orchestrates the analysis workflow

o Process:

1. Loads housing data

2. Computes correlation matrix

3. Creates heatmap visualization

4. Generates pair plot

5. Analyzes strong correlations

6. Prints findings

7. Output Format

o Visual outputs:

 Correlation heatmap

 Pair plot matrix

o Text output:

 Lists strong correlations

 Shows correlation strength

 Indicates if correlation is positive/negative


Experiment-03

Develop a program to implement Principal Component Analysis (PCA) for reducing the dimensionality of the Iris dataset from 4 features to 2.

Code:

!pip install pandas numpy matplotlib scikit-learn

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

from sklearn.datasets import load_iris

from sklearn.preprocessing import StandardScaler

from sklearn.decomposition import PCA

def load_and_prepare_data():

"""Load Iris dataset and prepare it for PCA"""

# Load the iris dataset

iris = load_iris()

# Create a DataFrame with feature names

df = pd.DataFrame(data=iris.data, columns=iris.feature_names)


# Add target variable

df['target'] = iris.target

df['target_names'] = pd.Categorical.from_codes(iris.target, iris.target_names)

return df, iris.feature_names

def perform_pca(data, feature_names):

"""Perform PCA on the dataset"""

# Separate features

X = data[feature_names]

# Standardize the features

scaler = StandardScaler()

X_scaled = scaler.fit_transform(X)

# Apply PCA

pca = PCA(n_components=2)

X_pca = pca.fit_transform(X_scaled)

# Calculate explained variance ratio


explained_variance_ratio = pca.explained_variance_ratio_

# Get component loadings

loadings = pca.components_

return X_pca, explained_variance_ratio, loadings, pca

def plot_pca_results(X_pca, data, explained_variance_ratio):

"""Plot the PCA results"""

# Create figure

plt.figure(figsize=(10, 8))

# Create scatter plot for each class

targets = sorted(data['target'].unique())

target_names = sorted(data['target_names'].unique())

for target, target_name in zip(targets, target_names):

mask = data['target'] == target

plt.scatter(X_pca[mask, 0], X_pca[mask, 1],

label=target_name, alpha=0.8)


# Add labels and title

plt.xlabel(f'First Principal Component (Explains {explained_variance_ratio[0]:.2%} of variance)')

plt.ylabel(f'Second Principal Component (Explains {explained_variance_ratio[1]:.2%} of variance)')

plt.title('PCA of Iris Dataset')

plt.legend()

plt.grid(True, alpha=0.3)

plt.show()

def plot_explained_variance(pca):

"""Plot cumulative explained variance ratio"""

plt.figure(figsize=(10, 6))

cumsum = np.cumsum(pca.explained_variance_ratio_)

plt.plot(range(1, len(cumsum) + 1), cumsum, 'bo-')

plt.xlabel('Number of Components')

plt.ylabel('Cumulative Explained Variance Ratio')

plt.title('Explained Variance vs. Number of Components')

plt.grid(True, alpha=0.3)


plt.show()

def visualize_feature_importance(loadings, feature_names):

"""Visualize feature importance in each principal component"""

plt.figure(figsize=(12, 6))

# Plot for PC1

plt.subplot(1, 2, 1)

plt.bar(feature_names, loadings[0])

plt.title('Feature Weights in First Principal Component')

plt.xticks(rotation=45)

# Plot for PC2

plt.subplot(1, 2, 2)

plt.bar(feature_names, loadings[1])

plt.title('Feature Weights in Second Principal Component')

plt.xticks(rotation=45)

plt.tight_layout()

plt.show()


def main():

# Load and prepare data

print("Loading Iris dataset...")

data, feature_names = load_and_prepare_data()

# Perform PCA

print("\nPerforming PCA...")

X_pca, explained_variance_ratio, loadings, pca = perform_pca(data, feature_names)

# Print explained variance

print("\nExplained Variance Ratio:")

print(f"PC1: {explained_variance_ratio[0]:.2%}")

print(f"PC2: {explained_variance_ratio[1]:.2%}")

print(f"Total: {sum(explained_variance_ratio):.2%}")

# Plot results

print("\nCreating visualizations...")

plot_pca_results(X_pca, data, explained_variance_ratio)

plot_explained_variance(pca)


visualize_feature_importance(loadings, feature_names)

# Print feature importance

print("\nFeature Weights in Principal Components:")

for i, component in enumerate(loadings):

print(f"\nPrincipal Component {i+1}:")

for fname, weight in zip(feature_names, component):

print(f"{fname}: {weight:.3f}")

if __name__ == "__main__":

main()

Output


Explanation

Basic Theory:

PCA is a technique that reduces the dimensionality of data while preserving as much important information as possible. It transforms high-dimensional data into a new set of features called principal components.
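As a compact sketch of the idea (assuming only NumPy; this computes the same two components as scikit-learn's PCA on standardized data, up to a possible sign flip, by eigen-decomposing the covariance matrix):

import numpy as np

def pca_2d(X):
    # Standardize, then project onto the two directions of largest variance
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xs, rowvar=False))
    order = np.argsort(eigvals)[::-1]                 # largest eigenvalues first
    top2 = eigvecs[:, order[:2]]
    explained = eigvals[order[:2]] / eigvals.sum()    # fraction of variance kept
    return Xs @ top2, explained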

Code Functions:

1. load_and_prepare_data()

o Loads the famous Iris dataset (contains measurements of different iris flowers)

o Creates a DataFrame with flower measurements and their species names

o Each row represents one flower with its features and species type

2. perform_pca()

o Standardizes the data (makes all features have same scale)

o Applies PCA to reduce data to 2 dimensions

o Returns:

 Transformed data

 How much information each component preserves

 Feature weights in each component

3. plot_pca_results()

o Creates a scatter plot showing flowers in the new 2D space

o Different colors for different iris species

o Shows how well species are separated after PCA

o Labels show how much variance each component explains

4. plot_explained_variance()

o Shows how much total information is preserved as we add components

o Helps decide how many components to keep


5. visualize_feature_importance()

o Creates bar plots showing which original features contribute most to each principal component

o Helps understand what each new component means

What the Code Does:

1. Takes 4-dimensional iris flower measurements

2. Reduces them to 2 dimensions while keeping most important patterns

3. Shows how well different iris species can be distinguished

4. Tells us which original measurements are most important

Why This is Useful:

 Helps visualize high-dimensional data

 Finds most important patterns in the data

 Shows which original features matter most

 Can help classify different types of iris flowers using fewer measurements


Experiment-04

For a given set of training data examples stored in a .CSV file, implement and
demonstrate the Find-S algorithm to output a description of the set of all hypotheses
consistent with the training examples

Code:

import pandas as pd

import numpy as np

class FindS:

def __init__(self):

self.hypothesis = None

self.features = None

def initialize_hypothesis(self, num_features):

"""Initialize the most specific hypothesis"""

return ['ϕ'] * num_features

def is_positive_example(self, target):

"""Check if the example is positive"""

return target == 'Yes'


def generalize_hypothesis(self, example, current_hypothesis):

"""

Generalize the hypothesis to be consistent with the positive example

"""

new_hypothesis = []

for ex_val, hyp_val in zip(example, current_hypothesis):

# If hypothesis value is 'ϕ' (null), use the example value

if hyp_val == 'ϕ':

new_hypothesis.append(ex_val)

# If values match, keep the value

elif ex_val == hyp_val:

new_hypothesis.append(hyp_val)

# If values don't match, generalize to '?'

else:

new_hypothesis.append('?')

return new_hypothesis

def fit(self, data, target_column):


"""

Find the most specific hypothesis consistent with the training examples

Parameters:

data: pandas DataFrame containing the training examples

target_column: name of the target column

"""

# Separate features and target

X = data.drop(columns=[target_column])

y = data[target_column]

# Store feature names

self.features = X.columns.tolist()

# Initialize hypothesis

self.hypothesis = self.initialize_hypothesis(len(self.features))

# Process each training example

for index, row in X.iterrows():

# Only consider positive examples


if self.is_positive_example(y[index]):

    self.hypothesis = self.generalize_hypothesis(
        row.values.tolist(),
        self.hypothesis
    )

return self.hypothesis

def print_hypothesis(self):

"""Print the current hypothesis in a readable format"""

if self.hypothesis and self.features:

print("\nFinal Hypothesis:")

print("〈", end='')

for feature, value in zip(self.features, self.hypothesis):

print(f"{feature} = {value}, ", end='')

print("〉")

else:

print("No hypothesis found. Please run fit() first.")

def load_data(filename):


"""Load data from CSV file"""

try:

return pd.read_csv(filename)

except FileNotFoundError:

print(f"Error: File '{filename}' not found.")

return None

except Exception as e:

print(f"Error loading data: {str(e)}")

return None

def main():

# Example usage with sample data

print("Creating sample training data...")

# Create sample data if no file is provided

sample_data = {
    'Sky': ['Sunny', 'Sunny', 'Rainy', 'Sunny'],
    'Temperature': ['Warm', 'Warm', 'Cold', 'Warm'],
    'Humidity': ['High', 'High', 'High', 'High'],
    'Wind': ['Weak', 'Strong', 'Weak', 'Weak'],
    'PlayTennis': ['Yes', 'Yes', 'No', 'Yes']
}

df = pd.DataFrame(sample_data)

print("\nTraining Data:")

print(df)

# Initialize and run Find-S algorithm

print("\nRunning Find-S algorithm...")

find_s = FindS()

find_s.fit(df, target_column='PlayTennis')

# Print results

find_s.print_hypothesis()

print("\nHypothesis Interpretation:")

print("- '?' means any value is acceptable for that attribute")

print("- 'ϕ' means no value has been observed (null)")

print("- Specific values indicate required values for that attribute")


if __name__ == "__main__":

main()

Output

Explanation

Key Concepts of Find-S Algorithm:

1. Purpose

 Find-S aims to find the most specific hypothesis that is consistent with
training examples

 It particularly focuses on positive training examples while ignoring negative ones

 The algorithm tries to identify essential patterns in features that lead to positive outcomes

2. Hypothesis Space

 Starts with the most specific hypothesis possible (null values)

 Gradually generalizes this hypothesis as it processes positive examples

 Uses three types of values in hypothesis:

o Specific values (required conditions)

o '?' (any value allowed)

o 'ϕ' (null/initial state)

3. Working Principle

 Only processes positive examples in the training data

 When a positive example is encountered, compares each attribute with the current hypothesis

 Generalizes the hypothesis only when necessary to accommodate new positive examples

 Never becomes more specific once generalized

4. Generalization Rules

 If attribute matches current hypothesis: Keep current value

 If current hypothesis is null (ϕ): Use the example's value

 If mismatch occurs: Generalize to '?' (any value acceptable)
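Applying these rules to the sample PlayTennis data used in the code above (only the three positive examples are processed; the single negative example is skipped), the hypothesis evolves as follows:

Initial hypothesis:                  〈ϕ, ϕ, ϕ, ϕ〉
After (Sunny, Warm, High, Weak):     〈Sunny, Warm, High, Weak〉
After (Sunny, Warm, High, Strong):   〈Sunny, Warm, High, ?〉   (Wind differs, so it is generalized)
After (Sunny, Warm, High, Weak):     〈Sunny, Warm, High, ?〉   (already consistent, no change)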

5. Advantages

 Simple to understand and implement

 Computationally efficient


 Works well with consistent data

 Provides clear, interpretable results

6. Limitations

 Ignores negative examples completely

 Cannot handle inconsistent training data

 May not find the most general hypothesis

 Assumes noise-free training data

7. Applications

 Concept learning problems

 Pattern recognition

 Simple classification tasks

 Educational purposes to understand basic machine learning concepts

8. Example Scenario

 Consider learning when to play tennis based on weather conditions

 Features might include sky condition, temperature, humidity, wind

 Algorithm learns which conditions must be present for playing tennis

 Gradually generalizes conditions that aren't strictly necessary


Experiment-05

Develop a program to implement the k-Nearest Neighbour algorithm to classify the randomly generated 100 values of x in the range of [0,1]. Perform the following based on the dataset generated.

a. Label the first 50 points {x1, …, x50} as follows: if (xi ≤ 0.5), then xi ∊ Class1, else xi ∊ Class2
b. Classify the remaining points, x51, …, x100, using KNN. Perform this for k = 1, 2, 3, 4, 5, 20, 30

Code:

import numpy as np

import matplotlib.pyplot as plt

from collections import Counter

class KNN:

def __init__(self, k):

self.k = k

self.X_train = None

self.y_train = None

def fit(self, X, y):

"""Store training data"""

self.X_train = X

self.y_train = y


def predict(self, X):

"""Predict class for each input value"""

predictions = []

for x in X:

# Calculate distances to all training points

distances = np.abs(self.X_train - x)

# Get indices of k nearest neighbors

k_nearest_indices = np.argsort(distances)[:self.k]

# Get classes of k nearest neighbors

k_nearest_labels = self.y_train[k_nearest_indices]

# Perform majority voting

most_common = Counter(k_nearest_labels).most_common(1)

predictions.append(most_common[0][0])

return np.array(predictions)


def generate_data():

"""Generate and label the dataset"""

# Generate 100 random points in [0,1]

np.random.seed(42) # For reproducibility

X = np.random.rand(100)

# Label first 50 points

y = np.zeros(100)

y[:50] = np.where(X[:50] <= 0.5, 1, 2)

return X, y

def plot_results(X_train, y_train, X_test, y_pred, k):

"""Plot the results for a given k value"""

plt.figure(figsize=(12, 4))

# Plot training data

plt.scatter(X_train[y_train == 1], np.zeros_like(X_train[y_train == 1]),

c='blue', label='Class 1 (Training)', marker='o')


plt.scatter(X_train[y_train == 2], np.zeros_like(X_train[y_train == 2]),

c='red', label='Class 2 (Training)', marker='o')

# Plot test data predictions

plt.scatter(X_test[y_pred == 1], np.ones_like(X_test[y_pred == 1])*0.1,

c='lightblue', label='Class 1 (Predicted)', marker='^')

plt.scatter(X_test[y_pred == 2], np.ones_like(X_test[y_pred == 2])*0.1,

c='lightcoral', label='Class 2 (Predicted)', marker='^')

plt.title(f'KNN Classification Results (k={k})')

plt.xlabel('x')

plt.yticks([])

plt.legend()

plt.grid(True, alpha=0.3)

plt.show()

def analyze_boundary_points(X_test, y_pred, k):

"""Analyze and print details about boundary points"""

boundary_points = []


# Find points where predictions change

for i in range(1, len(y_pred)):

if y_pred[i] != y_pred[i-1]:

boundary_points.append(X_test[i])

if boundary_points:

print(f"\nDecision boundaries for k={k}:")

for point in sorted(boundary_points):

print(f"x = {point:.3f}")

else:

print(f"\nNo clear decision boundaries found for k={k}")

def main():

# Generate data

print("Generating dataset...")

X, y = generate_data()

# Split into training and test sets

X_train, y_train = X[:50], y[:50]

X_test, y_test = X[50:], y[50:]


# Sort test data for better visualization

sort_idx = np.argsort(X_test)

X_test = X_test[sort_idx]

# Try different k values

k_values = [1, 2, 3, 4, 5, 20, 30]

for k in k_values:

print(f"\nPerforming classification with k={k}")

# Create and train KNN classifier

knn = KNN(k=k)

knn.fit(X_train, y_train)

# Make predictions

y_pred = knn.predict(X_test)

# Plot results

plot_results(X_train, y_train, X_test, y_pred, k)


# Analyze decision boundaries

analyze_boundary_points(X_test, y_pred, k)

# Calculate and print summary statistics

class1_pred = np.sum(y_pred == 1)

class2_pred = np.sum(y_pred == 2)

print(f"\nPrediction Summary for k={k}:")

print(f"Class 1: {class1_pred} points ({class1_pred/len(y_pred)*100:.1f}


%)")

print(f"Class 2: {class2_pred} points ({class2_pred/len(y_pred)*100:.1f}


%)")

if __name__ == "__main__":

main()


Output


Explanation

1. Core KNN Implementation

 The KNN class implements the K-Nearest Neighbors algorithm with two
main methods:

o fit: Stores training data and labels

o predict: Makes predictions by finding k nearest neighbors and using majority voting

 The algorithm uses absolute distance (np.abs) to measure proximity between points

 For each test point, it finds k closest training points and takes a majority vote
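A minimal sketch of that prediction step for a single query point (toy values; in the lab code the same logic runs inside KNN.predict):

import numpy as np
from collections import Counter

X_train = np.array([0.10, 0.30, 0.45, 0.60, 0.90])
y_train = np.array([1, 1, 1, 2, 2])
x, k = 0.52, 3

distances = np.abs(X_train - x)                    # distance from the query to every training point
nearest = np.argsort(distances)[:k]                # indices of the k closest points
vote = Counter(y_train[nearest]).most_common(1)    # majority class among those neighbours
print(vote[0][0])                                  # -> 1 (neighbours are 0.45, 0.60, 0.30)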

2. Data Generation

 Creates a synthetic dataset with 100 random points in range [0,1]

 First 50 points are labeled based on a simple rule:

o Points ≤ 0.5 get label 1

o Points > 0.5 get label 2

 Data is split into training (first 50 points) and testing (remaining 50 points)

3. Visualization Components

 plot_results function creates visual representation showing:

o Training data points (blue for class 1, red for class 2)

o Predicted classifications (light blue/coral triangles)

o Clear legend and grid for better readability

o Uses different markers for training (circles) vs predictions (triangles)

4. Decision Boundary Analysis

 analyze_boundary_points function:

o Identifies points where predictions change from one class to another

o Prints the x-coordinates of these boundary points

o Helps understand where the algorithm switches between classes

5. Main Execution Flow

 Tests multiple k values: [1, 2, 3, 4, 5, 20, 30]

 For each k value:

o Creates and trains KNN classifier


o Makes predictions on test data

o Visualizes results

o Analyzes decision boundaries

o Prints summary statistics (percentage of each class)

6. Key Features

 Uses numpy for efficient numerical computations

 Implements Counter for majority voting

 Includes comprehensive visualization

 Provides detailed analysis of classification boundaries

 Shows impact of different k values on predictions

7. Insights from Implementation

 Smaller k values lead to more complex decision boundaries

 Larger k values create smoother, more generalized boundaries

 The choice of k significantly impacts classification results

 Visualization helps understand algorithm behavior


Experiment-06

Implement the non-parametric Locally Weighted Regression algorithm in order to fit data points. Select an appropriate data set for your experiment and draw graphs.

Code:

import numpy as np

import matplotlib.pyplot as plt

from sklearn.datasets import make_regression

def generate_sample_data(n_samples=100, noise=10):

"""Generate sample data with non-linear pattern"""

X = np.linspace(0, 10, n_samples)

y = 2 * np.sin(X) + X/2 + np.random.normal(0, noise/10, n_samples)

return X, y

def kernel(x, x_i, tau=0.5):

"""Gaussian kernel function for weight calculation"""

return np.exp(-(x - x_i)**2 / (2 * tau**2))

def lowess(X, y, x_pred, tau=0.5):

"""

Locally Weighted Regression implementation


Parameters:

-----------

X : array-like

Training input features

y : array-like

Target values

x_pred : array-like

Points at which to make predictions

tau : float

Bandwidth parameter controlling smoothness

Returns:

--------

array-like

Predicted values at x_pred points

"""

# Ensure arrays are 1D

X = np.ravel(X)

y = np.ravel(y)


x_pred = np.ravel(x_pred)

y_pred = []

for x in x_pred:

# Calculate weights for all points

weights = kernel(x, X, tau)

# Weighted least squares matrices

W = np.diag(weights)

X_aug = np.column_stack([np.ones_like(X), X]) # Add bias term

# Calculate weighted least squares parameters

theta = np.linalg.inv(X_aug.T @ W @ X_aug) @ X_aug.T @ W @ y

# Make prediction

x_aug = np.array([1, x])

y_pred.append(float(x_aug @ theta))

return np.array(y_pred)


# Generate sample data

np.random.seed(42)

X, y = generate_sample_data(n_samples=100, noise=10)

# Generate points for prediction

X_pred = np.linspace(0, 10, 200)

# Fit LOWESS with different bandwidth parameters

y_pred_smooth = lowess(X, y, X_pred, tau=0.3) # More local fitting

y_pred_medium = lowess(X, y, X_pred, tau=0.8) # Medium smoothing

y_pred_rough = lowess(X, y, X_pred, tau=2.0) # More global fitting

# Plotting

plt.figure(figsize=(12, 6))

plt.scatter(X, y, color='blue', alpha=0.5, label='Data points')

plt.plot(X_pred, y_pred_smooth, 'r-', label='τ = 0.3 (More local)', linewidth=2)

plt.plot(X_pred, y_pred_medium, 'g-', label='τ = 0.8 (Medium)', linewidth=2)

plt.plot(X_pred, y_pred_rough, 'y-', label='τ = 2.0 (More global)', linewidth=2)

plt.xlabel('X')


plt.ylabel('y')

plt.title('Locally Weighted Regression with Different Bandwidth Parameters')

plt.legend()

plt.grid(True)

plt.show()

Output

Explanation

1. Data Generation

def generate_sample_data(n_samples=100, noise=10):

 Creates non-linear sample data using sine function

 Adds random noise to make it more realistic

 Pattern is: y = 2 * sin(x) + x/2 + noise



2. Kernel Function

def kernel(x, x_i, tau=0.5):

return np.exp(-(x - x_i)**2 / (2 * tau**2))

 Implements Gaussian kernel for weight calculation

 Gives higher weights to nearby points

 tau (bandwidth) controls how quickly weight decreases with distance

 Smaller tau = more local fitting

 Larger tau = more global smoothing

3. LOWESS Implementation

def lowess(X, y, x_pred, tau=0.5):

Key steps:

 For each prediction point:

o Calculate weights for all training points using kernel function

o Create weight matrix (W) and augmented feature matrix (X_aug)

o Solve weighted least squares: θ = (X^T W X)^(-1) X^T W y

o Make prediction using calculated parameters
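A minimal sketch of that solve for a single query point x0 (same normal equations as the lab code; np.linalg.solve is used here instead of explicitly inverting the matrix, a common and slightly more stable variant):

import numpy as np

def lwr_predict_one(X, y, x0, tau=0.5):
    w = np.exp(-(X - x0) ** 2 / (2 * tau ** 2))        # Gaussian weights centred on x0
    W = np.diag(w)
    X_aug = np.column_stack([np.ones_like(X), X])      # add bias column
    theta = np.linalg.solve(X_aug.T @ W @ X_aug, X_aug.T @ W @ y)  # (X^T W X) theta = X^T W y
    return np.array([1.0, x0]) @ theta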


4. Visualization Setup

 Generates 100 sample points with noise

 Creates 200 evenly spaced points for prediction curve

 Tests three different bandwidth (tau) values:

o τ = 0.3: More local fitting (follows data closely)

o τ = 0.8: Medium smoothing

o τ = 2.0: More global fitting (smoother curve)

5. Key Characteristics of LOWESS

 Non-parametric regression technique

 Adapts to local structure of data

 Bandwidth parameter controls smoothness:

o Small tau: More flexible, might overfit

o Large tau: Smoother, might underfit

 Computationally intensive (calculates weights for each prediction)

6. Main Differences in Results

 Red line (τ = 0.3): Follows local variations closely

 Green line (τ = 0.8): Balanced between local and global

 Yellow line (τ = 2.0): Shows general trend, ignores local variations


7. Advantages and Disadvantages Advantages:

 No assumption about global function shape

 Handles non-linear relationships well

 Flexible local fitting

Disadvantages:

 Computationally expensive

 Sensitive to bandwidth parameter

 Can perform poorly at boundaries


Experiment-07

Develop a program to demonstrate the working of Linear Regression and Polynomial Regression. Use Boston Housing Dataset for Linear Regression and Auto MPG Dataset (for vehicle fuel efficiency prediction) for Polynomial Regression.

Code:

import pandas as pd

# Load Boston Housing dataset

url = "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"

boston_df = pd.read_csv(url)

# Print column names

print("Available columns in the dataset:")

print(boston_df.columns.tolist())

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression


from sklearn.preprocessing import PolynomialFeatures, StandardScaler

from sklearn.metrics import mean_squared_error, r2_score

import warnings

warnings.filterwarnings('ignore')

# Part 1: Linear Regression with Boston Housing Dataset

print("Part 1: Linear Regression - Boston Housing Dataset")

print("-" * 50)

# Load Boston Housing dataset

url = "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"

boston_df = pd.read_csv(url)

# Features and target (using correct column names)

X_boston = boston_df.drop('medv', axis=1) # All columns except target

y_boston = boston_df['medv'] # median house value

# Print dataset info

print("\nDataset Information:")


print(f"Number of samples: {len(X_boston)}")

print(f"Number of features: {len(X_boston.columns)}")

print("\nFeatures:")

for name in X_boston.columns:

print(f"- {name}")

# Split the data

X_train_boston, X_test_boston, y_train_boston, y_test_boston = train_test_split(
    X_boston, y_boston, test_size=0.2, random_state=42
)
# Scale the features

scaler = StandardScaler()

X_train_boston_scaled = scaler.fit_transform(X_train_boston)

X_test_boston_scaled = scaler.transform(X_test_boston)

# Train Linear Regression model

lr_model = LinearRegression()

lr_model.fit(X_train_boston_scaled, y_train_boston)


# Make predictions

y_pred_boston = lr_model.predict(X_test_boston_scaled)

# Calculate metrics

mse_boston = mean_squared_error(y_test_boston, y_pred_boston)

rmse_boston = np.sqrt(mse_boston)

r2_boston = r2_score(y_test_boston, y_pred_boston)

print("\nLinear Regression Results:")

print(f"Mean Squared Error: {mse_boston:.2f}")

print(f"Root Mean Squared Error: {rmse_boston:.2f}")

print(f"R² Score: {r2_boston:.2f}")

# Feature importance analysis

feature_importance = pd.DataFrame({

'Feature': X_boston.columns,

'Coefficient': lr_model.coef_

})

feature_importance['Abs_Coefficient'] = abs(feature_importance['Coefficient'])

feature_importance = feature_importance.sort_values('Abs_Coefficient',
ascending=False)


print("\nFeature Importance:")

print(feature_importance[['Feature', 'Coefficient']].to_string(index=False))

# Visualize feature importance

plt.figure(figsize=(12, 6))

plt.bar(feature_importance['Feature'], feature_importance['Coefficient'])

plt.xticks(rotation=45)

plt.title('Feature Importance in Boston Housing Price Prediction')

plt.xlabel('Features')

plt.ylabel('Coefficient Value')

plt.tight_layout()

plt.show()

# Plot actual vs predicted values

plt.figure(figsize=(10, 6))

plt.scatter(y_test_boston, y_pred_boston, alpha=0.5)

plt.plot([y_test_boston.min(), y_test_boston.max()], [y_test_boston.min(), y_test_boston.max()], 'r--', lw=2)

plt.xlabel('Actual Prices ($1000s)')

plt.ylabel('Predicted Prices ($1000s)')


plt.title('Actual vs Predicted Housing Prices')

plt.tight_layout()

plt.show()

# Part 2: Polynomial Regression with Auto MPG Dataset

print("\nPart 2: Polynomial Regression - Auto MPG Dataset")

print("-" * 50)

# Load Auto MPG dataset

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'

column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',

'Acceleration', 'Model Year', 'Origin', 'Car Name']

df = pd.read_csv(url, names=column_names, delim_whitespace=True)

# Clean the data

df = df.replace('?', np.nan)

df = df.dropna()

df['Horsepower'] = df['Horsepower'].astype(float)

# Select features for polynomial regression


X_mpg = df[['Horsepower']].values

y_mpg = df['MPG'].values

# Scale features for polynomial regression

scaler_mpg = StandardScaler()

X_mpg_scaled = scaler_mpg.fit_transform(X_mpg)

# Split the data

X_train_mpg, X_test_mpg, y_train_mpg, y_test_mpg = train_test_split(
    X_mpg_scaled, y_mpg, test_size=0.2, random_state=42
)

# Create and train models with different polynomial degrees

degrees = [1, 2, 3]

plt.figure(figsize=(15, 5))

for i, degree in enumerate(degrees, 1):

# Create polynomial features

poly_features = PolynomialFeatures(degree=degree)

X_train_poly = poly_features.fit_transform(X_train_mpg)


X_test_poly = poly_features.transform(X_test_mpg)

# Train model

poly_model = LinearRegression()

poly_model.fit(X_train_poly, y_train_mpg)

# Make predictions

y_pred_poly = poly_model.predict(X_test_poly)

# Calculate metrics

mse_poly = mean_squared_error(y_test_mpg, y_pred_poly)

rmse_poly = np.sqrt(mse_poly)

r2_poly = r2_score(y_test_mpg, y_pred_poly)

print(f"\nPolynomial Regression (degree {degree}) Results:")

print(f"Mean Squared Error: {mse_poly:.2f}")

print(f"Root Mean Squared Error: {rmse_poly:.2f}")

print(f"R² Score: {r2_poly:.2f}")

# Plot results


plt.subplot(1, 3, i)

plt.scatter(X_test_mpg, y_test_mpg, color='blue', alpha=0.5, label='Actual')

# Sort points for smooth curve

X_sort = np.sort(X_test_mpg, axis=0)

X_sort_poly = poly_features.transform(X_sort)

y_sort_pred = poly_model.predict(X_sort_poly)

plt.plot(X_sort, y_sort_pred, color='red', label='Predicted')

plt.xlabel('Horsepower (scaled)')

plt.ylabel('MPG')

plt.title(f'Polynomial Regression (degree {degree})')

plt.legend()

plt.tight_layout()

plt.show()


Output


Explanation

1. Part 1: Linear Regression with Boston Housing Dataset

Key Components:

 Uses the Boston Housing dataset to predict house prices

 Features include various neighborhood characteristics

 Target variable is 'medv' (median house value)


Implementation Steps:

# Data Preparation

- Loads dataset from URL

- Splits features (X) and target (y)

- Uses train_test_split for data division

- Applies StandardScaler for feature normalization

# Model Training

- Creates LinearRegression model

- Fits model on scaled training data

- Makes predictions on test set

# Evaluation

- Calculates MSE, RMSE, and R² metrics

- Analyzes feature importance through coefficients

- Visualizes feature importance with bar plot

- Creates actual vs predicted scatter plot

2. Part 2: Polynomial Regression with Auto MPG Dataset

Key Components:

 Uses Auto MPG dataset to predict fuel efficiency


 Focuses on Horsepower as main feature

 Tests three polynomial degrees (1, 2, 3)

Implementation Steps:

# Data Preparation

- Loads and cleans MPG dataset

- Handles missing values ('?')

- Scales features using StandardScaler

# Model Training

- Creates polynomial features for each degree

- Trains separate models for each degree

- Makes predictions using each model

# Evaluation

- Calculates metrics for each polynomial degree

- Creates subplots showing fit for each degree

- Compares performance across degrees
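To make the "polynomial features" step concrete, here is a tiny sketch of what PolynomialFeatures does to a single column before LinearRegression is fitted (toy values, degree 2):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[1.0], [2.0], [3.0]])      # one feature, three samples
print(PolynomialFeatures(degree=2).fit_transform(x))
# [[1. 1. 1.]
#  [1. 2. 4.]
#  [1. 3. 9.]]   columns are [1, x, x^2]; the linear model then fits a curve in x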

3. Key Visualizations:

 Feature importance bar chart for Boston Housing

 Actual vs predicted scatter plot for house prices


 Three subplots showing polynomial fits of different degrees

4. Important Metrics Tracked:

 Mean Squared Error (MSE)

 Root Mean Squared Error (RMSE)

 R² Score (coefficient of determination)
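A quick sketch of how the three metrics relate (toy numbers, using the same sklearn calls as the code above):

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors = 0.375
rmse = np.sqrt(mse)                        # back in the target's units, about 0.61
r2 = r2_score(y_true, y_pred)              # 1 - SS_res/SS_tot = 0.925
print(mse, rmse, r2)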

5. Key Insights:

 Shows how feature scaling improves model performance

 Demonstrates overfitting risk with higher polynomial degrees

 Illustrates importance of different features in housing prices


Experiment-08

Develop a program to demonstrate the working of the decision tree algorithm. Use
Breast Cancer Data set for building the decision tree and apply this knowledge to
classify a new sample.

Code:

from sklearn.datasets import load_breast_cancer

from sklearn.model_selection import train_test_split

from sklearn.tree import DecisionTreeClassifier, plot_tree

from sklearn.metrics import classification_report, confusion_matrix

import matplotlib.pyplot as plt

import numpy as np

import seaborn as sns

# Load the breast cancer dataset

data = load_breast_cancer()

X = data.data

y = data.target

# Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# Create and train the decision tree classifier

dt_classifier = DecisionTreeClassifier(max_depth=4, random_state=42)

dt_classifier.fit(X_train, y_train)

# Make predictions on the test set

y_pred = dt_classifier.predict(X_test)

# Print model performance metrics

print("Model Performance Metrics:")

print("\nClassification Report:")

print(classification_report(y_test, y_pred, target_names=['Malignant', 'Benign']))

# Create confusion matrix visualization

plt.figure(figsize=(10, 8))

cm = confusion_matrix(y_test, y_pred)

sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')

plt.title('Confusion Matrix')

plt.ylabel('True Label')

plt.xlabel('Predicted Label')


# Visualize the decision tree

plt.figure(figsize=(20,10))

plot_tree(dt_classifier, feature_names=data.feature_names,

class_names=['Malignant', 'Benign'], filled=True, rounded=True)

plt.title('Decision Tree Visualization')

# Function to classify a new sample

def classify_new_sample(sample, feature_names=data.feature_names):

"""

Classify a new sample using the trained decision tree model.

Parameters:

sample (list or array): List of feature values in the same order as the training
data

feature_names (list): List of feature names for reference

Returns:

tuple: (prediction, probability)

"""

sample = np.array(sample).reshape(1, -1)

prediction = dt_classifier.predict(sample)


probability = dt_classifier.predict_proba(sample)

print("\nClassification Results:")

print(f"Prediction: {'Benign' if prediction[0] == 1 else 'Malignant'}")

print(f"Probability: Malignant: {probability[0][0]:.2f}, Benign:


{probability[0][1]:.2f}")

# Print feature importance for this prediction

print("\nTop 5 Most Important Features:")

importances = dict(zip(feature_names, dt_classifier.feature_importances_))

sorted_importances = sorted(importances.items(), key=lambda x: x[1], reverse=True)[:5]

for feature, importance in sorted_importances:

print(f"{feature}: {importance:.4f}")

return prediction[0], probability[0]

# Example of using the classifier with a new sample

# Using mean values from the dataset as an example

example_sample = X_train.mean(axis=0)

print("\nExample Classification:")


classify_new_sample(example_sample)

Output


Explanation

1. Data Preparation and Model Setup

# Loads breast cancer dataset from sklearn

# Features: Various cell nucleus measurements

# Target: Binary (Malignant/Benign)

# Splits data: 80% training, 20% testing

2. Model Configuration

 Uses DecisionTreeClassifier with:

o max_depth=4 (prevents overfitting)

o random_state=42 (reproducibility)

 Fits model using training data

3. Performance Evaluation Components:

 Classification Report shows:

o Precision: Accuracy of positive predictions

o Recall: Ability to find all positive cases

o F1-score: Balance between precision and recall

o Support: Number of samples per class
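A small sketch of how precision, recall, and F1 fall out of a 2x2 confusion matrix (toy counts, not the actual experiment output):

TP, FP, FN = 68, 3, 4                               # hypothetical counts for the 'Benign' class

precision = TP / (TP + FP)                          # about 0.958: how many predicted positives are correct
recall = TP / (TP + FN)                             # about 0.944: how many real positives were found
f1 = 2 * precision * recall / (precision + recall)  # about 0.951: harmonic mean of the two
print(precision, recall, f1)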

4. Visualization Elements:

 Confusion Matrix Heatmap:


o Shows true vs predicted labels

o Blue intensity indicates number of cases

o Numbers show exact count of predictions

 Decision Tree Visualization:

o Shows complete tree structure

o max_depth=4 keeps it interpretable

o Color-coded nodes show class distribution

o Shows feature splits and thresholds

5. Sample Classification Function

def classify_new_sample(sample, feature_names):

Provides:

 Binary prediction (Malignant/Benign)

 Probability scores for each class

 Top 5 most influential features

 Feature importance scores

6. Key Features:

 Binary Classification Task

 Interpretable Model Structure

 Feature Importance Analysis

 Probability Estimates


 Visual Decision Path

7. Use Cases:

 Medical Diagnosis Support

 Feature Importance Understanding

 Risk Assessment

 Decision Process Visualization


Experiment-09

Develop a program to implement the Naive Bayesian classifier considering Olivetti Face
Data set for training.

Compute the accuracy of the classifier, considering a few test data sets.

Code:

import numpy as np

from sklearn.datasets import fetch_olivetti_faces

from sklearn.model_selection import train_test_split, cross_val_score

from sklearn.naive_bayes import GaussianNB

from sklearn.metrics import classification_report, confusion_matrix

import matplotlib.pyplot as plt

import seaborn as sns

# Load the Olivetti faces dataset

faces = fetch_olivetti_faces()

X = faces.data

y = faces.target

# Function to display sample faces

def display_sample_faces(X, y, num_samples=5):

"""Display sample faces from the dataset"""


fig, axes = plt.subplots(1, num_samples, figsize=(12, 3))

for i, ax in enumerate(axes):

ax.imshow(X[i].reshape(64, 64), cmap='gray')

ax.set_title(f'Person {y[i]}')

ax.axis('off')

plt.tight_layout()

plt.show()

# Split the dataset into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Naive Bayes classifier

nb_classifier = GaussianNB()

nb_classifier.fit(X_train, y_train)

# Make predictions

y_pred = nb_classifier.predict(X_test)

# Calculate accuracy

accuracy = nb_classifier.score(X_test, y_test)


# Perform cross-validation

cv_scores = cross_val_score(nb_classifier, X, y, cv=5)

# Print performance metrics

print("Performance Metrics:")

print(f"\nAccuracy on test set: {accuracy:.4f}")

print("\nCross-validation scores:")

for i, score in enumerate(cv_scores, 1):

print(f"Fold {i}: {score:.4f}")

print(f"Mean CV Score: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")

print("\nClassification Report:")

print(classification_report(y_test, y_pred))

# Create confusion matrix visualization

plt.figure(figsize=(12, 8))

cm = confusion_matrix(y_test, y_pred)

sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')

plt.title('Confusion Matrix')


plt.ylabel('True Label')

plt.xlabel('Predicted Label')

# Function to test the classifier on specific samples

def test_specific_samples(classifier, X_test, y_test, num_samples=5):

"""Test the classifier on specific samples and display results"""

# Randomly select samples

indices = np.random.choice(len(X_test), num_samples, replace=False)

X_samples = X_test[indices]

y_true = y_test[indices]

# Make predictions

y_pred = classifier.predict(X_samples)

probabilities = classifier.predict_proba(X_samples)

# Display results

fig, axes = plt.subplots(2, num_samples, figsize=(15, 6))

for i in range(num_samples):

# Display the face

axes[0, i].imshow(X_samples[i].reshape(64, 64), cmap='gray')


axes[0, i].axis('off')

# Display prediction information

axes[1, i].axis('off')

prediction_text = f'True: {y_true[i]}\nPred: {y_pred[i]}\n'

prediction_text += f'Prob: {probabilities[i][y_pred[i]]:.2f}'

axes[1, i].text(0.5, 0.5, prediction_text,

ha='center', va='center')

# Add color coding for correct/incorrect predictions

if y_true[i] == y_pred[i]:

axes[0, i].set_title('Correct', color='green')

else:

axes[0, i].set_title('Incorrect', color='red')

plt.tight_layout()

plt.show()

# Display sample faces from the dataset

print("\nDisplaying sample faces from the dataset:")

display_sample_faces(X, y)

# Test the classifier on specific samples

print("\nTesting classifier on specific samples:")

test_specific_samples(nb_classifier, X_test, y_test)

# Function to analyze misclassifications

def analyze_misclassifications(X_test, y_test, y_pred):

"""Analyze and display misclassified samples"""

misclassified = X_test[y_test != y_pred]

true_labels = y_test[y_test != y_pred]

pred_labels = y_pred[y_test != y_pred]

print(f"\nTotal misclassifications: {len(misclassified)}")

# Display some misclassified examples

num_display = min(5, len(misclassified))

if num_display > 0:

fig, axes = plt.subplots(1, num_display, figsize=(12, 3))

for i in range(num_display):

if num_display == 1:

ax = axes

else:

ax = axes[i]

ax.imshow(misclassified[i].reshape(64, 64), cmap='gray')

ax.set_title(f'True: {true_labels[i]}\nPred: {pred_labels[i]}')

ax.axis('off')

plt.tight_layout()

plt.show()

# Analyze misclassifications

print("\nAnalyzing misclassifications:")

analyze_misclassifications(X_test, y_test, y_pred)

Output

Explanation

1. Dataset and Setup

 Uses Olivetti faces dataset (400 images of 40 people)

 Each image is 64x64 pixels in grayscale

 Features are flattened pixel values

 Target is person identifier (0-39)
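
A quick way to confirm these properties after loading the dataset is to inspect the array shapes and label range; a minimal sketch (separate from the experiment code, variable names are illustrative):

from sklearn.datasets import fetch_olivetti_faces

faces = fetch_olivetti_faces()
print(faces.data.shape)     # (400, 4096): 400 flattened 64x64 images
print(faces.images.shape)   # (400, 64, 64): the same images in 2-D form
print(faces.target.min(), faces.target.max())   # 0 39: person identifiers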

2. Key Functions:

a) Display Sample Faces:

def display_sample_faces(X, y, num_samples=5):

 Shows sample faces from dataset

 Displays grayscale images with person ID

 Helps visualize input data

b) Test Specific Samples:

def test_specific_samples(classifier, X_test, y_test, num_samples=5):

 Tests classifier on random samples

 Shows both image and predictions

 Color codes correct (green) vs incorrect (red) predictions

 Displays prediction probabilities

c) Analyze Misclassifications:

def analyze_misclassifications(X_test, y_test, y_pred):

 Identifies misclassified faces

 Shows true vs predicted labels

 Helps understand where model fails

3. Model Implementation

 Uses GaussianNB (Gaussian Naive Bayes)

 Performs 80-20 train-test split

 Includes cross-validation (5 folds)
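
GaussianNB assumes each pixel follows a class-conditional Gaussian, so after fitting it simply stores one mean and one variance per pixel for every person. A small sketch of inspecting those learned parameters (attribute names follow current scikit-learn; older releases expose sigma_ instead of var_):

from sklearn.datasets import fetch_olivetti_faces
from sklearn.naive_bayes import GaussianNB

faces = fetch_olivetti_faces()
gnb = GaussianNB().fit(faces.data, faces.target)

print(gnb.theta_.shape)   # (40, 4096): mean pixel value per person
print(gnb.var_.shape)     # (40, 4096): pixel variance per person

# The per-class mean can be viewed as an "average face" for that person
average_face_0 = gnb.theta_[0].reshape(64, 64)
print(average_face_0.shape)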

4. Performance Evaluation:

 Accuracy on test set

 Cross-validation scores

 Detailed classification report

 Confusion matrix visualization

 Misclassification analysis
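
The confusion matrix also yields a per-person accuracy by dividing its diagonal by the row sums; a short sketch reusing y_test and y_pred from the experiment code (a row sum can be zero if a person has no test images, hence the errstate guard):

import numpy as np
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
with np.errstate(divide='ignore', invalid='ignore'):
    per_person_acc = cm.diagonal() / cm.sum(axis=1)

hardest = np.argsort(per_person_acc)[:5]
print("Hardest people to recognise:", hardest)
print("Their per-person accuracy:", per_person_acc[hardest].round(2))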

5. Visualization Components:

 Sample face display

 Confusion matrix heatmap

 Test results with probability scores

 Misclassified examples

6. Key Features:

 Face recognition capability

 Probability estimation

 Error analysis

 Visual result presentation

 Cross-validation performance

7. Notable Aspects:

 Handles high-dimensional data (4096 pixels); a PCA variation is sketched after this list

 Provides probability estimates

 Visual feedback for predictions

 Comprehensive error analysis
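
Because the 4096 raw pixels are strongly correlated, the Naive Bayes independence assumption is strained; one common variation (not part of the lab code) is to compress the images with PCA before the Gaussian Naive Bayes step. A hedged sketch of that idea, with 100 components as an illustrative choice:

from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

faces = fetch_olivetti_faces()
# PCA compresses correlated pixels into a smaller set of components, then GaussianNB classifies
model = make_pipeline(PCA(n_components=100, whiten=True, random_state=42), GaussianNB())
scores = cross_val_score(model, faces.data, faces.target, cv=5)
print(f"Mean CV accuracy with PCA + GaussianNB: {scores.mean():.4f}")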

Experiment-10

Develop a program to implement k-means clustering using the Wisconsin Breast Cancer data set and visualize the clustering result.

Code:

import numpy as np

import pandas as pd

from sklearn.datasets import load_breast_cancer

from sklearn.preprocessing import StandardScaler

from sklearn.cluster import KMeans

from sklearn.decomposition import PCA

from sklearn.metrics import silhouette_score

import matplotlib.pyplot as plt

import seaborn as sns

# Load the Wisconsin Breast Cancer dataset

data = load_breast_cancer()

X = data.data

y = data.target # We'll use this only for evaluation

# Create a DataFrame with feature names

df = pd.DataFrame(X, columns=data.feature_names)

# Standardize the features

scaler = StandardScaler()

X_scaled = scaler.fit_transform(X)

# Function to determine optimal k using elbow method

def plot_elbow_curve(X, max_k=10):

inertias = []

silhouette_scores = []

k_values = range(2, max_k + 1)

for k in k_values:

kmeans = KMeans(n_clusters=k, random_state=42)

kmeans.fit(X)

inertias.append(kmeans.inertia_)

silhouette_scores.append(silhouette_score(X, kmeans.labels_))

# Plot elbow curve

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Inertia plot

ax1.plot(k_values, inertias, 'bo-')

ax1.set_xlabel('Number of Clusters (k)')

ax1.set_ylabel('Inertia')

ax1.set_title('Elbow Method')

# Silhouette score plot

ax2.plot(k_values, silhouette_scores, 'ro-')

ax2.set_xlabel('Number of Clusters (k)')

ax2.set_ylabel('Silhouette Score')

ax2.set_title('Silhouette Analysis')

plt.tight_layout()

plt.show()

return k_values[np.argmax(silhouette_scores)]

# Find optimal k

optimal_k = plot_elbow_curve(X_scaled)

print(f"\nOptimal number of clusters based on silhouette score: {optimal_k}")

# Perform k-means clustering with optimal k

kmeans = KMeans(n_clusters=optimal_k, random_state=42)

cluster_labels = kmeans.fit_predict(X_scaled)

# Perform PCA for visualization

pca = PCA(n_components=2)

X_pca = pca.fit_transform(X_scaled)

# Create visualization of clusters

plt.figure(figsize=(12, 8))

scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=cluster_labels, cmap='viridis')

plt.title('K-means Clustering Results (PCA-reduced data)')

plt.xlabel('First Principal Component')

plt.ylabel('Second Principal Component')

plt.colorbar(scatter, label='Cluster')

plt.show()

# Compare clustering results with actual diagnosis

comparison_df = pd.DataFrame({

'Cluster': cluster_labels,

'Actual_Diagnosis': y

})

print("\nCluster vs Actual Diagnosis Distribution:")

print(pd.crosstab(comparison_df['Cluster'], comparison_df['Actual_Diagnosis']))

# Analyze cluster characteristics

def analyze_clusters(X, labels, feature_names):

"""Analyze and visualize characteristics of each cluster"""

# Create DataFrame with features and cluster labels

df_analysis = pd.DataFrame(X, columns=feature_names)

df_analysis['Cluster'] = labels

# Calculate mean values for each feature in each cluster

cluster_means = df_analysis.groupby('Cluster').mean()

# Create heatmap of cluster characteristics

plt.figure(figsize=(15, 8))

sns.heatmap(cluster_means, cmap='coolwarm', center=0, annot=True, fmt='.2f', xticklabels=True, yticklabels=True)

plt.title('Cluster Characteristics (Feature Means)')

plt.xticks(rotation=45, ha='right')

plt.tight_layout()

plt.show()

return cluster_means

# Analyze cluster characteristics

print("\nAnalyzing cluster characteristics:")

cluster_means = analyze_clusters(X_scaled, cluster_labels, data.feature_names)

# Visualize feature importance for clustering

def plot_feature_importance(kmeans, feature_names):

"""Plot feature importance based on cluster centroids"""

# Calculate the variance of centroids for each feature

centroid_variance = np.var(kmeans.cluster_centers_, axis=0)

# Create DataFrame for feature importance

feature_importance = pd.DataFrame({

'Feature': feature_names,

'Importance': centroid_variance

}).sort_values('Importance', ascending=False)

# Plot feature importance

plt.figure(figsize=(12, 6))

sns.barplot(x='Importance', y='Feature', data=feature_importance.head(10))

plt.title('Top 10 Most Important Features for Clustering')

plt.tight_layout()

plt.show()

return feature_importance

# Plot feature importance

print("\nAnalyzing feature importance:")

feature_importance = plot_feature_importance(kmeans, data.feature_names)

# Function to predict cluster for new samples

def predict_cluster(sample, scaler, kmeans, feature_names):

"""Predict cluster for a new sample"""

# Ensure sample is in correct format

if isinstance(sample, list):

sample = np.array(sample).reshape(1, -1)

# Scale the sample

sample_scaled = scaler.transform(sample)

# Predict cluster

cluster = kmeans.predict(sample_scaled)[0]

# Get distances to all cluster centers

distances = kmeans.transform(sample_scaled)[0]

print(f"\nPredicted Cluster: {cluster}")

print("\nDistances to cluster centers:")

for i, dist in enumerate(distances):

print(f"Cluster {i}: {dist:.2f}")

return cluster, distances

# Example of using the prediction function

print("\nExample prediction for a new sample:")

example_sample = X[0:1] # Using first sample as example

predicted_cluster, distances = predict_cluster(example_sample, scaler, kmeans, data.feature_names)

Output

Explanation

1. Data Preparation:

# Loads breast cancer dataset

# Standardizes features using StandardScaler

# Creates DataFrame with feature names
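
Standardization matters here because k-means works on Euclidean distances, so large-scale features such as 'mean area' would otherwise dominate small-scale ones. A quick check that scaling behaved as expected, reusing X_scaled from the experiment code above:

import numpy as np

print(np.allclose(X_scaled.mean(axis=0), 0))   # column means are ~0
print(np.allclose(X_scaled.std(axis=0), 1))    # column standard deviations are ~1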

2. Key Functions:

a) Elbow Method Analysis:

def plot_elbow_curve(X, max_k=10):

 Determines optimal number of clusters

 Plots inertia (within-cluster sum of squares)

 Calculates silhouette scores

 Returns optimal k based on silhouette analysis
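
The "elbow" is simply the k beyond which an extra cluster stops buying a large drop in inertia, and those drops can also be read off numerically. A small sketch that recomputes the inertias on X_scaled from the experiment code (the k range and variable names are illustrative):

import numpy as np
from sklearn.cluster import KMeans

ks = range(2, 11)
inertias = [KMeans(n_clusters=k, random_state=42, n_init=10).fit(X_scaled).inertia_ for k in ks]

drops = -np.diff(inertias)   # how much inertia fell when k increased by one
for k, drop in zip(list(ks)[1:], drops):
    print(f"k={k}: inertia drop {drop:.0f}")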

b) Cluster Analysis:

def analyze_clusters(X, labels, feature_names):

 Calculates mean values for each feature per cluster

 Creates heatmap of cluster characteristics

 Shows feature patterns in each cluster

c) Feature Importance:

def plot_feature_importance(kmeans, feature_names):

 Calculates feature importance based on centroid variance

 Visualizes top 10 most important features

 Helps understand which features drive clustering

3. Visualization Components:

 Elbow curve and silhouette score plots

 PCA-reduced cluster visualization

 Cluster characteristics heatmap

 Feature importance bar plot

4. Model Implementation:

 Uses optimal k from silhouette analysis

 Performs clustering on standardized data

 Reduces dimensionality with PCA for visualization

 Compares clusters with actual diagnosis
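
Since the true diagnosis labels are available, the agreement between cluster assignments and diagnosis can also be summarized in a single number such as the adjusted Rand index; an optional check (not part of the lab code), reusing cluster_labels and y from the experiment code:

from sklearn.metrics import adjusted_rand_score

# 1.0 means clusters match the diagnosis labels exactly, 0 means chance-level agreement
ari = adjusted_rand_score(y, cluster_labels)
print(f"Adjusted Rand index between clusters and diagnosis: {ari:.3f}")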

5. Cluster Prediction:

def predict_cluster(sample, scaler, kmeans, feature_names):

 Predicts cluster for new samples

 Shows distances to all cluster centers

 Provides confidence measure through distances

6. Key Features:

 Automatic optimal cluster selection

 Dimensionality reduction for visualization

 Comprehensive cluster analysis

 Feature importance ranking

 New sample prediction capability

7. Analysis Components:

 Cluster vs actual diagnosis comparison (see the sketch after this list)

 Cluster characteristic analysis

 Feature importance visualization

 Distance-based prediction confidence
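
One way to read the cluster vs diagnosis cross-tabulation is to map each cluster to its majority diagnosis and measure how often that mapping is right; a minimal sketch reusing cluster_labels and y from the experiment code (in this dataset, target 0 = malignant and 1 = benign):

import pandas as pd

df_map = pd.DataFrame({'Cluster': cluster_labels, 'Diagnosis': y})

# Majority diagnosis within each cluster
majority = df_map.groupby('Cluster')['Diagnosis'].agg(lambda s: s.mode().iloc[0])

# Fraction of samples whose diagnosis matches their cluster's majority label
agreement = (df_map['Cluster'].map(majority) == df_map['Diagnosis']).mean()
print(f"Agreement with majority-label mapping: {agreement:.3f}")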
