
LABORATORY MANUAL

BCSL606 - Machine Learning lab

DEPARTMENT OF INFORMATION SCIENCE & ENGINEERING


ATRIA INSTITUTE OF TECHNOLOGY
Adjacent to Bangalore Baptist Hospital
Hebbal, Bengaluru-560024
Department of
Information Science and Engineering

Vision
To develop competent professionals with strong fundamentals in Information
Science and Engineering, interdisciplinary research and ethical values for the
betterment of the society.

Mission
M1- To establish a transformational learning ambience with good infrastructure
facilities to impart knowledge and the necessary skill set to produce competent
professionals.

M2- To create a new generation of engineers who excel in their career with
leadership/entrepreneur qualities.

M3- To promote sustained research and innovation with an emphasis on ethical values.

Syllabus
Course Title: Machine Learning Lab            Semester: 6
Course Code: BCSL606                          CIE Marks: 50
Teaching Hours/Week (L:T:P:S): 0:0:2:0        SEE Marks: 50
Credits: 01                                   Total Marks: 100
Examination nature (SEE): Practical
Course objectives:
1. To become familiar with data and visualize univariate, bivariate, and multivariate data using statistical
techniques and dimensionality reduction.
2. To understand various machine learning algorithms such as similarity-based learning, regression, decision
trees, and clustering.
3. To familiarize with learning theories, probability-based models and developing the skills required for
decision-making in dynamic environments.
SETTING UP BASIC COMMANDS:

Step 1: Update System and Install Python


sudo apt update && sudo apt upgrade -y
sudo apt install python3 python3-pip python3-venv -y
Ensures that Python 3 and pip (package manager) are installed and updated.

Step 2: Verify Python and Pip Installation


python3 --version
pip3 --version
Checks the installed versions of Python and pip.

Step 3: Create a Virtual Environment (Recommended)


python3 -m venv ml_env
source ml_env/bin/activate # Activate the virtual environment
Creates an isolated Python environment (ml_env) for Machine Learning projects.
Activating the environment ensures package installations don't affect system-wide Python.

Step 4: Install Essential ML Libraries


pip3 install numpy pandas matplotlib seaborn scikit-learn scipy
Installs common Machine Learning libraries:
• numpy → Numerical operations
• pandas → Data handling
• matplotlib & seaborn → Data visualization
• scikit-learn → Machine Learning algorithms
• scipy → Scientific computing

Step 5: Verify Library Installations


python3 -c "import numpy, pandas, matplotlib, seaborn, sklearn, scipy; print('ML Libraries Installed Successfully!')"
Confirms that all necessary ML libraries are installed correctly.

Step 6: Running a Python Script


Create a simple test script (test_ml.py) to check ML setup:
nano test_ml.py
Copy and paste the Python code (a sample test script is shown at the end of this step).
Save the file (CTRL+X, then Y, then Enter).
Run the script:
python3 test_ml.py
If everything is set up correctly, it should print accuracy and show a histogram plot.
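A minimal test_ml.py along these lines might look like the sketch below (illustrative only; it assumes the Iris dataset bundled with scikit-learn, trains a simple logistic-regression classifier, prints its accuracy, and shows a histogram of one feature):

# test_ml.py - quick sanity check of the ML environment (illustrative sketch)
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)  # Small dataset bundled with scikit-learn, no download needed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=200)  # Simple classifier to exercise scikit-learn
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))

plt.hist(X[:, 0], bins=20, edgecolor='black')  # Histogram of sepal length to exercise matplotlib
plt.title("Histogram of Sepal Length (Iris)")
plt.show()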

Step 7: Deactivate Virtual Environment (If used)


deactivate
Closes the virtual environment after running ML programs.

Course outcomes (Course Skill Set):


At the end of the course the student will be able to:
1. Illustrate the principles of multivariate data and apply dimensionality reduction techniques.
2. Demonstrate similarity-based learning methods and perform regression analysis.
3. Develop decision trees for classification and regression problems, and Bayesian models for probabilistic
learning.
4. Implement the clustering algorithms to share computing resources.

Assessment Details (both CIE and SEE)


The weightage of Continuous Internal Evaluation (CIE) is 50% and the weightage of the Semester End Exam (SEE) is 50%. The minimum passing mark for the CIE is 40% of the maximum marks (20 marks out of 50), and the minimum passing mark for the SEE is 35% of the maximum marks (18 out of 50 marks). A student shall be deemed to have satisfied the academic requirements and earned the credits allotted to each subject/course if the student secures a minimum of 40% (40 marks out of 100) in the sum total of the CIE (Continuous Internal Evaluation) and SEE (Semester End Examination) taken together.

Continuous Internal Evaluation (CIE):


CIE marks for the practical course are 50 Marks.
The split-up of CIE marks for record/journal and test is in the ratio 60:40.
1. Each experiment is to be evaluated for conduction with an observation sheet and record write-up.
Rubrics for the evaluation of the journal/write-up for hardware/software experiments are designed
by the faculty who is handling the laboratory session and are made known to students at the
beginning of the practical session.
2. Record should contain all the specified experiments in the syllabus and each experiment write-up will
be evaluated for 10 marks.
3. Total marks scored by the students are scaled down to 30 marks (60% of maximum marks).
4. Weightage to be given for neatness and submission of record/write-up on time.
5. Department shall conduct a test of 100 marks after the completion of all the experiments listed in the
syllabus.
6. In a test, test write-up, conduction of experiment, acceptable result, and procedural knowledge will
carry a weightage of 60% and the rest 40% for viva-voce.
7. The suitable rubrics can be designed to evaluate each student’s performance and learning ability.
8. The marks scored shall be scaled down to 20 marks (40% of the maximum marks).
The Sum of scaled-down marks scored in the report write-up/journal and marks of a test is the total CIE
marks scored by the student.

Semester End Evaluation (SEE):


1. SEE marks for the practical course are 50 Marks.
2. SEE shall be conducted jointly by two examiners of the same institute; the examiners are appointed
by the Head of the Institute.
3. The examination schedule and the names of the examiners are communicated to the university before the
conduct of the examination. These practical examinations are to be conducted within the
schedule mentioned in the academic calendar of the University.
4. All laboratory experiments are to be included for practical examination.
5. (Rubrics) Breakup of marks and the instructions printed on the cover page of the answer script to
be strictly adhered to by the examiners. OR based on the course requirement evaluation rubrics
shall be decided jointly by examiners.
6. Students can pick one question (experiment) from the questions lot prepared by the examiners
jointly.
7. Evaluation of test write-up/ conduction procedure and result/viva will be conducted jointly by
examiners.
8. General rubrics suggested for SEE are as follows: write-up 20%, conduction procedure and result 60%,
and viva-voce 20% of the maximum marks. SEE for practical shall be evaluated for 100 marks and the scored
marks shall be scaled down to 50 marks (however, based on the course type, rubrics shall be decided jointly by the
examiners).
Change of experiment is allowed only once, and 15% of the marks allotted to the procedure part are to be
made zero.
The minimum duration of SEE is 02 hours.
Suggested Learning Resources:
Books:
1. S Sridhar and M Vijayalakshmi, “Machine Learning”, Oxford University Press, 2021.
2. M N Murty and Ananthanarayana V S, “Machine Learning: Theory and Practice”, Universities Press (India) Pvt.
Limited, 2024.
Web links and Video Lectures (e-Resources):
1. https://www.drssridhar.com/?page_id=1053
2. https://www.universitiespress.com/resources?id=9789393330697
3. https://onlinecourses.nptel.ac.in/noc23_cs18/preview


CONTENTS

Sl. No.  Name of Experiment
1        Program No. 1
2        Program No. 2
3        Program No. 3
4        Program No. 4
5        Program No. 5
6        Program No. 6
7        Program No. 7
8        Program No. 8
9        Program No. 9
10       Program No. 10
11       Summary of Essential Commands
12       Viva Questions


Program 1: Develop a program to create histograms for all numerical features and
analyze the distribution of each feature. Generate box plots for all numerical features
and identify any outliers. Use California Housing dataset.

Code:
# Import necessary libraries
import numpy as np # For numerical computations
import pandas as pd # For handling tabular data
import matplotlib.pyplot as plt # For data visualization
import seaborn as sns # For enhanced statistical plots

# Step 1: Load the dataset


# Read the CSV file into a Pandas DataFrame from the specified path
data = pd.read_csv("/content/drive/MyDrive/dataset/6TH SEM/california_housing.csv")
df = pd.DataFrame(data) # Convert into DataFrame for easy handling

# Step 2: Display basic dataset information


print("Dataset Overview:")
print(df.info()) # Print dataset information (column names, data types, non-null values)
print("\nFirst 5 rows:\n", df.head()) # Print the first 5 rows of the dataset

# Step 3: Create histograms for all numerical features to analyze distributions


plt.figure(figsize=(12, 8)) # Set figure size for better visibility
df.hist(bins=20, figsize=(12, 8), edgecolor='black', grid=False) # Generate histograms with 20 bins
plt.suptitle("Histograms of Numerical Features", fontsize=16) # Add title for the histograms
plt.show() # Display the histograms

# Step 4: Create box plots to identify potential outliers


plt.figure(figsize=(12, 8)) # Set figure size for better clarity
for i, column in enumerate(df.columns):  # Loop through each numerical column
    plt.subplot(3, 3, i + 1)  # Arrange box plots in a grid (3x3)
    sns.boxplot(y=df[column], color="lightblue")  # Create a vertical box plot with light blue color
    plt.title(column)  # Set title for each box plot
plt.tight_layout()  # Adjust layout to prevent overlap
plt.suptitle("Box Plots of Numerical Features", fontsize=16, y=1.02)  # Add title for box plots
plt.show()  # Display the box plots

# Step 5: Generate summary statistics to analyze feature distributions


print("\nSummary Statistics:")
print(df.describe()) # Print summary statistics (mean, std, min, max, quartiles)

# Step 6: Identify outliers using Interquartile Range (IQR) method


outlier_counts = {} # Dictionary to store outlier counts per feature
for column in df.columns:  # Loop through each column in the dataset
    Q1 = df[column].quantile(0.25)  # Calculate the first quartile (25th percentile)
    Q3 = df[column].quantile(0.75)  # Calculate the third quartile (75th percentile)
    IQR = Q3 - Q1  # Compute the Interquartile Range (IQR)
    lower_bound = Q1 - 1.5 * IQR  # Calculate lower bound for outliers
    upper_bound = Q3 + 1.5 * IQR  # Calculate upper bound for outliers
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)][column]  # Find outlier values
    outlier_counts[column] = len(outliers)  # Store the number of outliers for each column

# Step 7: Print the number of outliers detected per feature


print("\nOutliers Detected per Feature:")
for feature, count in outlier_counts.items():  # Loop through detected outliers
    print(f"{feature}: {count} outliers")  # Print feature name and outlier count
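If the CSV at the Google Drive path above is not available, the same dataset can be loaded directly through scikit-learn instead (a sketch; fetch_california_housing downloads the data on first use):

# Optional alternative: load the California Housing data from scikit-learn instead of a local CSV
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing(as_frame=True)  # Returns the dataset with a ready-made DataFrame
df = housing.frame  # Eight numeric features plus the MedHouseVal target column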

OUTPUT:


Program 2: Develop a program to compute the correlation matrix to understand the relationships between pairs of features. Visualize the correlation matrix using a heatmap to know which variables have strong positive/negative correlations. Create a pair plot to visualize pairwise relationships between features. Use California Housing dataset.

Code:
# Import necessary libraries
import pandas as pd # For handling tabular data efficiently
import seaborn as sns # For visualization, including correlation heatmaps and pair plots
import matplotlib.pyplot as plt # For handling plots and customizing visualizations

# Step 1: Load the dataset


data = pd.read_csv("/content/drive/MyDrive/dataset/6TH SEM/california_housing.csv")  # Load the dataset from the given path
df = pd.DataFrame(data)  # Convert the dataset into a pandas DataFrame for easier manipulation

# Step 2: Compute the correlation matrix


# The correlation matrix helps to understand the relationship between numerical features
correlation_matrix = df.corr() # Compute pairwise correlation between features

# Step 3: Visualize the correlation matrix using a heatmap


plt.figure(figsize=(12, 9)) # Set the figure size for better visibility
sns.heatmap(correlation_matrix, # Create a heatmap of the correlation matrix
annot=True, # Display the correlation values inside the heatmap cells
cmap="coolwarm", # Use "coolwarm" color scheme for better contrast
fmt=".2f", # Format correlation values to 2 decimal places
linewidths=0.5) # Add thin lines between cells for better clarity
plt.title("Correlation Matrix Heatmap", fontsize=16) # Add a title to the heatmap
plt.show() # Display the heatmap

# Step 4: Create a pair plot


# A pair plot visualizes pairwise relationships between all numerical features
pairplot = sns.pairplot(df, plot_kws={'alpha': 0.5}) # Reduce marker opacity for better visibility

# Step 5: Adjust the title and spacing to prevent overlap


plt.subplots_adjust(top=0.95) # Adjust the top margin to avoid title overlapping with plots
pairplot.fig.suptitle("Pair Plot of Features", fontsize=16, y=1.02) # Add title and slightly shift it upwards

# Step 6: Show the pair plot


plt.show() # Display the pair plot


OUTPUT:


Program 3: Develop a program to implement Principal Component Analysis (PCA) for reducing the dimensionality of the Iris dataset from 4 features to 2.

Code:
# Import necessary libraries
import numpy as np # For numerical operations
import pandas as pd # For handling tabular data
import matplotlib.pyplot as plt # For plotting
import seaborn as sns # For better visualizations
from sklearn.decomposition import PCA # PCA algorithm
from sklearn.preprocessing import StandardScaler # Standardization

# Step 1: Load the dataset from the provided path


data = pd.read_csv("/content/drive/MyDrive/dataset/6TH SEM/iris.csv") # Load dataset

# Display the first few rows to check data structure


print("Dataset Sample:\n", data.head())

# Step 2: Convert dataset into DataFrame


df = pd.DataFrame(data)

# Step 3: Standardize the features (PCA is sensitive to scale)


scaler = StandardScaler() # Create a scaler object
X_scaled = scaler.fit_transform(df.iloc[:, :-1]) # Standardize all numerical features (exclude target)

# Step 4: Apply PCA (Reduce dimensions from 4 to 2)


pca = PCA(n_components=2) # Define PCA with 2 principal components
X_pca = pca.fit_transform(X_scaled) # Apply PCA transformation

# Step 5: Convert PCA results into a DataFrame for visualization


pca_df = pd.DataFrame(data=X_pca, columns=['PC1', 'PC2']) # Create new DataFrame with PCA results

# Add the target column (species) for visualization


pca_df['species'] = df['species'] # Use 'species' as the target variable

# Step 6: Visualize PCA-transformed data


plt.figure(figsize=(8, 6))
sns.scatterplot(x='PC1', y='PC2', hue=pca_df['species'], palette='deep', data=pca_df) # Scatter plot
plt.title('PCA of Iris Dataset (4D → 2D)', fontsize=14) # Title
plt.xlabel('Principal Component 1') # X-axis label
plt.ylabel('Principal Component 2') # Y-axis label
plt.legend(title='Species') # Legend for species
plt.grid(True) # Show grid for better readability
plt.show() # Display the plot

# Step 7: Print explained variance ratio (How much information each principal component retains)
print("\nExplained Variance Ratio of PCA Components:", pca.explained_variance_ratio_)
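To see how much of the total variance the two components retain together, the cumulative sum can also be printed (a small optional addition using the numpy import already present):

print("Cumulative variance retained:", np.cumsum(pca.explained_variance_ratio_))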

OUTPUT:


Program 4: For a given set of training data examples stored in a .CSV file, implement and
demonstrate the Find-S algorithm to output a description of the set of all hypotheses
consistent with the training examples.

Code:
# Import necessary libraries
import pandas as pd # For data handling

# Step 1: Load the dataset


data_path = "/content/drive/MyDrive/heart.csv" # Path to dataset
df = pd.read_csv(data_path) # Read CSV into DataFrame

# Step 2: Display dataset overview


print("Dataset Overview:")
print(df.head()) # Show first 5 rows of the dataset

# Step 3: Identify attributes (excluding target column)


attributes = df.columns[:-1] # All columns except the last one
target_column = df.columns[-1] # The target column (last column)

# Step 4: Filter positive examples (target == 1)


positive_examples = df[df[target_column] == 1]  # Select rows where the target equals 1

# Step 5: Initialize the most specific hypothesis


hypothesis = list(positive_examples.iloc[0, :-1]) # First positive example's attributes

# Step 6: Apply Find-S Algorithm (Generalizing Hypothesis)


for i in range(1, len(positive_examples)):  # Loop through all positive examples
    for j in range(len(hypothesis)):  # Loop through attributes
        if positive_examples.iloc[i, j] != hypothesis[j]:
            hypothesis[j] = '?'  # Generalize attribute if inconsistent

# Step 7: Output the final hypothesis


print("\nFinal Hypothesis (Find-S Algorithm):")
print(hypothesis)
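Note that Find-S is normally illustrated with categorical attributes; on a mostly numeric dataset such as heart.csv, most attributes quickly generalize to '?'. The same loop can be sanity-checked on a small hypothetical EnjoySport-style table (the column names and values below are made up purely for illustration):

# Hypothetical categorical data for checking the Find-S loop (illustrative only)
toy = pd.DataFrame({
    'sky':      ['Sunny', 'Sunny', 'Rainy', 'Sunny'],
    'air_temp': ['Warm',  'Warm',  'Cold',  'Warm'],
    'humidity': ['Normal', 'High', 'High',  'High'],
    'enjoy':    [1, 1, 0, 1]
})
pos = toy[toy['enjoy'] == 1]  # Keep only the positive examples
h = list(pos.iloc[0, :-1])  # Most specific hypothesis from the first positive example
for i in range(1, len(pos)):
    for j in range(len(h)):
        if pos.iloc[i, j] != h[j]:
            h[j] = '?'  # Generalize on disagreement
print(h)  # Expected: ['Sunny', 'Warm', '?']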


OUTPUT:


Program 5: Develop a program to implement the k-Nearest Neighbour algorithm to classify 100 randomly generated values of x in the range [0,1]. Perform the following based on the generated dataset:
1. Label the first 50 points {x1, ..., x50} as follows: if (xi ≤ 0.5), then xi ∈ Class1, else xi ∈ Class2.
2. Classify the remaining points x51, ..., x100 using KNN. Perform this for k = 1, 2, 3, 4, 5, 20, 30.

Code:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

# Step 1: Generate 100 random values in the range [0,1]


np.random.seed(42) # For reproducibility
X = np.random.rand(100, 1) # Generate 100 random values

# Step 2: Assign labels to all 100 points using the rule x <= 0.5 -> Class 1, otherwise Class 2
y = np.where(X[:50] <= 0.5, 1, 2)  # Labels for the first 50 (training) points
y = np.concatenate([y, np.where(X[50:] <= 0.5, 1, 2)])  # Rule-based labels for the remaining 50 points (kept for reference)

# Step 3: Prepare training and testing data


X_train = X[:50] # First 50 points for training
y_train = y[:50].ravel() # Flatten the labels to avoid the DataConversionWarning
X_test = X[50:] # Remaining 50 points for testing
y_test = y[50:].ravel() # Flatten the labels to avoid the DataConversionWarning

# Step 4: Classify using KNN for different values of k


k_values = [1, 2, 3, 4, 5, 20, 30]
predictions = {}

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)  # Initialize KNN model
    knn.fit(X_train, y_train)  # Train the model
    y_pred = knn.predict(X_test)  # Predict on test data
    predictions[k] = y_pred  # Store predictions

# Step 5: Display predictions

for k, pred in predictions.items():
    print(f"\nPredictions for k={k}:")
    print(pred)

# Step 6: Visualize classification results


plt.figure(figsize=(10, 5))
for k in k_values:
    plt.scatter(X_test, predictions[k], label=f'k={k}', alpha=0.6)
plt.axvline(0.5, color='red', linestyle='--', label='Decision Boundary (x=0.5)')
plt.xlabel("X values")
plt.ylabel("Predicted Class")
plt.title("KNN Classification Results for Different k Values")
plt.legend()
plt.show()
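Because the true class of x51, ..., x100 follows the same labelling rule, an accuracy check per k (not required by the task, but useful for the viva discussion) can be added:

# Optional: accuracy of each k against the rule-based labels of the test points
from sklearn.metrics import accuracy_score
for k in k_values:
    print(f"k={k}: accuracy = {accuracy_score(y_test, predictions[k]):.2f}")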

OUTPUT:


Program 6: Implement the non-parametric Locally Weighted Regression algorithm in order to fit data points. Select an appropriate dataset for your experiment and draw graphs.

Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset


data_path = "/content/drive/MyDrive/heart.csv" # Path to dataset
df = pd.read_csv(data_path) # Read CSV into DataFrame

# Select the features for regression (for simplicity, let's use 'age' as feature and 'trestbps' as target)
X = df['age'].values.reshape(-1, 1) # Feature: 'age'
y = df['trestbps'].values # Target: 'trestbps'

# Locally Weighted Regression function


def locally_weighted_regression(X_train, y_train, tau=0.1):
    """
    Locally Weighted Linear Regression function.
    X_train: Feature data for training
    y_train: Target data for training
    tau: Smoothing parameter; smaller values make the model more sensitive to local variations
    """
    m = len(X_train)
    weights = np.zeros((m, m))  # Initialize the weight matrix

    # Calculate the weights based on a Gaussian kernel
    for i in range(m):
        diff = X_train - X_train[i]  # Difference between the current point and all points
        weights[:, i] = np.exp(-np.sum(diff**2, axis=1) / (2 * tau**2))  # Weights of all points relative to point i

    # Compute the coefficients (theta) using the weighted normal equation
    X_train = np.hstack((np.ones((m, 1)), X_train))  # Add bias term (intercept)
    theta = np.linalg.inv(X_train.T @ weights @ X_train) @ (X_train.T @ weights @ y_train)

    return theta

# Fit the model using Locally Weighted Regression


theta = locally_weighted_regression(X, y, tau=0.1)
print("\nCalculated theta (coefficients):", theta)

# Predictions function
def predict(X, theta):
    """
    Predict the target using the trained model coefficients.
    """
    X = np.hstack((np.ones((X.shape[0], 1)), X))  # Add bias term (intercept)
    return X @ theta

# Generate predictions
y_pred = predict(X, theta)

# Visualize the results


plt.scatter(X, y, color='blue', label="Original data")
plt.plot(X, y_pred, color='red', label="Locally Weighted Regression")
plt.xlabel('Age')
plt.ylabel('Resting Blood Pressure')
plt.title('Locally Weighted Regression: Age vs Resting Blood Pressure')
plt.legend()
plt.show()
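The function above fits one global line using a combined weight matrix. In the textbook formulation, Locally Weighted Regression instead solves a separate weighted least-squares problem for every query point. A minimal sketch of that per-query version (an assumption-laden illustration: same age/trestbps columns, with the bandwidth tau chosen on the scale of age, e.g. 5 years) is:

# Per-query LWR sketch: one weighted fit per query point (illustrative only)
def lwr_predict(x_query, X_train, y_train, tau=5.0):
    w = np.exp(-((X_train[:, 0] - x_query) ** 2) / (2 * tau ** 2))  # Gaussian weights around the query
    W = np.diag(w)
    Xb = np.hstack((np.ones((len(X_train), 1)), X_train))  # Add bias term
    theta_q = np.linalg.pinv(Xb.T @ W @ Xb) @ (Xb.T @ W @ y_train)  # Weighted normal equation
    return np.array([1.0, x_query]) @ theta_q

x_grid = np.linspace(X.min(), X.max(), 100)  # Query points across the age range
y_lwr = np.array([lwr_predict(x, X, y, tau=5.0) for x in x_grid])
# y_lwr can then be plotted against x_grid on a fresh figure to compare with the plot above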

OUTPUT:


Program 7: Develop a program to demonstrate the working of Linear Regression and Polynomial Regression. Use the Boston Housing dataset for Linear Regression and the Auto MPG dataset (for vehicle fuel-efficiency prediction) for Polynomial Regression.

Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.impute import SimpleImputer # For handling missing values

# Load datasets
boston_data_path = "/content/drive/MyDrive/dataset/6TH SEM/BostonHousing.csv"
auto_mpg_data_path = "/content/drive/MyDrive/dataset/6TH SEM/auto-mpg.csv"

# Boston Housing Dataset (for Linear Regression)


boston_data = pd.read_csv(boston_data_path)
print("\nBoston Housing Dataset Overview:")
print(boston_data.head())

# Prepare features (X) and target (y) for Linear Regression


X_boston = boston_data.iloc[:, :-1].values # Features (all columns except 'medv')
y_boston = boston_data['medv'].values # Target (median value of homes)

# Train-test split for Linear Regression


X_train_boston, X_test_boston, y_train_boston, y_test_boston = train_test_split(X_boston, y_boston,
test_size=0.2, random_state=42)

# Linear Regression Model


lin_reg = LinearRegression()
lin_reg.fit(X_train_boston, y_train_boston)

# Predictions and Performance Evaluation


y_pred_boston = lin_reg.predict(X_test_boston)
print("\nLinear Regression Performance (Boston Housing):")
print(f"Mean Squared Error: {mean_squared_error(y_test_boston, y_pred_boston)}")
print(f"R-squared: {r2_score(y_test_boston, y_pred_boston)}")

# Plotting the true vs predicted values for Linear Regression


plt.figure(figsize=(10, 6))
plt.scatter(y_test_boston, y_pred_boston, color='blue')
plt.plot([y_test_boston.min(), y_test_boston.max()], [y_test_boston.min(), y_test_boston.max()],
color='red', linewidth=2)
plt.xlabel('True Prices')
plt.ylabel('Predicted Prices')
plt.title('Linear Regression: True vs Predicted Prices (Boston Housing)')
plt.show()

# Auto MPG Dataset (for Polynomial Regression)


auto_mpg_data = pd.read_csv(auto_mpg_data_path)
print("\nAuto MPG Dataset Overview:")
print(auto_mpg_data.head())

# Handle missing values by imputing them


imputer = SimpleImputer(strategy='mean') # Impute missing values with the mean of the column
auto_mpg_data[['horsepower', 'weight']] = imputer.fit_transform(auto_mpg_data[['horsepower',
'weight']])
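# Note (assumption about the CSV): the original UCI auto-mpg file marks missing horsepower
# values with '?', which makes the column non-numeric; if this CSV does the same, coerce it
# to numeric before the imputation above, e.g.:
# auto_mpg_data['horsepower'] = pd.to_numeric(auto_mpg_data['horsepower'], errors='coerce')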

# Prepare features (X) and target (y) for Polynomial Regression


X_auto = auto_mpg_data[['horsepower', 'weight']].values # Features: horsepower, weight
y_auto = auto_mpg_data['mpg'].values # Target: miles per gallon

# Train-test split for Polynomial Regression


X_train_auto, X_test_auto, y_train_auto, y_test_auto = train_test_split(X_auto, y_auto, test_size=0.2,
random_state=42)

# Polynomial Regression Model


poly = PolynomialFeatures(degree=2) # Degree of the polynomial (quadratic)
X_train_poly = poly.fit_transform(X_train_auto)
X_test_poly = poly.transform(X_test_auto)

poly_reg = LinearRegression()
poly_reg.fit(X_train_poly, y_train_auto)

# Predictions and Performance Evaluation


y_pred_auto = poly_reg.predict(X_test_poly)
print("\nPolynomial Regression Performance (Auto MPG):")
print(f"Mean Squared Error: {mean_squared_error(y_test_auto, y_pred_auto)}")
print(f"R-squared: {r2_score(y_test_auto, y_pred_auto)}")

# Plotting the true vs predicted values for Polynomial Regression


plt.figure(figsize=(10, 6))
plt.scatter(y_test_auto, y_pred_auto, color='green')
plt.plot([y_test_auto.min(), y_test_auto.max()], [y_test_auto.min(), y_test_auto.max()], color='red',
linewidth=2)
plt.xlabel('True MPG')
plt.ylabel('Predicted MPG')
plt.title('Polynomial Regression: True vs Predicted MPG (Auto MPG)')
plt.show()

OUTPUT:


Program 8: Develop a program to demonstrate the working of the decision tree algorithm. Use the Breast Cancer dataset for building the decision tree and apply this knowledge to classify a new sample.

Code:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt  # Needed later for the decision tree visualization
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the dataset


dataset_path = '/content/drive/MyDrive/dataset/breastcancer_modified.csv'
dataset = pd.read_csv(dataset_path)

# Data Preprocessing
X = dataset.drop(['diagnosis'], axis=1) # Features (excluding the target column 'diagnosis')
y = dataset['diagnosis'] # Target variable: 'diagnosis'

# Data Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Data Scaling (using MinMaxScaler)


scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train) # Fit and transform on training data
X_test_scaled = scaler.transform(X_test) # Transform test data based on training data scaling

# Train a Decision Tree Model


clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train_scaled, y_train)

# Make Predictions
y_pred = clf.predict(X_test_scaled)

# Model Evaluation
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))


# Visualize the Decision Tree (optional)


from sklearn.tree import plot_tree
plt.figure(figsize=(20, 10))
plot_tree(clf, filled=True, feature_names=X.columns, class_names=[str(i) for i in clf.classes_],
rounded=True)
plt.title("Decision Tree Visualization")
plt.show()

# Example: Classify a new sample (this matches the structure of the original dataset)
new_sample = pd.DataFrame({
'radius_mean': [15.0], # Example values for features
'texture_mean': [20.0],
'perimeter_mean': [100.0],
'area_mean': [500.0],
'smoothness_mean': [0.1],
'compactness_mean': [0.2],
'concavity_mean': [0.3],
'concave points_mean': [0.15],
'symmetry_mean': [0.3],
'fractal_dimension_mean': [0.05],
'radius_se': [0.1],
'texture_se': [0.2],
'perimeter_se': [0.05],
'area_se': [200.0],
'smoothness_se': [0.02],
'compactness_se': [0.05],
'concavity_se': [0.1],
'concave points_se': [0.03],
'symmetry_se': [0.06],
'fractal_dimension_se': [0.02],
'radius_worst': [25.0],
'texture_worst': [25.0],
'perimeter_worst': [150.0],
'area_worst': [1200.0],
'smoothness_worst': [0.1],
'compactness_worst': [0.5],
'concavity_worst': [0.4],
'concave points_worst': [0.3],
'symmetry_worst': [0.5],
'fractal_dimension_worst': [0.1]
})


# Scale the new sample using the same scaler (fit during training)
new_sample_scaled = scaler.transform(new_sample)

# Predict the class using the trained decision tree model


new_prediction = clf.predict(new_sample_scaled)

# Output the prediction result


print("\nNew Sample Prediction:", new_prediction[0])

OUTPUT:


Program 9: Develop a program to implement the Naive Bayesian classifier considering the Olivetti Face dataset for training. Compute the accuracy of the classifier, considering a few test data sets.

Code:
import scipy.io
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the Olivetti Faces dataset


mat_file_path = '/content/drive/MyDrive/dataset/6TH SEM/olivettifaces.mat'
mat_data = scipy.io.loadmat(mat_file_path)

# Extract face images from the dataset


faces = mat_data['faces'] # Shape (4096, 400), each column is a 64x64 image
num_samples = faces.shape[1] # Total number of samples

# Reshape faces into a proper format for training


X = faces.T # Transpose to get 400 samples, each as a 4096-dimensional vector

# Generate labels (each person has 10 images, 40 persons)


y = np.array([i // 10 for i in range(num_samples)]) # Labels from 0 to 39

# Split dataset (80% train, 20% test)


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Train Naive Bayes Classifier


nb_classifier = GaussianNB()
nb_classifier.fit(X_train, y_train)

# Predictions
y_pred = nb_classifier.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Naive Bayes Classifier Accuracy: {accuracy * 100:.2f}%\n")

# Display classification report (avoid warning by setting zero_division=0)


print("Classification Report:\n", classification_report(y_test, y_pred, zero_division=0))


# Display confusion matrix


conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)

# Visualize 10 test samples with predictions


fig, axes = plt.subplots(2, 5, figsize=(10, 5))
axes = axes.ravel()

for i in range(10):
    axes[i].imshow(X_test[i].reshape(64, 64), cmap='gray')
    axes[i].set_title(f"Pred: {y_pred[i]}\nActual: {y_test[i]}")
    axes[i].axis('off')

plt.tight_layout()
plt.show()

OUTPUT:


Program 10: Develop a program to implement k-means clustering using the Wisconsin Breast Cancer dataset and visualize the clustering result.

Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load dataset
dataset_path = '/content/drive/MyDrive/dataset/breastcancer_modified.csv'
df = pd.read_csv(dataset_path)

# Display basic info


print("Dataset Overview:\n", df.head())

# Drop non-numeric columns if any (assuming 'id' or 'diagnosis' might be present)


if 'id' in df.columns:
    df = df.drop(columns=['id'])
if 'diagnosis' in df.columns:
    df = df.drop(columns=['diagnosis'])  # Drop the target column; clustering is unsupervised

# Standardize the data


scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)

# Apply K-Means Clustering


k = 2 # Assuming two clusters (Malignant & Benign)
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
df['Cluster'] = kmeans.fit_predict(X_scaled)

# Visualizing Clusters using PCA


pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

df['PCA1'] = X_pca[:, 0]
df['PCA2'] = X_pca[:, 1]

plt.figure(figsize=(8, 6))
sns.scatterplot(x=df['PCA1'], y=df['PCA2'], hue=df['Cluster'], palette='viridis', s=80)
centers_pca = pca.transform(kmeans.cluster_centers_)  # Project centroids into the same PCA space as the points
plt.scatter(centers_pca[:, 0], centers_pca[:, 1], c='red', marker='X', s=200, label="Centroids")
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.title('K-Means Clustering Visualization')
plt.legend()
plt.show()

# Print cluster centers


print("Cluster Centers (Original Scaled Data):\n", kmeans.cluster_centers_)
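Since the true diagnosis labels exist in the original CSV, the clusters can optionally be compared against them with the adjusted Rand index (a sketch; it assumes the 'diagnosis' column is present in the file):

# Optional: compare the clusters with the known diagnosis labels
from sklearn.metrics import adjusted_rand_score
labels_true = pd.read_csv(dataset_path)['diagnosis']  # Re-read the column dropped earlier
print("Adjusted Rand Index vs. diagnosis:", adjusted_rand_score(labels_true, df['Cluster']))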

OUTPUT:


Summary of Essential Commands


Step  Command                                                            Purpose
1     sudo apt update && sudo apt upgrade -y                             Updates system packages
2     sudo apt install python3 python3-pip python3-venv -y               Installs Python & pip
3     python3 --version                                                  Checks Python version
4     pip3 --version                                                     Checks pip version
5     python3 -m venv ml_env                                             Creates a virtual environment
6     source ml_env/bin/activate                                         Activates the virtual environment
7     pip3 install numpy pandas matplotlib seaborn scikit-learn scipy    Installs ML libraries
8     python3 -c "import numpy, pandas, matplotlib, seaborn, sklearn, scipy; print('ML Libraries Installed Successfully!')"    Verifies installation
9     nano test_ml.py                                                    Creates a test ML script
10    python3 test_ml.py                                                 Runs the ML script
11    deactivate                                                         Deactivates the virtual environment


VIVA QUESTIONS:
1. What is a histogram, and how does it help in data analysis?
2. How do box plots help in identifying outliers in a dataset?
3. What is the significance of the correlation matrix in data analysis?
4. How does a heatmap help in visualizing the correlation matrix?
5. What is a pair plot, and how does it help in feature analysis?
6. What is Principal Component Analysis (PCA), and why is it used?
7. How does PCA reduce dimensionality while preserving variance?
8. What are eigenvalues and eigenvectors in PCA?
9. Explain the Find-S algorithm and its working principle.
10. What are the assumptions of the Find-S algorithm?
11. What is k-Nearest Neighbors (KNN), and how does it work?
12. How does the choice of k affect the performance of the KNN algorithm?
13. What is the difference between parametric and non-parametric models?
14. How does Locally Weighted Regression differ from standard linear regression?
15. What are the advantages of Locally Weighted Regression?
16. What is the difference between linear regression and polynomial regression?
17. How do you evaluate the performance of a regression model?
18. What are the main components of a decision tree?
19. How does a decision tree split data at each node?
20. What is entropy in the context of decision tree algorithms?
