
Department of Artificial Intelligence and Machine Learning


B.E. - VI Semester
Course Name : MACHINE LEARNING LAB
Course Code : BAIL 606
Certificate

This is to certify that Shri / Kum.

_____________________________________________________

bearing the USN _________________ studying in VI Semester

B.E. in Department of Artificial Intelligence and Machine

Learning, GNDEC, Bidar has successfully completed all the

experiments in Machine Learning Lab (BAIL 606) as prescribed

by the Visvesvaraya Technological University during the

academic year ______________ .

Course In-charge H.O.D

Examiners:

1.

2.

Table of Contents

1. Introduction to Machine Learning
2. Univariate analysis, histogram, box plot, bar graph and pie chart using Iris dataset
3. Bivariate analysis, Pearson's correlation coefficient and heat map using Titanic dataset
4. Implementation of Principal Component Analysis (PCA)
5. Implementation of k-nearest neighbours algorithm (kNN) with k=1,3,5
6. Implementation of non-parametric Locally Weighted Linear Regression algorithm
7. Implementation of Linear Regression using Boston Housing dataset and Polynomial Regression using Auto MPG dataset
8. Implementation of Decision Tree Classifier using Titanic dataset
9. Implementation of Naive Bayes Classifier on Iris dataset
10. Implementation of K-means clustering using Wisconsin Breast Cancer dataset


Introduction to Machine Learning

1. Data vs Information

• Definition: Data is raw, unprocessed facts and figures; information is processed, structured, and meaningful data.
• Nature: Data is unorganized, discrete values; information is organized, contextual, and meaningful.
• Meaning: Data lacks meaning on its own; information provides insights and understanding.
• Format: Data takes the form of numbers, characters, symbols, images, etc.; information takes the form of reports, summaries, insights, patterns, etc.
• Processing: Data is collected but not analyzed; information is analyzed, interpreted, and structured.
• Dependency: Data does not depend on information; information is derived from data.
• Example: Data: temperature readings of 30°C, 32°C, 28°C. Information: "The average temperature this week is 30°C."

2. Introduction to Machine Learning (ML)

Machine Learning (ML) is a subset of Artificial Intelligence (AI) that enables computers to learn patterns from data and make decisions or predictions without being explicitly programmed. It is widely used in applications like recommendation systems, image recognition, and fraud detection.


3. AI vs. Machine Learning vs. Data Science

• Definition: AI aims to create intelligent systems that simulate human cognition; ML is a subset of AI that enables systems to learn from data and improve over time; Data Science (DS) is the field that uses statistical and computational techniques to extract insights from data.
• Focus: AI focuses on decision-making, automation, and reasoning; ML on pattern recognition and predictive modeling; Data Science on data collection, processing, analysis, and visualization.
• Techniques used: AI uses machine learning, deep learning, expert systems, and rule-based systems; ML uses supervised, unsupervised, and reinforcement learning; Data Science uses statistical analysis, ML, and big data technologies.
• Example applications: AI includes chatbots, autonomous vehicles, and smart assistants (e.g., Siri, Alexa); ML includes spam detection, recommendation systems, and medical diagnosis; Data Science includes business intelligence, market analysis, and trend forecasting.

4. Types of Machine Learning

• Supervised Learning: Uses labeled data (input-output pairs) to train models. Example: Email spam classification.
• Unsupervised Learning: Works with unlabeled data to find hidden patterns. Example: Customer segmentation in marketing.
• Reinforcement Learning: Agents learn by interacting with an environment and receiving rewards/punishments. Example: Self-driving cars learning optimal driving strategies.


5. Common Algorithms in Machine Learning

• Supervised Learning Algorithms
  ◦ Linear Regression
  ◦ Logistic Regression
  ◦ Decision Trees
  ◦ Random Forest
  ◦ Support Vector Machines (SVM)
  ◦ Neural Networks
• Unsupervised Learning Algorithms
  ◦ K-Means Clustering
  ◦ Hierarchical Clustering
  ◦ Principal Component Analysis (PCA)

6. Steps in a Machine Learning Workflow

1. Data Collection: Gather relevant data from sources like databases, APIs, or
CSV files.
2. Data Preprocessing: Handle missing values, remove duplicates, and
normalize data.
3. Exploratory Data Analysis (EDA): Use visualization and statistics to
understand data patterns.
4. Feature Engineering: Select, extract, or create new features to improve
model accuracy.
5. Model Selection: Choose an appropriate machine learning algorithm.
6. Model Training: Train the model using historical data.
7. Model Evaluation: Assess performance using metrics like accuracy,
precision, recall, and F1-score.
8. Model Deployment: Deploy the trained model for real-world predictions.
9. Model Monitoring & Improvement: Continuously refine the model based
on new data.
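
The workflow above can be sketched end to end with scikit-learn. The snippet below is a minimal illustration of steps 1 to 7; the Iris data and the logistic regression classifier are arbitrary choices for demonstration, not part of the prescribed experiments.

# Minimal end-to-end sketch of the ML workflow (illustrative only)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# 1. Data collection: load a built-in dataset
X, y = load_iris(return_X_y=True)

# 2-4. Preprocessing: split the data, then scale the features
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 5-6. Model selection and training
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# 7. Model evaluation
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"F1-score: {f1_score(y_test, y_pred, average='weighted'):.4f}")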

7. Python Libraries for Machine Learning


• NumPy & Pandas: Data manipulation and numerical operations.
• Matplotlib & Seaborn: Data visualization.
• Scikit-learn: Popular ML algorithms and utilities.
• TensorFlow & PyTorch: Deep learning frameworks.


Experiment 1

Develop a program to load a dataset and select one numerical column. Compute the mean, median, mode, standard deviation, variance, and range for that column. Generate a histogram and boxplot to understand the distribution of the data. Identify any outliers in the data using the IQR method. Select a categorical variable from the dataset, compute the frequency of each category, and display it as a bar chart or pie chart.
________________________________________________________________

Dataset:
The Iris dataset is readily available through the seaborn library, which provides
a convenient interface for loading and working with the data.

Steps for Implementation:


1. Load the Iris dataset.
2. Select a numerical column and compute its mean, median, mode,
standard deviation, variance, and range.
3. Generate a histogram and boxplot to visualize the data distribution.
4. Identify outliers using the Interquartile Range (IQR) method.
5. Select a categorical variable, compute the frequency of each
category, and display the results using both bar and pie charts.

Code:
# Import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

# Load the Iris dataset


iris = sns.load_dataset('iris')

# Display the first five rows of the dataset


print("First five rows of the Iris dataset:")
print(iris.head())

# Select a numerical column: 'sepal_width'


numerical_col = 'sepal_width'
data = iris[numerical_col]


# Compute descriptive statistics


mean_val = data.mean()
median_val = data.median()
mode_val = data.mode()[0] # Mode can have multiple values; take the first one
std_dev = data.std()
variance = data.var()
data_range = data.max() - data.min()

# Display the computed statistics


print(f"\nDescriptive Statistics for '{numerical_col}':")
print(f"Mean: {mean_val}")
print(f"Median: {median_val}")
print(f"Mode: {mode_val}")
print(f"Standard Deviation: {std_dev}")
print(f"Variance: {variance}")
print(f"Range: {data_range}")

# Generate a histogram
plt.figure(figsize=(10, 5))
sns.histplot(data, kde=True, bins=10, color='skyblue')
plt.title(f'Histogram of {numerical_col}')
plt.xlabel(numerical_col)
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

# Generate a boxplot
plt.figure(figsize=(8, 4))
sns.boxplot(x=data, color='lightgreen')
plt.title(f'Boxplot of {numerical_col}')
plt.xlabel(numerical_col)
plt.grid(True)
plt.show()

# Identify outliers using the IQR method


Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = data[(data < lower_bound) | (data > upper_bound)]


# Display outliers with their index


print(f"\nOutliers in '{numerical_col}' (using IQR method):")
if outliers.empty:
    print("No Outliers Found")
else:
    for index, value in outliers.items():
        print(f"Index: {index}, Value: {value}")

# Select a categorical column: 'species'


categorical_col = 'species'
category_counts = iris[categorical_col].value_counts()

# Display frequency of each category


print(f"\nFrequency of each category in '{categorical_col}':")
print(category_counts)

# Generate a bar chart for the categorical variable


plt.figure(figsize=(8, 5))
sns.barplot(x=category_counts.index, y=category_counts.values,
palette='viridis')
plt.title(f'Bar Chart of {categorical_col}')
plt.xlabel(categorical_col)
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

# Generate a pie chart for the categorical variable


plt.figure(figsize=(8, 8))
plt.pie(category_counts, labels=category_counts.index, autopct='%1.1f%%',
colors=sns.color_palette('viridis', len(category_counts)))
plt.title(f'Pie Chart of {categorical_col}')
plt.show()

Instructions to Run the Code:


1. Access Google Colab: Open Google Colab in your web browser.
2. Create a New Notebook: Click on “File” > “New Notebook” to start a
new notebook.
3. Copy and Paste the Code: Copy the entire script provided above and paste
it into a cell in your Colab notebook.
4. Execute the Cell:Run the cell by clicking the “Play” button on the left or
by pressing Shift + Enter.


Notes:
• Dataset Loading: The script utilizes the load_dataset function from
seaborn to load the Iris dataset directly, eliminating the need for external file
handling.
• Statistical Computations: The code calculates essential descriptive
statistics for the selected numerical column (sepal_width).
• Visualizations: Histograms and boxplots are generated to visualize
the distribution of the numerical data, while bar and pie charts are used for the
categorical variable (species).
• Outlier Detection: The Interquartile Range (IQR) method is
employed to identify potential outliers in the numerical data.

Output:


Conclusion:
This script provides a comprehensive approach to performing descriptive
statistical analysis and visualization on a dataset, facilitating a deeper
understanding of the data’s characteristics.


Experiment 2

Develop a program to load a dataset with at least two numerical columns (e.g., Iris, Titanic). Plot a scatter plot of two variables and calculate their Pearson correlation coefficient. Write a program to compute the covariance and correlation matrix for a dataset. Visualize the correlation matrix using a heatmap to see which variables have strong positive/negative correlations.
________________________________________________________________

Dataset:
The Titanic dataset is readily available through the seaborn library, which
provides a convenient interface for loading and working with the data.

Steps for Implementation:


1. Loads a dataset with at least two numerical columns (Titanic dataset
from Seaborn).
2. Plots a scatter plot between two variables.
3. Computes and prints the Pearson correlation coefficient.
4. Computes the covariance and correlation matrix.
5. Visualizes the correlation matrix using a heatmap.

Code:
# Import necessary libraries
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the Titanic dataset from seaborn


df = sns.load_dataset("titanic")

# Select numerical columns (age and fare for analysis)


df_numeric = df[['age', 'fare']].dropna() # Remove missing values

# Scatter plot between age and fare


plt.figure(figsize=(8, 6))
sns.scatterplot(x=df_numeric['age'], y=df_numeric['fare'], alpha=0.7)
plt.xlabel("Age")
plt.ylabel("Fare")
plt.title("Scatter Plot of Age vs Fare")
plt.show()


# Calculate Pearson correlation coefficient


pearson_corr = df_numeric['age'].corr(df_numeric['fare'])
print(f"Pearson Correlation Coefficient (Age vs Fare): {pearson_corr:.3f}")

# Compute the covariance matrix


cov_matrix = df_numeric.cov()
print("\nCovariance Matrix:")
print(cov_matrix)

# Compute the correlation matrix


corr_matrix = df_numeric.corr()
print("\nCorrelation Matrix:")
print(corr_matrix)

# Heatmap of correlation matrix


plt.figure(figsize=(6, 4))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f",
linewidths=0.5)
plt.title("Heatmap of Correlation Matrix")
plt.show()

Note:
• Loads the Titanic dataset using sns.load_dataset("titanic").
• Filters two numerical columns (age and fare) and drops missing
values.
• Scatter plot shows the relationship between age and fare.
• Computes Pearson correlation coefficient using .corr().
• Calculates covariance and correlation matrix.
• Uses sns.heatmap() to visualize correlations in a heatmap.


Output:


Conclusion:
This experiment highlights the importance of statistical correlation analysis in
understanding variable relationships within a dataset. While age and fare in the
Titanic dataset do not show a strong correlation, similar techniques can be
applied to explore relationships in other datasets.
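
As an optional cross-check of the value returned by .corr(), the Pearson coefficient can also be computed directly from the covariance and the standard deviations; a minimal sketch reusing the df_numeric frame from the code above:

# Pearson r = cov(age, fare) / (std(age) * std(fare))
cov_age_fare = df_numeric['age'].cov(df_numeric['fare'])
r_manual = cov_age_fare / (df_numeric['age'].std() * df_numeric['fare'].std())
print(f"Manually computed Pearson r: {r_manual:.3f}")  # should match the .corr() result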


Experiment 3
Develop a program to implement Principal Component Analysis
(PCA) for reducing the dimensionality of the Iris dataset from 4
features to 2.
________________________________________________________________

Dataset:
The Iris dataset is readily available through the seaborn library, which provides
a convenient interface for loading and working with the data.

Steps for Implementation:


1. Load the Iris dataset (which has 4 numerical features).
2. Standardize the data for better PCA performance.
3. Apply PCA to reduce the dataset from 4 features to 2.
4. Visualize the 2D transformed data using a scatter plot.
5. Provide a conclusion summarizing the findings.

Code:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the Iris dataset


iris = datasets.load_iris()
X = iris.data # Features (4-dimensional)
y = iris.target # Class labels (0, 1, 2)

# Standardize the data (PCA performs better with standardized data)


scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA to reduce dimensions from 4 to 2


pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Convert PCA results into a DataFrame for visualization


pca_df = pd.DataFrame(data=X_pca, columns=['Principal Component 1',


'Principal Component 2'])
pca_df['Target'] = y

# Plot the PCA results


plt.figure(figsize=(8, 6))
sns.scatterplot(x=pca_df['Principal Component 1'], y=pca_df['Principal Component 2'], hue=pca_df['Target'], palette='viridis')
plt.title('PCA Visualization of Iris Dataset (2D)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title="Species", labels=iris.target_names)
plt.grid(True)
plt.show()

# Explained variance ratio


explained_variance = pca.explained_variance_ratio_
print(f"Explained Variance by PC1: {explained_variance[0]:.2f}")
print(f"Explained Variance by PC2: {explained_variance[1]:.2f}")
print(f"Total Explained Variance: {sum(explained_variance) * 100:.2f}%")

Output:


Conclusion:
1. Dimensionality Reduction Success:
◦ We successfully reduced the 4-dimensional Iris dataset to 2
dimensions using PCA while retaining most of the information.
2. Variance Explained:
◦ The first two principal components together explain most of the
variance in the dataset.
◦ The explained variance ratio tells us how much information is
preserved after transformation.
3. Visualization of Clusters:
◦ The scatter plot shows that different species of flowers in the Iris
dataset form distinct clusters, confirming that PCA preserves class
separability.
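
As an optional extension, the contribution of each original feature to the two principal components can be inspected from the fitted pca object in the code above; a minimal sketch:

# Inspect how strongly each original feature loads on each principal component
loadings = pd.DataFrame(pca.components_, columns=iris.feature_names, index=['PC1', 'PC2'])
print(loadings)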


Experiment 4

Develop a program to load the Iris dataset. Implement the k-Nearest Neighbors (k-NN) algorithm for classifying flowers based on their features. Split the dataset into training and testing sets and evaluate the model using metrics like accuracy and F1-score. Test it for different values of k (e.g., k = 1, 3, 5) and evaluate the accuracy. Extend the k-NN algorithm to assign weights based on the distance of neighbors (e.g., weight = 1/d²). Compare the performance of weighted k-NN and regular k-NN on a synthetic or real-world dataset.
________________________________________________________

Dataset:
The Iris dataset is readily available through the seaborn library, which provides
a convenient interface for loading and working with the data.

Steps for Implementation:


1. Loading the Iris dataset
2. Splitting the data into training and testing sets
3. Implementing k-NN with different values of k (1, 3, 5)
4. Evaluating accuracy and F1-score
5. Extending k-NN to assign weights based on distance (weighted k-NN)
6. Comparing weighted k-NN vs. regular k-NN

Code:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score

# Load the Iris dataset


iris = datasets.load_iris()
X = iris.data # Features (sepal length, sepal width, petal length, petal width)
y = iris.target # Class labels (0: Setosa, 1: Versicolor, 2: Virginica)


# Split the dataset into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)

# Standardize the features for better k-NN performance


scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Function to evaluate k-NN with different values of k


def evaluate_knn(k_values, weighted=False):
    results = []
    for k in k_values:
        # Define k-NN classifier (weighted or regular)
        if weighted:
            knn = KNeighborsClassifier(n_neighbors=k, weights='distance') # Weighted k-NN
        else:
            knn = KNeighborsClassifier(n_neighbors=k) # Regular k-NN

        # Train the model
        knn.fit(X_train, y_train)

        # Make predictions
        y_pred = knn.predict(X_test)

        # Evaluate performance
        accuracy = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred, average='weighted')

        results.append((k, accuracy, f1))

        print(f"k={k} | Accuracy: {accuracy:.4f} | F1-score: {f1:.4f} | {'Weighted' if weighted else 'Regular'} k-NN")

    return results

# Test for different values of k (1, 3, 5) - Regular k-NN


print("Regular k-NN Results:")
regular_results = evaluate_knn(k_values=[1, 3, 5])

# Test for different values of k (1, 3, 5) - Weighted k-NN


print("\nWeighted k-NN Results:")


weighted_results = evaluate_knn(k_values=[1, 3, 5], weighted=True)

# Convert results to DataFrame for visualization


df_regular = pd.DataFrame(regular_results, columns=['k', 'Accuracy', 'F1-score'])
df_weighted = pd.DataFrame(weighted_results, columns=['k', 'Accuracy', 'F1-score'])

# Plot comparison of regular and weighted k-NN


plt.figure(figsize=(8, 5))
plt.plot(df_regular['k'], df_regular['Accuracy'], marker='o', linestyle='-',
label='Regular k-NN')
plt.plot(df_weighted['k'], df_weighted['Accuracy'], marker='s', linestyle='--',
label='Weighted k-NN', color='red')
plt.xlabel('k (Number of Neighbors)')
plt.ylabel('Accuracy')
plt.title('Regular vs Weighted k-NN Accuracy Comparison')
plt.legend()
plt.grid(True)
plt.show()

Output:


Conclusion:
1. Effect of k on Performance:
◦ For k=1, accuracy is generally high but may overfit.
◦ For k=3 and k=5, accuracy is more stable, reducing overfitting.
◦ Increasing k too much might reduce classification power.
2. Weighted vs. Regular k-NN:
◦ Weighted k-NN (distance-based weighting) often performs
better than regular k-NN, especially for datasets with uneven class
distributions.
◦ Distance-based weighting gives closer neighbours more importance, improving classification when noise is present. Note that scikit-learn's weights='distance' weighs neighbours by 1/d; a 1/d² variant is sketched after this conclusion.
k-NN is a simple but effective algorithm for classification. The choice of k and
weighting impacts accuracy significantly. Weighted k-NN is recommended
when different classes have overlapping regions.
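
If the 1/d² weighting from the task statement is wanted literally, KNeighborsClassifier also accepts a custom weight callable. A minimal sketch, reusing X_train, y_train, X_test and y_test from the code above (KNeighborsClassifier and accuracy_score are already imported there):

# Custom weight function: weigh each neighbour by 1/d^2
def inverse_square_distance(distances):
    return 1.0 / (distances ** 2 + 1e-9)  # small epsilon avoids division by zero

knn_sq = KNeighborsClassifier(n_neighbors=3, weights=inverse_square_distance)
knn_sq.fit(X_train, y_train)
y_pred_sq = knn_sq.predict(X_test)
print(f"1/d^2 weighted k-NN (k=3) | Accuracy: {accuracy_score(y_test, y_pred_sq):.4f}")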


Experiment 6

Implement the non-parametric Locally Weighted Regression algorithm in order to fit data points. Select an appropriate data set for your experiment and draw graphs.
________________________________________________________

Dataset:
The dataset used in this program is a synthetic nonlinear dataset: the input feature X consists of 100 evenly spaced points between -3 and 3, and the target variable y follows a sinusoidal function with added Gaussian noise, making it ideal for testing Locally Weighted Regression (LWR) on a nonlinear pattern.

Steps for Implementation:


1. Load a dataset (we'll use synthetic nonlinear data for better
visualization).
2. Implement Locally Weighted Regression using a Gaussian kernel.
3. Fit and predict new data points.
4. Visualize results using scatter plots and regression curves.
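
For reference, the quantities computed in the code below are the Gaussian kernel weights, the locally weighted least-squares solution, and the prediction at a query point x_q:

w^{(i)} = \exp\left(-\frac{\lVert x^{(i)} - x_q \rVert^2}{2\tau^2}\right), \qquad \hat{\theta} = (X^\top W X)^{-1} X^\top W y, \qquad \hat{y}(x_q) = x_q^\top \hat{\theta}

where W is the diagonal matrix of the weights w^{(i)} and τ is the bandwidth parameter.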

Code:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Generate a synthetic dataset (nonlinear function y = sin(x) + noise)


np.random.seed(42)
X = np.linspace(-3, 3, 100) # Input features
y = np.sin(X) + np.random.normal(scale=0.1, size=X.shape) # Target with noise
X = X.reshape(-1, 1) # Reshape for consistency

# Function to compute weights using Gaussian kernel


def get_weights(X_train, x_query, tau):
    weights = np.exp(-np.sum((X_train - x_query) ** 2, axis=1) / (2 * tau ** 2))
    return np.diag(weights)


# Locally Weighted Regression (LWR)


def locally_weighted_regression(X_train, y_train, x_query, tau):
    m = X_train.shape[0]
    X_b = np.hstack((np.ones((m, 1)), X_train)) # Add bias term (x0 = 1)
    x_query_b = np.hstack(([1], x_query)) # Bias for query point

    # Compute weight matrix
    W = get_weights(X_train, x_query, tau)

    # Compute theta = (X^T W X)^(-1) X^T W y
    theta = np.linalg.inv(X_b.T @ W @ X_b) @ X_b.T @ W @ y_train

    return x_query_b @ theta # Prediction for query point

# Predictions using LWR for different tau values


tau_values = [0.1, 0.5, 1.0] # Bandwidth parameter
x_test = np.linspace(-3, 3, 100).reshape(-1, 1) # Test data for smooth curve

plt.figure(figsize=(10, 6))

for tau in tau_values:
    y_pred = np.array([locally_weighted_regression(X, y, x, tau) for x in x_test])
    plt.plot(x_test, y_pred, label=f'tau={tau}')

# Scatter plot of original data points


plt.scatter(X, y, color='black', alpha=0.5, label="Data Points")

# Graph Labels and Legend


plt.xlabel("X")
plt.ylabel("Y")
plt.title("Locally Weighted Regression (LWR)")
plt.legend()
plt.grid()
plt.show()


Output:

Conclusion
1. Impact of τ (Bandwidth Parameter):
◦ Small τ (0.1) → Model is highly flexible (overfits, follows noise).
◦ Medium τ (0.5) → Balances flexibility and smoothness.
◦ Large τ (1.0) → Model becomes nearly linear (underfits).
2. Strengths of LWR:
◦ Works well for nonlinear datasets.
◦ No fixed global parameters—each prediction adapts to local
patterns.
3. Limitations:
◦ Computationally expensive (O(n³) per query point due to matrix
inversion).
◦ Not ideal for high-dimensional datasets due to sparsity issues.


Experiment 7

Develop a program to demonstrate the working of Linear Regression and Polynomial Regression. Use the Boston Housing Dataset for Linear Regression and the Auto MPG Dataset (for vehicle fuel efficiency prediction) for Polynomial Regression.
________________________________________________________

Datasets:
1. California Housing Dataset (Used for Linear Regression):
◦ This dataset contains information about housing prices in California, with features such as the average number of rooms per dwelling (AveRooms) and the median home value in different districts.
◦ We use Linear Regression to predict house prices based on the
average number of rooms per dwelling.
2. Auto MPG Dataset (Used for Polynomial Regression):
◦ This dataset provides fuel efficiency (miles per gallon) for cars
based on features like vehicle weight, horsepower, and
acceleration.
◦ We use Polynomial Regression (degree=2) to predict MPG (fuel
efficiency) based on vehicle weight, as the relationship is
nonlinear.
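
For reference, the two models fitted in the code below are a simple linear model on the California housing data and a degree-2 polynomial model on the Auto MPG data:

\text{price} = \beta_0 + \beta_1 \cdot \text{AveRooms} + \varepsilon, \qquad \text{mpg} = \beta_0 + \beta_1 \cdot \text{weight} + \beta_2 \cdot \text{weight}^2 + \varepsilon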

Steps for Implementation:


1. Linear Regression → Predicts house prices using the California Housing Dataset (the original Boston Housing Dataset is deprecated in scikit-learn).
2. Polynomial Regression → Predicts vehicle fuel efficiency using the
Auto MPG Dataset.
3. Dataset Descriptions → Brief details on both datasets.
4. Visualizations → Regression plots for better insights.
5. Conclusion → Performance comparison and key takeaways.

Code:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler


from sklearn.linear_model import LinearRegression


from sklearn.metrics import mean_squared_error, r2_score
from sklearn.datasets import fetch_california_housing

# --- LINEAR REGRESSION (Boston Housing Dataset) ---


# Load California Housing Dataset (Boston Housing Dataset is deprecated)
boston = fetch_california_housing()
X_boston = boston.data[:, 2].reshape(-1, 1) # Column 2 is AveRooms (average number of rooms per dwelling)
y_boston = boston.target # Median house values (in $100,000s)

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X_boston, y_boston,
test_size=0.2, random_state=42)

# Train Linear Regression Model


lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# Predict
y_pred = lin_reg.predict(X_test)

# Plot Linear Regression Results


plt.figure(figsize=(8, 5))
plt.scatter(X_test, y_test, color='blue', alpha=0.5, label="Actual Prices")
plt.plot(X_test, y_pred, color='red', linewidth=2, label="Linear Regression")
plt.xlabel("Average Number of Rooms (RM)")
plt.ylabel("House Price ($100,000s)")
plt.title("Linear Regression - California Housing Dataset")
plt.legend()
plt.show()

# Print Performance Metrics


print(f"Linear Regression - Mean Squared Error: {mean_squared_error(y_test,
y_pred):.4f}")
print(f"Linear Regression - R² Score: {r2_score(y_test, y_pred):.4f}")

# --- POLYNOMIAL REGRESSION (Auto MPG Dataset) ---


# Load Auto MPG Dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-
mpg.data"
columns = ["mpg", "cylinders", "displacement", "horsepower", "weight",
"acceleration", "model_year", "origin"]


df = pd.read_csv(url, sep=r"\s+", names=columns, na_values="?")

# Data Preprocessing
df = df.dropna() # Remove rows with missing values
df["horsepower"] = df["horsepower"].astype(float) # Convert horsepower to float

# Feature Selection (Using Weight to Predict MPG)


X_auto = df[["weight"]].values
y_auto = df["mpg"].values

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X_auto, y_auto, test_size=0.2,
random_state=42)

# Apply Polynomial Transformation (Degree 2)


poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Train Polynomial Regression Model


poly_reg = LinearRegression()
poly_reg.fit(X_train_poly, y_train)

# Predict
y_pred_poly = poly_reg.predict(X_test_poly)

# Plot Polynomial Regression Results


plt.figure(figsize=(8, 5))
plt.scatter(X_test, y_test, color='blue', alpha=0.5, label="Actual MPG")
plt.scatter(X_test, y_pred_poly, color='red', alpha=0.5, label="Predicted MPG")
plt.xlabel("Vehicle Weight")
plt.ylabel("Miles Per Gallon (MPG)")
plt.title("Polynomial Regression - Auto MPG Dataset")
plt.legend()
plt.show()

# Print Performance Metrics


print(f"Polynomial Regression - Mean Squared Error:
{mean_squared_error(y_test, y_pred_poly):.4f}")
print(f"Polynomial Regression - R² Score: {r2_score(y_test, y_pred_poly):.4f}")


Output:


Conclusion
1. Linear Regression Results:
◦ The California Housing Dataset shows a strong linear
correlation between house prices and the number of rooms per
dwelling.
◦ The R² score suggests how well the model fits the data.
2. Polynomial Regression Results:
◦ The Auto MPG Dataset exhibits a nonlinear relationship between
vehicle weight and fuel efficiency.
◦ Polynomial Regression (degree=2) performs better than Linear
Regression for this dataset.

Linear Regression is effective for simple relationships, while Polynomial Regression is better for capturing complex, nonlinear patterns.


Experiment 8
Develop a program to load the Titanic dataset. Split the data into
training and test sets. Train a decision tree classifier. Visualize the
tree structure. Evaluate accuracy, precision, recall, and F1-score.
________________________________________________________

Dataset:
The Titanic dataset is readily available through the seaborn library, which
provides a convenient interface for loading and working with the data. The
Titanic dataset contains information about passengers on the Titanic ship,
including features like Passenger Class, Sex, Age, Fare, and Embarked
Location. The target variable is Survived (1 for survived, 0 for not survived).

Steps for Implementation:


1. Load the Titanic dataset
2. Data Preprocessing
3. Split dataset into Training and Testing Sets
4. Train Decision Tree Classifier
5. Visualize the Decision Tree Structure
6. Evaluate Model using Accuracy, Precision, Recall, and F1-score

Code:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

# Load Titanic Dataset


url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/
titanic.csv"
titanic = pd.read_csv(url)

# Display First 5 Rows


print(titanic.head())


# Data Preprocessing
titanic.drop(["Name", "Ticket", "Cabin", "PassengerId"], axis=1, inplace=True)
titanic["Age"].fillna(titanic["Age"].median(), inplace=True)
titanic["Embarked"].fillna(titanic["Embarked"].mode()[0], inplace=True)

# Encode Categorical Variables


titanic["Sex"] = titanic["Sex"].map({"male": 0, "female": 1})
titanic["Embarked"] = titanic["Embarked"].map({"C": 0, "Q": 1, "S": 2})

# Splitting Dataset into Features and Target Variable


X = titanic.drop("Survived", axis=1) # Features
y = titanic["Survived"] # Target Variable

# Split into Training and Testing Sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)

# Hyperparameter Grid for Decision Tree


param_grid = {
"criterion": ["gini", "entropy"],
"max_depth": [3, 4, 5, 6, None],
"min_samples_split": [2, 5, 10],
"min_samples_leaf": [1, 2, 4]
}

# Grid Search with Cross Validation


grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42),
param_grid, cv=5, scoring="accuracy", verbose=1)
grid_search.fit(X_train, y_train)

# Best Parameters from Grid Search


print("\nBest Hyperparameters:", grid_search.best_params_)

# Train Decision Tree with Best Hyperparameters


best_tree = grid_search.best_estimator_

# Visualize Best Decision Tree


plt.figure(figsize=(18, 12))
plot_tree(best_tree, feature_names=X.columns, class_names=["Not Survived",
"Survived"], filled=True)
plt.title("Optimized Decision Tree Visualization")
plt.show()


# Predictions
y_pred = best_tree.predict(X_test)

# Evaluation Metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")

# Classification Report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Note:
In the above program, we have used Grid Search for hyperparameter tuning to find the best model. The best hyperparameters are selected based on the highest accuracy score during cross-validation. The model is then retrained with the optimal hyperparameters to improve performance.
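
If a closer look at the search is wanted, the cross-validated score of every tried hyperparameter combination can also be inspected from the fitted grid_search object; an optional sketch reusing the variables above:

# Optional: inspect cross-validation results for all hyperparameter combinations
cv_results = pd.DataFrame(grid_search.cv_results_)
print(cv_results[['params', 'mean_test_score', 'rank_test_score']].sort_values('rank_test_score').head())
print(f"Best cross-validated accuracy: {grid_search.best_score_:.4f}")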

Output:


Conclusion:
1. The Decision Tree model provides an interpretable classification structure
for predicting passenger survival based on their features.
2. Grid Search helps in identifying the best combination of hyperparameters
to avoid underfitting or overfitting.
3. The accuracy, precision, recall, and F1-score give insights into how well
the model performs on the testing data. The optimized Decision Tree model
performs better with increased accuracy and F1-score.
4. The visualization shows how the decision tree makes decisions at each node,
making it easy to understand the model's logic.
5. The final model is more reliable and produces better classification results
on the Titanic dataset.


Experiment 9
Develop a program to implement the Naive Bayesian classifier
considering Iris dataset for training. Compute the accuracy of the
classifier, considering the test data.
________________________________________________________

Dataset:
The Iris dataset is readily available through the seaborn library, which provides
a convenient interface for loading and working with the data.

Steps for Implementation:


1. Import necessary libraries.
2. Load the Iris dataset.
3. Split the dataset into features and target variable.
4. Divide the dataset into training (70%) and testing (30%) sets.
5. Create the Gaussian Naive Bayes Classifier.
6. Train the classifier using the training dataset.
7. Make predictions using the testing dataset.
8. Calculate accuracy of the classifier.
9. Display the classification report.
10. Visualize the confusion matrix using a heatmap.

Code:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Iris dataset


from sklearn.datasets import load_iris
iris = load_iris()

# Convert to DataFrame
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['species'] = iris.target


# Display first 5 rows of dataset


print("First 5 rows of dataset:")
print(iris_df.head())

# Splitting the dataset into features and target variable


X = iris_df.drop('species', axis=1)
y = iris_df['species']

# Split dataset into training and testing sets (70% training, 30% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)

# Create Gaussian Naive Bayes Classifier


nb_classifier = GaussianNB()

# Train the classifier


nb_classifier.fit(X_train, y_train)

# Make predictions on test data


y_pred = nb_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of Naive Bayesian Classifier: {accuracy:.4f}")

# Display classification report


print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Confusion Matrix Visualization


conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, cmap='Blues', fmt='d', xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()


Output:


Conclusion:
The Naive Bayesian Classifier demonstrates good performance on the Iris
dataset due to its simplicity and assumption of feature independence. It is
particularly effective for small datasets and provides competitive accuracy for
this classification task.
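
For reference, Gaussian Naive Bayes models the class-conditional likelihood of each feature as a normal distribution, with mean and variance estimated per class from the training data, and predicts the class with the highest posterior:

P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{y,i}^2}} \exp\left(-\frac{(x_i - \mu_{y,i})^2}{2\sigma_{y,i}^2}\right), \qquad \hat{y} = \arg\max_y \; P(y) \prod_i P(x_i \mid y)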


Experiment 10
Develop a program to implement k-means clustering using
Wisconsin Breast Cancer data set and visualize the clustering
result.
________________________________________________________

Dataset:
The Wisconsin Breast Cancer dataset contains information on breast cancer
cases. It consists of 30 numerical features computed from digitized images of
breast mass, such as the mean, standard error, and worst (largest) measurements
for characteristics like radius, texture, perimeter, area, and smoothness. The
target variable 'diagnosis' indicates whether the cancer is malignant or benign.

Steps for Implementation


1. Import necessary libraries.
2. Load the Wisconsin Breast Cancer dataset.
3. Preprocess the dataset by handling missing values and encoding
categorical variables.
4. Scale the dataset to normalize feature values.
5. Implement K-Means Clustering.
6. Visualize the clustering results.
7. Evaluate the clustering performance using metrics like inertia and
silhouette score.
8. Draw conclusions from the clustering results.

Code:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Wisconsin Breast Cancer dataset from the UCI Machine Learning Repository
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data'
columns = ['id', 'diagnosis'] + [f'feature_{i}' for i in range(1, 31)]
df = pd.read_csv(url, header=None, names=columns)


# Dataset Description
print("Dataset Description:\n")
print(df.describe())

# Preprocessing
df.drop(['id'], axis=1, inplace=True)
df['diagnosis'] = LabelEncoder().fit_transform(df['diagnosis'])

# Select relevant features to improve clustering performance


X = df[['feature_3', 'feature_4']]
y = df['diagnosis']

# Feature Scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Implement K-Means Clustering with optimized parameters


kmeans = KMeans(n_clusters=2, n_init=500, max_iter=2000, random_state=42)
kmeans.fit(X_scaled)
labels = kmeans.labels_

# Visualize Clustering Results (Using selected features)


plt.figure(figsize=(10, 6))
sns.scatterplot(x=X_scaled[:, 0], y=X_scaled[:, 1], hue=labels, palette='viridis')
plt.title('K-Means Clustering Results')
plt.xlabel('Feature 3')
plt.ylabel('Feature 4')
plt.legend(title='Cluster')
plt.show()

# Evaluate Clustering Performance


inertia = kmeans.inertia_
silhouette = silhouette_score(X_scaled, labels)
print(f"Inertia: {inertia:.4f}")
print(f"Silhouette Score: {silhouette:.4f}")


Output:


Conclusion:
The K-Means Clustering algorithm was applied to the Wisconsin Breast
Cancer dataset to group the data into two clusters. The clustering result provides
an unsupervised approach to categorize the data points based on feature
similarity. The Silhouette Score and Inertia metrics were used to evaluate the
clustering performance, indicating that the algorithm effectively separated the
data into meaningful clusters.
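
Although the number of clusters is fixed to 2 here (matching the benign/malignant diagnosis), the choice of k can also be checked empirically with the elbow method; an optional sketch reusing X_scaled from the code above:

# Optional: elbow method - plot inertia for several candidate values of k
inertias = []
k_range = range(1, 9)
for k in k_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X_scaled)
    inertias.append(km.inertia_)

plt.figure(figsize=(8, 5))
plt.plot(list(k_range), inertias, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Choosing k')
plt.grid(True)
plt.show()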
