ML Lab Manual
RAJEEV INSTITUTE OF TECHNOLOGY, HASSAN-573201
DEPARTMENT OF INFORMATION SCIENCE & ENGINEERING
MACHINE LEARNING LAB (BCSL606) - ISE
Mission of the Institute:
❖ To achieve academic excellence in engineering and management through dedication to
duty, offering state of the art education and faith in human values.
Vision of the Department:
❖ To be a center of excellence in Information Science and Engineering education by creating
competent professionals to deal with real-world challenges in the industry, research and society.
Mission of the Department:
❖ To empower students to become competent professionals, strong in the basics of
Information science and engineering through experiential learning.
❖ To strengthen the education and research ecosystem by inculcating ethical principles, and
facilitating interaction with premier institutes and industries around the world.
❖ Promote innovation and entrepreneurship to fulfill the needs of society and industry
PROGRAM SPECIFIC OUTCOMES
The graduates of the Information Science & Engineering program of Rajeev Institute of Technology
should be able to attain the following at the time of graduation.
PSO1: Analyze and develop software applications by applying skills in the field of coding
languages, algorithms, operating systems, database management, web design and data analytics.
PSO2: Apply knowledge of computational theory, system design and computer network concepts
for building networking and internet-based applications.
PROGRAM EDUCATIONAL OBJECTIVES
The program educational objectives are statements that describe the expected achievements of
graduates within the first few years of their graduation from the program. The program educational
objectives of the Bachelor of Information Science & Engineering program at Rajeev Institute of
Technology can be broadly defined as:
PEO1: Analyze, design and implement solutions to real-world problems in the field of
Information Science and Engineering with a multidisciplinary setup.
PEO2: Pursue higher studies with a strong knowledge of basic concepts and skills in Information
Technology disciplines.
PEO3: Adapt to emerging technologies towards continuous learning with ethical values, good
communication skills, leadership qualities and self-learning abilities.
PROGRAM OUTCOMES
PO2- Problem analysis: Identify, formulate, research literature, and analyze complex
engineering problems reaching substantiated conclusions using first principles of mathematics,
natural sciences, and engineering sciences.
PO3- Design/development of solutions: Design solutions for complex engineering problems and
design system components or processes that meet the specified needs with appropriate
consideration for public health and safety, and cultural, societal, and environmental
considerations.
PO5- Modern tool usage: Create, select, and apply appropriate techniques, resources, and
modern engineering and IT tools, including prediction and modeling to complex engineering
activities, with an understanding of the limitations.
PO6- The engineer and society: Apply reasoning informed by the contextual knowledge to
assess societal,health, safety, legal, and cultural issues and the consequent responsibilities relevant
to the professional engineering practice.
PO7- Environment and sustainability: Understand the impact of the professional engineering
solutions in societal and environmental contexts, and demonstrate the knowledge of, and need for
sustainable development.
PO8- Ethics: Apply ethical principles and commit to professional ethics and responsibilities and
norms of the engineering practice.
PO11- Project Management and Finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one’s own work, as a member and
leader in a team, to manage projects and in multidisciplinary environment.
PO12- Life-long learning: Recognize the need for, and have the preparation and ability to
engage in independent and life-long learning in the broadest context of technological change.
Course Outcomes
CO1: Illustrate the principles of multivariate data and apply dimensionality reduction techniques.
CO2: Demonstrate similarity-based learning methods and perform regression analysis.
CO3: Develop decision trees for classification and regression problems, and Bayesian models for
probabilistic learning.
CO4: Implement the clustering algorithms to share computing resources.
Mapping of Course Outcomes with Program Outcomes and Program Specific Outcomes:

Course Outcomes (RBT)  PO1  PO2   PO3  PO4  PO5  PO6  PO7  PO8  PO9  PO10  PO11  PO12  PSO1  PSO2
CO-1 (L3)              3    2     2    3    3    -    -    -    -    -     -     -     3     2
CO-2 (L3)              3    3     2    3    3    -    -    -    -    -     -     -     3     1
CO-3 (L3)              3    3     3    3    3    -    -    -    -    -     -     3     3     2
CO-4 (L3)              3    3     3    3    3    -    -    -    -    -     -     3     3     2
Ave. CO                3    2.75  2.5  3    3    -    -    -    -    -     -     3     3     1.75
CONTENTS
1. Vision & Mission of the Institute
2. Vision & Mission of the Department
3. Program Specific Outcomes
4. Program Educational Objectives
5. Program Outcomes
6. Syllabus
7. Contents
8. Laboratory Manual
1. Supervised Learning: the model learns from labelled examples (input features paired with known outputs).
o Example: Predicting house prices based on features like size, location, etc.
2. Unsupervised Learning: the model discovers structure in unlabelled data, for example grouping similar customers into clusters.
3. Reinforcement Learning: an agent learns by interacting with an environment and receiving rewards or penalties for its actions.
• Model Evaluation: Metrics like accuracy, precision, recall, and F1-score to assess
performance.
• Data Quality: Noisy, incomplete, or imbalanced data can affect model performance.
• Improves understanding: Helps in identifying patterns and trends that may not be
noticeable in raw data.
o Line Chart: Tracks trends over time (e.g., stock prices, temperature changes).
5. Geospatial Visualization
• R: ggplot2, Shiny
Data visualization plays a crucial role in data analysis and decision-making by making
information more intuitive and accessible. Choosing the right visualization techniques
ensures accurate and meaningful representation of data.
Introduction to PyCharm
PyCharm is a powerful integrated development environment (IDE) for Python, developed by
JetBrains. It provides essential tools for efficient Python development, including code
analysis, debugging, testing, and version control.
• Version Control Integration: Supports Git, SVN, and other VCS tools.
• Web Development Support: Works with Django, Flask, and other frameworks.
Features of PyCharm
1. Code Editor
2. Debugger
6. Testing Frameworks
Editions of PyCharm
• PyCharm Community (Free): A lightweight, open-source edition for pure Python development.
• PyCharm Professional (Paid): Advanced tools for web development, databases, and
scientific computing.
PyCharm is a feature-rich IDE that enhances Python development with powerful tools and
automation. It is widely used by professionals and beginners for software development, data
science, and web applications.
1.Develop a program to create histograms for all numerical features and analyse the
distribution of each feature. Generate box plots for all numerical features and identify
any outliers. Use California Housing dataset.
AIM:
To perform an exploratory data analysis (EDA) on the California Housing dataset, focusing
on the numerical features.
OBJECTIVES:
• Load the Dataset: Retrieve the California Housing dataset using the
fetch_california_housing function from the sklearn.datasets module and load it into a
Pandas DataFrame.
• Create Histograms: Generate histograms with KDE (Kernel Density Estimate) for
each numerical feature to visualize their distributions.
• Generate Box Plots: Create box plots for each numerical feature to visualize the
spread and identify potential outliers.
• Detect Outliers: Use the Interquartile Range (IQR) method to detect outliers in each
numerical feature and summarize the number of outliers detected.
• Dataset Summary: Optionally print a statistical summary of the dataset using the
describe method to provide an overview of the key statistics for each numerical
feature.
PROGRAM:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
EXPLANATION:
Importing Necessary Libraries
➢ import pandas as pd: Used for handling structured data (dataframes, series).
➢ import numpy as np: Helps with numerical operations and array manipulations.
➢ import seaborn as sns: Enhances data visualization (histograms, box plots).
➢ import matplotlib.pyplot as plt: Used for plotting figures and customizing
visualizations.
➢ from sklearn.datasets import fetch_california_housing: This imports the
fetch_california_housing function from sklearn.datasets. fetch_california_housing is
used to load the California housing dataset, which contains real estate data from
California, including information like median house values, population, and location-
based attributes.
This block identifies outliers using the Interquartile Range (IQR) method:
➢ print ("Outliers Detection:") prints a header for the outlier detection section.
➢ outliers_summary = {} initializes an empty dictionary outliers_summary to store
the number of outliers for each feature.
➢ for feature in numerical_features: A for loop iterates through each numerical
feature.
Formula: IQR = Q3 − Q1
Where:
Q1 (First Quartile / 25th Percentile) → The value below which 25% of the data falls.
Q3 (Third Quartile / 75th Percentile) → The value below which 75% of the data falls.
How is IQR Used to Detect Outliers? An outlier is a value that is significantly lower or
higher than the rest of the data. Outliers are typically found using this rule:
Lower bound = Q1 − 1.5 × IQR, Upper bound = Q3 + 1.5 × IQR.
Any value below the lower bound or above the upper bound is considered an outlier.
Example Calculation: if Q1 = 17.5 and Q3 = 37.5, then IQR = Q3 − Q1 = 37.5 − 17.5 = 20.
Why use the IQR method for outliers?
• It is less affected by extreme values than the mean and standard deviation.
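The listing above shows only the imports; a minimal end-to-end sketch of the steps described
in the objectives is given below. Variable names such as housing_df and the figure sizes are
illustrative choices, not necessarily those of the original listing.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing

# Load the dataset into a DataFrame (features plus the MedHouseVal target)
housing = fetch_california_housing(as_frame=True)
housing_df = housing.frame
numerical_features = housing_df.columns

# Histograms with KDE for every numerical feature
for feature in numerical_features:
    plt.figure(figsize=(6, 4))
    sns.histplot(housing_df[feature], kde=True, bins=30)
    plt.title(f"Distribution of {feature}")
    plt.show()

# Box plots to visualize spread and outliers
for feature in numerical_features:
    plt.figure(figsize=(6, 4))
    sns.boxplot(x=housing_df[feature])
    plt.title(f"Box plot of {feature}")
    plt.show()

# Outlier detection using the IQR rule
print("Outliers Detection:")
outliers_summary = {}
for feature in numerical_features:
    Q1 = housing_df[feature].quantile(0.25)
    Q3 = housing_df[feature].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = housing_df[(housing_df[feature] < lower_bound) |
                          (housing_df[feature] > upper_bound)]
    outliers_summary[feature] = len(outliers)
    print(f"{feature}: {len(outliers)} outliers")

# Optional statistical summary
print(housing_df.describe())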
OUTPUT:
Figure1 contains histograms of various numerical features from the California Housing
dataset. Following is an analysis of each distribution:
Key Observations
Figure2 contains box plots of numerical features from the California Housing dataset. Box
plots help visualize outliers and data distribution:
➢ AveRooms (Average Rooms)
The box is very narrow, meaning most data is tightly packed. Many extreme outliers,
indicating that some households report unrealistically high average rooms (e.g., 100+
rooms).
➢ AveOccup (Average Occupancy)
Extremely skewed. Most values are small, but some extreme values exceed 1000,
which is likely a data anomaly.
➢ Latitude & Longitude
These geographical features have a wider range and no significant outliers.This is
expected since they represent locations in California.
➢ MedHouseVal (Median House Value)
The upper limit has many outliers, indicating a capped dataset (house values may be
restricted, possibly at $500,000). Suggests many high-value properties were grouped
at this limit.
Key Observations
Data summary:
AIM:
To perform exploratory data analysis (EDA) on the California Housing dataset by computing
and visualizing the correlation matrix to understand the relationships between pairs of
features. Additionally, the program aims to create pair plots to visualize pairwise
relationships between features, providing deeper insights into the dataset.
OBJECTIVES:
• Load the California Housing dataset into a Pandas DataFrame.
• Compute the correlation matrix of all numerical features and visualize it using a heatmap.
• Create a pair plot to visualize pairwise relationships between the features.
• Interpret the correlations and pairwise trends to understand feature relationships.
PROGRAM:
# Importing Libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
EXPLANATION:
Importing Libraries:
• pandas is used for handling tabular data (dataframes).
• seaborn: A visualization library based on matplotlib that provides a high-level
interface for drawing attractive statistical graphics.
• matplotlib.pyplot is used for plotting graphs.
• fetch_california_housing from sklearn.datasets loads the California Housing
dataset.
sns.pairplot(data): Creates pairwise scatter plots for all numerical features in the dataset.
data: The dataset to plot.
diag_kind='kde': Uses Kernel Density Estimation (KDE) for diagonal plots instead of
histograms to show data distribution.
plot_kws={'alpha': 0.5}: Sets the transparency level of the plots to 50% for better visibility.
plt.suptitle(): Adds a title to the pair plot, adjusting its position (y=1.02).
plt.show(): Displays the pair plot.
3. Identifying Patterns: Useful for detecting multiple peaks (modes) in data, indicating
multimodal distributions.
Properties of a PDF:
• f(x) ≥ 0 for all x (densities cannot be negative).
• The total area under the curve is 1: ∫ f(x) dx = 1.
The Normal distribution (bell curve) is one of the most common PDFs, defined as:
f(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²))
Where:
μ is the mean (the centre of the distribution) and σ is the standard deviation (its spread).
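For reference, a compact self-contained sketch of the correlation-matrix and pair-plot analysis
described above is given below; the figure size, colour map and plot titles are illustrative
choices.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing

# Load the California Housing dataset as a DataFrame (features plus target)
data = fetch_california_housing(as_frame=True).frame

# Correlation matrix heatmap
plt.figure(figsize=(10, 8))
correlation_matrix = data.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of California Housing Features')
plt.show()

# Pair plot with KDE on the diagonal and semi-transparent points
sns.pairplot(data, diag_kind='kde', plot_kws={'alpha': 0.5})
plt.suptitle('Pairwise Relationships of California Housing Features', y=1.02)
plt.show()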
OUTPUT:
Figure1 shows the Correlation Matrix Heatmap of California Housing Features. The heatmap
visualizes the correlation coefficients between the different housing-related variables in the
California Housing dataset
The diagonal elements are all 1.00 because a variable is always perfectly correlated with
itself.
Geographical Trends: The strong negative correlation between Latitude and Longitude
suggests location-based patterns in housing prices.
This analysis helps in understanding feature relationships, guiding feature selection, data
preprocessing, and model building for price prediction in the California housing dataset.
The pair plot shown in Figure2.2 visualizes relationships between multiple variables in the
California Housing dataset. It includes scatter plots for each pair of features and histograms
or KDE plots on the diagonal.
Key Observations:
• AveRooms vs. AveBedrms: Strong positive correlation, as houses with more rooms
generally have more bedrooms.
• MedInc (Median Income) and MedHouseVal (Median House Value) show a clear
positive trend, indicating that areas with higher incomes tend to have higher house
values.
The pair plot helps in understanding relationships, feature importance, and detecting
outliers before using the dataset for predictive modeling. The observed trends, especially
between MedInc and MedHouseVal, confirm that income levels play a major role in house
prices.
AIM:
To visualize the Iris dataset using Principal Component Analysis (PCA) to reduce its
dimensionality from 4 to 2 dimensions. The reduced data is then plotted to observe the
separation of different Iris species.
OBJECTIVES:
➢ Load and Explore Data: Load the Iris dataset and convert it to a pandas DataFrame
for easier visualization and manipulation.
➢ Dimensionality Reduction: Apply PCA to reduce the dimensionality of the dataset
from 4 features to 2 principal components.
➢ Data Transformation: Create a new DataFrame containing the reduced data along
with the corresponding labels.
➢ Visualization: Plot the reduced data on a 2D plane, with each point representing a
sample from the Iris dataset and colors indicating different Iris species.
➢ Analysis and Interpretation: Analyze the plot to determine the separation and
clustering of the different Iris species based on the principal components.
PROGRAM:
# Importing Libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# Load the Iris dataset
iris = load_iris()
data = iris.data
labels = iris.target
label_names = iris.target_names
# Convert to a DataFrame for better visualization
iris_df = pd.DataFrame(data, columns=iris.feature_names)
# Perform PCA to reduce dimensionality to 2
pca = PCA(n_components=2)
data_reduced = pca.fit_transform(data)
EXPLANATION:
Importing Libraries
numpy (np): A library for numerical operations.
pandas (pd): A library for data manipulation and analysis.
• Creates a DataFrame with the reduced data, naming the columns 'Principal
Component 1' and 'Principal Component 2'.
• Adds the Label column containing the original class labels to the DataFrame.
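The plotting steps described above can be sketched as follows. This is a self-contained
version in which the DataFrame and column names follow the explanation, while the colours
and figure size are illustrative.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Load the Iris dataset and reduce it to two principal components
iris = load_iris()
pca = PCA(n_components=2)
data_reduced = pca.fit_transform(iris.data)

# Build a DataFrame with the reduced data and the class labels
reduced_df = pd.DataFrame(data_reduced,
                          columns=['Principal Component 1', 'Principal Component 2'])
reduced_df['Label'] = iris.target

# Plot each Iris species with a different colour
plt.figure(figsize=(8, 6))
for label, colour in zip(np.unique(iris.target), ['r', 'g', 'b']):
    subset = reduced_df[reduced_df['Label'] == label]
    plt.scatter(subset['Principal Component 1'],
                subset['Principal Component 2'],
                label=iris.target_names[label], color=colour)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA on Iris Dataset')
plt.legend()
plt.show()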
OUTPUT:
Figure3.1 is a visual representation of the Iris dataset after applying Principal Component
Analysis (PCA) to reduce its dimensionality from 4 to 2 components.
4. For a given set of training data examples stored in a .CSV file, implement and
demonstrate the Find-S algorithm to output a description of the set of all hypotheses
consistent with the training examples.
AIM:
To implement the Find-S algorithm, which is a simple machine learning algorithm used for
learning the most specific hypothesis that fits all the positive examples in a given training
dataset.
OBJECTIVES:
➢ Load Training Data: Import the training data from a CSV file using the Pandas
library to create a DataFrame.
➢ Initialize Hypothesis: Set the initial hypothesis to the most general hypothesis (i.e.,
all attributes are set to '?').
➢ Iterate Through Examples: Iterate through each example in the training data to
refine the hypothesis.
➢ Update Hypothesis: For each positive example (where the class label is 'Yes'),
update the hypothesis by keeping only the consistent attribute values and generalizing
others.
➢ Output Hypothesis: After processing all positive examples, output the final
hypothesis, which represents the most specific description of the positive examples.
PROGRAM:
print(data)
# Extracting Attributes and Class Labels
attributes = data.columns[:-1]
class_label = data.columns[-1]
# Initializing the Hypothesis
hypothesis = ['?' for _ in attributes]
EXPLANATION:
import pandas as pd imports the pandas library, which is essential for handling and
manipulating the dataset.
def find_s_algorithm(file_path): defines the find_s_algorithm function, which takes a
single argument file_path representing the path to the training data CSV file.
Reading the CSV File
data = pd.read_csv(file_path)
print("Training data:")
print(data)
The block reads the CSV file into a DataFrame named data and prints the training data. The
pd.read_csv(file_path) function is used to read the CSV file.
for index, row in data.iterrows(): starts a loop that iterates through each row of the training
data. data.iterrows() returns an iterator that produces index and row pairs.
if row[class_label] == 'Yes': This condition checks if the class label of the current row is
'Yes'. Only positive examples are used to update the hypothesis.
hypothesis[i] = value
else:
hypothesis[i] = '?'
This nested loop iterates through each attribute value in the current row. The hypothesis is
updated as follows:
• If the hypothesis at index i is '?' or matches the current attribute value, it is set to the
current value; otherwise, that attribute is generalized to '?'.
return hypothesis returns the final hypothesis after processing all positive examples.
file_path = r'C:\Users\Nalini\Downloads\training_data.csv'
hypothesis = find_s_algorithm(file_path)
These lines specify the file path to the training data, call the find_s_algorithm function, and
print the final hypothesis.
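Putting the pieces together, a self-contained sketch of the Find-S procedure described above
is shown below. It assumes a CSV file whose last column is the class label, with positive
examples marked 'Yes'; the hypothesis is seeded from the first positive example (the classical
formulation), and the file name training_data.csv is illustrative.

import pandas as pd

def find_s_algorithm(file_path):
    # Read and display the training data
    data = pd.read_csv(file_path)
    print("Training data:")
    print(data)

    # The last column is the class label, all others are attributes
    attributes = data.columns[:-1]
    class_label = data.columns[-1]

    hypothesis = None
    for _, row in data.iterrows():
        # Only positive examples are used by Find-S
        if row[class_label] == 'Yes':
            values = list(row[attributes])
            if hypothesis is None:
                # First positive example gives the most specific hypothesis
                hypothesis = values
            else:
                # Generalize attributes that disagree with this example
                hypothesis = [h if h == v else '?'
                              for h, v in zip(hypothesis, values)]
    return hypothesis

file_path = 'training_data.csv'   # illustrative file name
print("Final hypothesis:", find_s_algorithm(file_path))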
OUTPUT:
Training Data
The training data consists of various weather conditions and whether or not tennis was played
(PlayTennis column). The attributes are Outlook, Temperature, Humidity, and Wind.
The Find-S algorithm aims to find the most specific hypothesis that fits all positive examples
(i.e., where PlayTennis is 'Yes').
Positive Examples
Here are the positive examples from the training data (rows where PlayTennis is 'Yes'):
Hypothesis Updates
10. Example 9 (Overcast, Hot, Normal, Weak): ['Overcast', '?', 'Normal', '?']
Final Hypothesis
The Find-S algorithm provides the most specific hypothesis that fits all positive examples in
the training data. In this case, the final hypothesis suggests that if the Outlook is 'Overcast'
and Humidity is 'Normal', then PlayTennis is 'Yes', regardless of Temperature and Wind.
AIM:
To implement a k-Nearest Neighbors (k-NN) classifier to classify a set of data points based
on their Euclidean distance to the nearest neighbors in the training dataset. The program
explores the effect of different values of k on the classification accuracy and performance.
OBJECTIVES:
➢ Generate 100 random values in [0, 1]; label the first 50 points as Class1 (x <= 0.5) or Class2 (x > 0.5).
➢ Implement the Euclidean distance function and a k-NN classifier based on majority voting.
➢ Classify the remaining 50 points for different values of k.
➢ Visualize and compare the classification results obtained for each k.
PROGRAM:
#Import Libraries:
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
print("Training dataset: First 50 points labeled based on the rule (x <= 0.5 -> Class1, x > 0.5 -
> Class2)")
print("Testing dataset: Remaining 50 points to be classified\n")
plt.ylabel("Classification Level")
plt.legend()
plt.grid(True)
plt.show()
EXPLANATION:
Import Libraries:
numpy: For numerical operations.
matplotlib.pyplot: For plotting the classification results.
collections.Counter: To count the frequency of elements.
Generate Random Data:
data = np.random.rand(100) creates an array of 100 random values uniformly distributed
between 0 and 1.
Label Data Points:
labels = ["Class1" if x <= 0.5 else "Class2" for x in data[:50]]
Label the first 50 data points as "Class1" if the value is less than or equal to 0.5, otherwise
label them as "Class2".
Define Euclidean Distance Function:
def euclidean_distance(x1, x2):
return np.sqrt((x1 - x2)**2)
calculates the Euclidean distance between two points x1 and x2.
Define k-NN classifier function:
def knn_classifier(train_data, train_labels, test_point, k): classifies a new test point using
the k-nearest neighbors (k-NN) algorithm.
distances = [(euclidean_distance(test_point, train_data[i]), train_labels[i]) for i in
range(len(train_data))] computes the distance between the test point and every training data
point. Stores distances along with corresponding labels.
distances.sort(key=lambda x: x[0]) sorts distances in ascending order (smallest distance
first).
k_nearest_neighbors = distances[:k] selects the k closest training points.
k_nearest_labels = [label for _, label in k_nearest_neighbors] extracts the labels of these
k nearest points.
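Combining the functions described above, a minimal self-contained version of the k-NN
experiment might look like this; the random seed, the 50/50 train-test split and the value
k = 3 are illustrative choices, and the plot from the listing can be added in the same way.

import numpy as np
from collections import Counter

def euclidean_distance(x1, x2):
    # Distance between two points (1D here, so this is just |x1 - x2|)
    return np.sqrt((x1 - x2) ** 2)

def knn_classifier(train_data, train_labels, test_point, k):
    # Distance from the test point to every training point, with its label
    distances = [(euclidean_distance(test_point, train_data[i]), train_labels[i])
                 for i in range(len(train_data))]
    # Keep the k closest neighbours
    distances.sort(key=lambda x: x[0])
    k_nearest_labels = [label for _, label in distances[:k]]
    # Majority vote among the k nearest labels
    return Counter(k_nearest_labels).most_common(1)[0][0]

# Generate 100 random points in [0, 1]; label the first 50 by the x <= 0.5 rule
np.random.seed(0)                      # illustrative seed for reproducibility
data = np.random.rand(100)
labels = ["Class1" if x <= 0.5 else "Class2" for x in data[:50]]

train_data, train_labels = data[:50], labels
test_data = data[50:]

k = 3                                  # illustrative choice of k
predictions = [knn_classifier(train_data, train_labels, x, k) for x in test_data]
for x, p in zip(test_data[:5], predictions[:5]):
    print(f"x = {x:.3f} -> {p}")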
OUTPUT:
The following output graphs represent k-Nearest Neighbors (k-NN) classification results for
different values of k using a simple 1D dataset. Below is an analysis of the graphs:
General Layout:
• The blue circles at the bottom (classification level = 0) represent the training data.
• The blue and red crosses at the top (classification level = 1) represent test data
classified as Class1 and Class2, respectively.
• The horizontal axis represents the data points (values between 0 and 1), and the
classification is performed using k-NN.
• k=1:
o Each test point is classified based on its nearest neighbor, leading to potential
misclassification due to local outliers.
• k=2:
o There is some smoothing compared to k=1, but decisions are still largely
based on very local patterns.
• k=3 and k=4:
o The decision boundary between Class1 (blue crosses) and Class2 (red
crosses) becomes more distinct.
Key Takeaways:
If the dataset has a clear boundary, choosing a small k might work well. If the dataset has
some noise, a moderate k (like 3-5) is ideal. If the dataset is highly variable, a larger k
smooths the results but may oversimplify the classification.
AIM:
To implement Locally Weighted Regression (LWR) using a Gaussian kernel, fit the model to
a noisy sine wave dataset, and visualize the resulting fit. The program demonstrates how
LWR can be used to create smooth approximations of noisy data by weighing training points
based on their proximity to the query point.
OBJECTIVES:
➢ Generate Data:
Create a noisy sine wave dataset to be used for training and testing the LWR model.
➢ Define Gaussian Kernel Function:
Implement a function to calculate the Gaussian kernel, which will be used to weigh
training points.
➢ Implement LWR Function:
Develop the Locally Weighted Regression function to fit a model to the data,
considering weights based on the Gaussian kernel.
➢ Make Predictions:
Use the LWR function to make predictions for new data points, creating a smooth
curve that approximates the underlying sine wave.
➢ Visualize Results: Plot the training data and the LWR fit to visually compare the
noisy data and the smooth approximation.
PROGRAM:
import numpy as np
# Calculate the Gaussian kernel (similarity) between point x and xi with bandwidth tau
m = X.shape[0]
# Calculate weights using the Gaussian kernel for all training points with respect to x
W = np.diag(weights)
X_transpose_W = X.T @ W
# Compute the coefficients theta by solving the weighted least squares problem
return x @ theta
np.random.seed(42)
X_bias = np.c_[np.ones(X.shape), X]
tau = 0.5
# Make predictions for each test point using the LWR function
plt.figure(figsize=(10, 6))
plt.xlabel('X', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.legend(fontsize=10)
plt.grid(alpha=0.3)
plt.show()
EXPLANATION:
Importing Libraries:
numpy: A library for numerical operations and handling arrays.
matplotlib.pyplot: A library for plotting graphs and visualizing data.
Defining the Gaussian Kernel Function:
This function calculates the Gaussian kernel, which is a measure of similarity between two
points x and xi.
The parameter tau controls the bandwidth of the kernel. Smaller tau results in a narrower
kernel, giving more weight to closer points.
Defining the Locally Weighted Regression (LWR) Function:
np.random.seed(42): Sets the random seed for reproducibility, ensuring the same random
values are generated each time.
X: Generates 100 evenly spaced points between 0 and 2π.
y: Generates noisy sine wave data by adding random noise to the sine of X.
Adding Bias Term to Training Data:
X_bias: Adds a column of ones to X to account for the bias term in linear regression.
Generating Test Data and Adding Bias Term:
tau = 0.5: Sets the bandwidth parameter, controlling the influence of nearby points.
y_pred: Calculates predictions for each point in x_test_bias using the LWR function,
generating a smooth approximation of the noisy sine wave data.
Plotting the Results:
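A self-contained sketch of the LWR procedure described above is given below. The value
tau = 0.5 and the 100 noisy sine-wave training points follow the explanation, while the noise
level (0.1), the number of test points and the plot styling are illustrative choices.

import numpy as np
import matplotlib.pyplot as plt

def gaussian_kernel(x, xi, tau):
    # Similarity between query point x and training point xi
    return np.exp(-np.sum((x - xi) ** 2) / (2 * tau ** 2))

def locally_weighted_regression(x, X, y, tau):
    m = X.shape[0]
    # Weight of every training point with respect to the query point x
    weights = np.array([gaussian_kernel(x, X[i], tau) for i in range(m)])
    W = np.diag(weights)
    # Weighted least squares: theta = (X^T W X)^-1 X^T W y
    X_transpose_W = X.T @ W
    theta = np.linalg.inv(X_transpose_W @ X) @ X_transpose_W @ y
    return x @ theta

# Noisy sine wave training data
np.random.seed(42)
X = np.linspace(0, 2 * np.pi, 100)
y = np.sin(X) + 0.1 * np.random.randn(100)

# Add the bias (intercept) column
X_bias = np.c_[np.ones(X.shape), X]

# Query points at which to predict
x_test = np.linspace(0, 2 * np.pi, 200)
x_test_bias = np.c_[np.ones(x_test.shape), x_test]

tau = 0.5
y_pred = np.array([locally_weighted_regression(xt, X_bias, y, tau)
                   for xt in x_test_bias])

plt.figure(figsize=(10, 6))
plt.scatter(X, y, color='red', label='Training Data', alpha=0.7)
plt.plot(x_test, y_pred, color='blue', label=f'LWR Fit (tau={tau})', linewidth=2)
plt.xlabel('X', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.legend(fontsize=10)
plt.grid(alpha=0.3)
plt.show()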
OUTPUT:
The output of the program is a plot that visualizes the results of Locally Weighted Regression
(LWR) applied to a noisy sine wave dataset. Let's analyze and explain the various
components of the plot:
1. Training Data:
o Red Scatter Points: These points represent the noisy sine wave data used to
train the LWR model. The X-axis values range from 0 to 2π, and the Y-
axis values are the sine of the X values with added Gaussian noise. The noise
simulates real-world data that often contains some randomness or variability.
2. LWR Fit:
o Blue Curve: This smooth curve represents the fit of the LWR model to the
noisy training data. The LWR model makes predictions for new data points by
considering the weights of nearby training points. The bandwidth parameter
(tau) controls the influence of nearby points. In this case, tau is set to 0.5,
resulting in a fit that balances smoothness and adherence to the training data.
o X-Axis (X): Represents the input values, which range from 0 to 2π. These
are the points at which the sine function is evaluated.
o Y-Axis (y): Represents the output values of the sine function with added
noise. These values are used as the target values for the LWR model.
4. Plot Details:
o Labels: The X-axis is labeled "X" and the Y-axis is labeled "y". These labels
provide context for the input and output values.
o Legend: The legend differentiates between the training data (red scatter
points) and the LWR fit (blue curve). The legend also includes the value of tau
used in the LWR model.
o Grid: A light grid is added to the plot for better visualization of data points
and the fit.
• The LWR Fit (blue curve) closely follows the general trend of the noisy sine wave
data (red points).
• The smooth curve suggests that the LWR model has successfully captured the
underlying sine wave pattern despite the presence of noise.
• The chosen tau value (0.5) results in a fit that is neither too rigid nor too flexible. A
smaller tau would lead to a more flexible fit, potentially overfitting the noise, while a
larger tau would lead to a smoother fit, potentially underfitting the data.
The plot effectively demonstrates the capability of Locally Weighted Regression to handle
noisy data and provide a smooth approximation of the underlying pattern. By assigning
higher weights to nearby points, the LWR model is able to make accurate local predictions
while maintaining overall smoothness.
AIM:
To demonstrate Linear Regression on the California Housing dataset and Polynomial
Regression on the Auto MPG dataset, and to evaluate both models using Mean Squared Error
(MSE) and the R² score.
OBJECTIVES:
➢ Load the California Housing dataset and train a Linear Regression model using the
single feature AveRooms to predict median house values.
➢ Load the Auto MPG dataset and train a degree-2 Polynomial Regression model using
the Displacement feature to predict miles per gallon (MPG).
➢ Evaluate both models using MSE and R², and visualize the actual versus predicted
values.
PROGRAM:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
# Create and train the polynomial regression model (degree=2) with standard scaling
poly_model = make_pipeline(PolynomialFeatures(degree=2), StandardScaler(),
LinearRegression())
poly_model.fit(X_train, y_train)
Importing Libraries:
Main Function:
• Demonstrates the linear regression on the California Housing dataset and polynomial
regression on the Auto MPG dataset by calling the respective functions.
This program demonstrates how to perform linear and polynomial regression on different
datasets and evaluates the model's performance using appropriate metrics.
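A self-contained sketch of the two regressions described in this experiment is given below.
The Auto MPG data is loaded here from seaborn's bundled 'mpg' dataset as a convenient
stand-in source (the original listing may fetch it differently), so the MSE and R² values
printed will not exactly match the figures quoted in the output discussion.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error, r2_score

# ----- Linear Regression: California Housing, single feature "AveRooms" -----
housing = fetch_california_housing(as_frame=True)
X = housing.data[["AveRooms"]]
y = housing.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lin_model = LinearRegression()
lin_model.fit(X_train, y_train)
y_pred = lin_model.predict(X_test)
print("Linear Regression - MSE:", mean_squared_error(y_test, y_pred))
print("Linear Regression - R2 :", r2_score(y_test, y_pred))

# Actual prices (blue points) versus the fitted line (red), similar to Figure7.2
order = np.argsort(X_test["AveRooms"].values)
plt.scatter(X_test["AveRooms"], y_test, color='blue', alpha=0.4, label='Actual')
plt.plot(X_test["AveRooms"].values[order], y_pred[order], color='red', label='Predicted')
plt.xlabel('AveRooms')
plt.ylabel('Median House Value ($100,000s)')
plt.legend()
plt.show()

# ----- Polynomial Regression (degree 2): Auto MPG, feature "displacement" -----
mpg = sns.load_dataset("mpg").dropna(subset=["displacement", "mpg"])
X = mpg[["displacement"]]
y = mpg["mpg"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

poly_model = make_pipeline(PolynomialFeatures(degree=2), StandardScaler(), LinearRegression())
poly_model.fit(X_train, y_train)
y_pred = poly_model.predict(X_test)
print("Polynomial Regression - MSE:", mean_squared_error(y_test, y_pred))
print("Polynomial Regression - R2 :", r2_score(y_test, y_pred))

# Actual MPG (blue points) versus predicted MPG (red points), similar to Figure7.3
plt.scatter(X_test["displacement"], y_test, color='blue', alpha=0.5, label='Actual')
plt.scatter(X_test["displacement"], y_pred, color='red', alpha=0.7, label='Predicted')
plt.xlabel('Displacement')
plt.ylabel('MPG')
plt.legend()
plt.show()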
OUTPUT:
Figure7.1 shows the output for the program. The output of the program demonstrates the
effectiveness of linear and polynomial regression on different datasets. In the case of Linear
Regression using the California Housing dataset, the model was trained using only one
feature, "AveRooms" (average number of rooms per dwelling), to predict median house
values. The Mean Squared Error (MSE) of 1.2923 indicates a relatively high error in
predictions, and the R² score of 0.0138 suggests that the model explains only 1.38% of the
variance in house prices. This poor performance is likely due to the fact that house prices are
influenced by multiple factors such as income levels, crime rates, and proximity to amenities,
which were not considered in this model. On the other hand, the Polynomial Regression on
the Auto MPG dataset shows significantly better results. The model used the
"Displacement" feature to predict miles per gallon (MPG), achieving an MSE of 0.7431 and
an R² score of 0.7506, indicating that the model explains 75.06% of the variance in fuel
efficiency. This suggests that polynomial regression is a better fit for this dataset, likely due
to the non-linear relationship between engine displacement and fuel efficiency. Overall, while
linear regression struggled due to an insufficient feature set, polynomial regression
demonstrated a strong correlation, emphasizing the importance of selecting appropriate
features and models for different datasets. To improve the performance of linear regression
on the California Housing dataset, it would be beneficial to include additional relevant
features such as income levels, population, and location-related attributes.
The plot in the Figure7.2 visualizes the Linear Regression model applied to the California
Housing dataset, where the x-axis represents the average number of rooms per dwelling
(AveRooms) and the y-axis represents the median house value (in $100,000s). The blue
scatter points depict the actual house prices, while the red line represents the model's
predicted values.
From the visualization, it is evident that the model does not perform well. The predicted trend
line (red) does not closely fit the actual data points, suggesting a poor linear relationship
between the number of rooms and house prices. Most data points are clustered around lower
values of "AveRooms" (0-10), but the model extends the prediction line unrealistically for
higher room counts, leading to inaccurate predictions. This aligns with the low R² score of
0.0138 observed earlier, indicating that only 1.38% of the variance in house prices is
explained by the model.
A likely reason for this poor performance is that house prices are influenced by multiple
factors, such as income levels, location, population density, and other economic variables,
rather than just the number of rooms. To improve the model's predictive power, incorporating
additional relevant features like median income, crime rate, and proximity to amenities would
be necessary.
The plot shown in the Figure7.3 represents the Polynomial Regression model applied to the
Auto MPG dataset, where the x-axis represents the engine displacement and the y-axis
represents the fuel efficiency (miles per gallon - MPG). The blue scatter points indicate the
actual MPG values, while the red points represent the model's predicted values.
From the visualization, it is evident that the polynomial regression model performs well in
capturing the trend between displacement and fuel efficiency. Unlike linear regression, which
would fit a straight line, polynomial regression allows for a more flexible curve, better
accommodating the relationship between engine displacement and MPG. The predicted
points (red) closely follow the distribution of actual data points (blue), showing a strong
correlation.
8. Develop a program to demonstrate the working of the decision tree algorithm. Use
Breast Cancer Data set for building the decision tree and apply this knowledge to
classify a new sample.
AIM:
To build a Decision Tree classifier using the Breast Cancer dataset to predict whether a given
tumor is benign or malignant. The objective is to evaluate the classifier's accuracy and
visualize the resulting Decision Tree.
OBJECTIVES:
➢ Data Splitting:
• Split the dataset into training and testing sets using an 80-20 split ratio.
➢ Model Building:
➢ Model Evaluation:
• Use the trained classifier to predict the class (benign or malignant) of a new
sample from the testing set.
➢ Visualization:
• Visualize the Decision Tree using matplotlib and sklearn's plot_tree function.
• Include feature names and class names in the visualization for better
interpretation.
PROGRAM:
# Split the dataset into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the Decision Tree classifier with a random state for reproducibility
clf = DecisionTreeClassifier(random_state=42)
y_pred = clf.predict(X_test)
EXPLANATION:
Importing Necessary Libraries:
• numpy as np: Imports NumPy, a library for numerical operations, using alias np.
• matplotlib.pyplot as plt: Imports Matplotlib’s pyplot module for plotting.
• load_breast_cancer: Imports the breast cancer dataset from sklearn.datasets.
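A self-contained sketch of the complete experiment, consistent with the listing and the
explanation above, is given below; taking the first test-set row as the "new sample" is an
illustrative choice.

import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score

# Load the Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# 80-20 train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the Decision Tree and report accuracy on the test set
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

# Classify one new sample taken from the test set
new_sample = X_test[0].reshape(1, -1)
prediction = clf.predict(new_sample)
print("Predicted class for the new sample:", data.target_names[prediction][0])

# Visualize the tree with feature and class names
plt.figure(figsize=(20, 10))
plot_tree(clf, filled=True, feature_names=data.feature_names, class_names=data.target_names)
plt.show()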
OUTPUT:
Figure8.1 shows a Decision Tree visualization for the Breast Cancer dataset, generated using
the tree.plot_tree() function in sklearn.
• Each leaf node (bottom-most nodes) represents a final classification (either Malignant
or Benign).
• The branches represent decision paths taken based on feature values.
Consider, for example, classifying a new sample with:
• worst radius = 15
• worst texture = 20
Steps in classification:
1. The root node checks if worst radius ≤ 16.795. Since 15 < 16.795, move left.
2. Next, check worst texture ≤ 18.5. Since 20 > 18.5, move right.
3. Continue this process until reaching a leaf node (final classification).
The root node contains the most important feature for classification. Each level represents a
split based on the most discriminative features.
The tree depth affects complexity: A deeper tree means more splits, but it may lead to
overfitting. Most branches lead to highly pure leaf nodes, meaning the model is confident in
its classifications.
AIM:
To implement a facial recognition system using the Olivetti Faces dataset and evaluate its
performance using a Gaussian Naive Bayes classifier.
OBJECTIVES:
➢ Data Preparation:
Fetch the Olivetti Faces dataset, shuffle the data, and split it into training and testing
sets.
➢ Model Implementation:
Train a Gaussian Naive Bayes classifier on the training data.
➢ Performance Evaluation:
Assess the accuracy of the classifier using the test data and generate performance
metrics including classification report and confusion matrix.
➢ Cross-validation:
Perform cross-validation to validate the model's robustness and report the average
accuracy.
➢ Visualization:
Visualize the results by displaying sample test images along with their true and
predicted labels.
PROGRAM:
import numpy as np
# Importing the Olivetti Faces dataset
from sklearn.datasets import fetch_olivetti_faces
# For splitting the data and cross-validation
from sklearn.model_selection import train_test_split, cross_val_score
# Importing the Gaussian Naive Bayes classifier
from sklearn.naive_bayes import GaussianNB
# Importing evaluation metrics
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# For visualizing images and results
import matplotlib.pyplot as plt
# Fetch the Olivetti Faces dataset, shuffle it, and ensure reproducibility with a random
#state
data = fetch_olivetti_faces(shuffle=True, random_state=42)
X = data.data # Features (image data)
y = data.target # Labels (person IDs)
# Visualize some of the test images along with their true and predicted labels
fig, axes = plt.subplots(3, 5, figsize=(12, 8))
for ax, image, label, prediction in zip(axes.ravel(), X_test, y_test, y_pred):
ax.imshow(image.reshape(64, 64), cmap=plt.cm.gray) # Display the image in grayscale
# Set the title to show true and predicted labels
ax.set_title(f"True: {label}, Pred: {prediction}")
ax.axis('off') # Hide the axes
plt.show()
EXPLANATION:
Importing Required Libraries:
import numpy as np imports the NumPy library, which is used for numerical computations
and handling arrays.
from sklearn.datasets import fetch_olivetti_faces imports the fetch_olivetti_faces function
from sklearn.datasets, which loads the Olivetti Faces dataset (a dataset containing grayscale
images of faces).
from sklearn.model_selection import train_test_split, cross_val_score
• train_test_split: Splits the dataset into training and testing sets.
• cross_val_score: Performs cross-validation to evaluate the model.
from sklearn.naive_bayes import GaussianNB imports the GaussianNB classifier, a Naive
Bayes classifier based on a Gaussian (normal) distribution.
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
• accuracy_score: Measures the accuracy of the model.
• classification_report: Provides detailed metrics (precision, recall, F1-score) for each
class.
• confusion_matrix: Shows how many times each class was correctly or incorrectly
classified.
import matplotlib.pyplot as plt imports matplotlib.pyplot for visualizing images and results.
X = data.data extracts the feature matrix (X), where each row represents an image (flattened
into a 1D array).
y = data.target extracts the target labels (y), which represent the person ID (0–39, as there
are 40 individuals).
Making Predictions:
y_pred = gnb.predict(X_test) uses the trained model to predict the labels of the test dataset
• Prints the confusion matrix, which shows the number of correct and incorrect
predictions per class.
Performing Cross-Validation:
for ax, image, label, prediction in zip(axes.ravel(), X_test, y_test, y_pred): Loops over
each test image, its true label, and its predicted label.
ax.set_title(f"True: {label}, Pred: {prediction}") sets the title of the image to show both
the true and predicted labels.
plt.show() displays the plot with test images and their predicted labels.
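The training and evaluation steps described above but not visible in the listing can be
sketched as follows; the 70-30 split ratio and 5-fold cross-validation used here are
assumptions, so the printed numbers may differ slightly from those quoted in the output.
The X_test images can then be visualized with the plotting loop shown in the listing.

import numpy as np
from sklearn.datasets import fetch_olivetti_faces
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the data as in the listing above
data = fetch_olivetti_faces(shuffle=True, random_state=42)
X, y = data.data, data.target

# Split into training and testing sets (70-30 split assumed here)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the Gaussian Naive Bayes classifier and predict on the test set
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)

# Evaluation: accuracy, per-class metrics and confusion matrix
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# 5-fold cross-validation on the full dataset
scores = cross_val_score(gnb, X, y, cv=5)
print("Cross-validation accuracy:", scores.mean())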
OUTPUT:
Figure9.1 provides an evaluation of the Gaussian Naive Bayes classifier on the Olivetti Faces
dataset.
Accuracy:
The model correctly predicted 80.83% of the test samples. This is a decent accuracy for a
face recognition task using a simple Naive Bayes classifier.
Classification Report:
The classification report includes precision, recall, and F1-score for each class (person ID).
• Many classes have precision = 1.00, meaning no false positives for those classes.
• Some classes have lower recall (e.g., class 2 with recall = 0.67) → This means some
samples from this class were misclassified.
• Macro avg (0.89 precision, 0.85 recall, 0.83 F1-score) suggests that overall
performance across all classes is reasonably high.
Confusion Matrix: A confusion matrix shows how many samples were correctly classified
vs. misclassified.
Figure9.1 Evaluation of the Gaussian Naive Bayes classifier on the Olivetti Faces
dataset.
Cross-Validation Accuracy:
• Cross-validation (87.25%) is slightly higher than test accuracy (80.83%), meaning the
model performs well on unseen data.
• This suggests that the classifier is generalizing well across different folds of data.
Figure9.2 shows some test images along with their true labels and predicted labels.
Figure9.2: Test images along with their true labels and predicted labels
Observations:
• Correct Predictions: Many images have their true and predicted labels matching
(e.g., True: 18, Pred: 18 and True: 0, Pred: 0). This indicates the model is able to
correctly classify several faces.
• Misclassifications: Some images are misclassified (e.g., one image with True: 5 is
predicted as Pred: 16). The misclassified images seem to share facial similarities with the
predicted classes (e.g., similar glasses, facial structure).
• Possible Causes of Errors:
o Similar Facial Features: Some individuals may have similar facial
expressions, glasses, or face shapes, leading to confusion.
o Low-Resolution Images: The 64×64 grayscale format may lose fine details,
making classification harder.
o Gaussian Naive Bayes Limitations: It assumes feature independence,
which is not ideal for image-based data.
o Lighting and Expressions: Different lighting conditions and facial
expressions may affect classification.
AIM:
To apply machine learning techniques to the Breast Cancer dataset in order to analyze and
visualize the effectiveness of K-Means clustering for classifying data points into different
clusters, and to evaluate its performance compared to the true labels.
OBJECTIVES:
➢ Data Preparation:
• Load the Breast Cancer dataset and extract features and labels.
• Standardize the features to ensure they have a mean of 0 and variance of 1,
which is essential for many machine learning algorithms.
➢ Clustering Analysis:
• Apply the K-Means clustering algorithm to the standardized dataset.
• Predict cluster labels for each data point using the K-Means algorithm.
➢ Performance Evaluation:
• Evaluate the clustering results by comparing the predicted cluster labels with
the true labels.
• Generate a confusion matrix and classification report to assess the
performance of the clustering algorithm.
➢ Dimensionality Reduction:
• Apply Principal Component Analysis (PCA) to reduce the dimensionality of
the dataset to two principal components.
• Create a DataFrame containing the PCA components, predicted cluster labels,
and true labels for visualization.
➢ Visualization:
• Visualize the K-Means clustering results by plotting the data points in the
PCA-reduced space and coloring them by their cluster labels.
• Visualize the true labels of the data points in the PCA-reduced space for
comparison.
• Plot the K-Means clustering results along with the cluster centroids to provide
a clear representation of the clustering outcome.
PROGRAM:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix, classification_report
EXPLANATION:
Import Required Libraries:
• KMeans is used to apply clustering.
• StandardScaler ensures all features have the same scale.
• PCA reduces high-dimensional data to 2D for visualization.
• confusion_matrix and classification_report help assess clustering performance.
• Standardization ensures all features contribute equally by removing mean and scaling
variance.
• StandardScaler() transforms X so that:
o Mean = 0
o Standard Deviation = 1
K-Means is sensitive to feature magnitudes, and standardization prevents bias towards large
values.
• Since K-Means is an unsupervised algorithm, it does not know the correct classes.
• We compare true labels (y) vs. cluster assignments (y_kmeans).
• Confusion Matrix: Shows how well clustering matches the true labels.
• Classification Report: Provides precision, recall, F1-score for each cluster.
PCA (Principal Component Analysis) reduces the 30D dataset into 2D (PC1 & PC2). Since
we cannot visualize 30 dimensions, PCA projects data onto two principal axes.
• This block of statements plots a plot that visualizes K-Means cluster assignments.
• Each point represents a tumor sample, colored by its cluster.
• Clusters are plotted using PC1 and PC2.
• Color represents K-Means cluster (not the actual diagnosis).
• Helps understand how well clustering separates malignant and benign cases.
• This plot is similar to the first one but adds cluster centroids.
• Red "X" markers represent cluster centers (computed from K-Means).
• Centroids show where K-Means thinks the center of each group is.
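A self-contained sketch of the clustering pipeline described above is given below; the plot
styling is illustrative. Note that K-Means assigns arbitrary cluster numbers, so the 0/1
cluster labels may need to be flipped before comparing them with the true labels.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix, classification_report

# Load the data and standardize the 30 features
data = load_breast_cancer()
X, y = data.data, data.target
X_scaled = StandardScaler().fit_transform(X)

# K-Means with two clusters (benign / malignant)
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
y_kmeans = kmeans.fit_predict(X_scaled)
# If the cluster numbering comes out flipped relative to y, remap it, e.g. y_kmeans = 1 - y_kmeans

# Compare cluster assignments with the true labels
print(confusion_matrix(y, y_kmeans))
print(classification_report(y, y_kmeans))

# Reduce to 2D with PCA for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
df = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
df['Cluster'] = y_kmeans
df['True Label'] = y

# Cluster assignments in PCA space, with centroids projected as red X markers
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='PC1', y='PC2', hue='Cluster', palette='Set1', alpha=0.7)
centroids_pca = pca.transform(kmeans.cluster_centers_)
plt.scatter(centroids_pca[:, 0], centroids_pca[:, 1],
            marker='X', s=200, c='red', label='Centroids')
plt.title('K-Means Clustering of the Breast Cancer Dataset (PCA projection)')
plt.legend()
plt.show()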
OUTPUT:
Confusion Matrix:
175: Correctly classified as Class 0 (True Negatives).
37: Incorrectly classified as Class 1 (False Positives).
13: Incorrectly classified as Class 0 (False Negatives).
344: Correctly classified as Class 1 (True Positives).
• There are 37 False Positives → Instances that were actually Class 0 but predicted as
Class 1.
• There are 13 False Negatives → Instances that were actually Class 1 but predicted as
Class 0.
• The model correctly classified 519 out of 569 cases, leading to high accuracy.
Classification Report:
• Precision (0.93 for Class 0, 0.90 for Class 1)
o For Class 0: When the model predicts 0, it is correct 93% of the time.
o For Class 1: When the model predicts 1, it is correct 90% of the time.
• Recall (0.83 for Class 0, 0.96 for Class 1)
o For Class 0: It correctly identifies 83% of all actual 0s.
o For Class 1: It correctly identifies 96% of all actual 1s.
Accuracy:
The model correctly classified 91% of all instances, which is quite high. Given the class
imbalance (212 vs. 357 cases), a high accuracy confirms that the clustering model is
performing well.
K-Means successfully found two distinct clusters, meaning the dataset has inherent groupings
that can be separated(Figure10.2). Some misclassifications exist, as seen from overlapping
points in the center. PCA helped visualize the data, but clustering might still have errors due
to the loss of higher-dimensional information. Further evaluation (e.g., comparing clusters
with true labels, using other clustering algorithms like DBSCAN) can improve performance.
Figure10.3 shows that the data is well-structured and can be effectively separated using a
linear boundary. PCA helped in visualization but does not replace the full dataset’s
complexity. K-Means clustering was mostly accurate but had some misclassifications.
Further improvements could be achieved using supervised learning models like SVM,
Random Forest, or Deep Learning instead of K-Means.
K-Means clustering was quite effective in separating the two classes, even though it is an
unsupervised learning technique (Figure10.4). The centroids represent the mean position of
each cluster and indicate the average tumor characteristics for each group. For better
classification accuracy, a supervised learning model (like Logistic Regression or SVM)
would likely perform better.
Q11: What are the limitations of the Find-S algorithm?
A: It only works for consistent datasets with no contradictions and assumes a single target
concept.
Q12: How does Find-S handle negative examples?
A: It completely ignores negative examples and only generalizes from positive examples.
Q13: What is the significance of the ‘k’ value in KNN?
A: The value of k determines the number of nearest neighbors considered; a low value
increases sensitivity to noise, while a high value may oversmooth the decision boundary.
Q14: What distance metrics are commonly used in KNN?
A: Euclidean, Manhattan, and Minkowski distances.
Q15: How does KNN handle imbalanced datasets?
A: Weighting neighbors by distance or using oversampling techniques can improve results on
imbalanced data.
Q16: How is Locally Weighted Regression different from standard regression?
A: LWR gives higher weights to nearby data points, making predictions more localized.
Q17: What kernel functions are used in LWR?
A: Common choices are Gaussian, Epanechnikov, and Triangular kernels.
Q18: When should LWR be preferred over standard linear regression?
A: When the relationship between variables is non-linear and varies across the dataset.
Q19: What is the difference between linear and polynomial regression?
A: Linear regression fits a straight line, whereas polynomial regression fits a curved line
by introducing higher-degree terms.
Q20: How do we evaluate regression models?
A: Metrics such as Mean Squared Error (MSE), R-squared, and Mean Absolute Error
(MAE).
Q21: Why do we use feature scaling in regression?
A: Scaling ensures equal weight for all features, preventing domination by features with
larger magnitudes.
Q22: How does a decision tree make predictions?
A: It splits data based on feature conditions using criteria like Gini impurity or information
gain.
Q23: What are the advantages and disadvantages of decision trees?
A:
• Advantages: Easy to interpret, handles both numerical & categorical data.