ML Lab Manual
6th-SEM VTU
Lab Manual 2025
Created and curated by Certisured for VTU Syllabus
9606698866 | 8988897979
[email protected]
Introduction
Data visualization is a crucial step in exploratory data analysis (EDA), enabling data scientists
to understand the distribution and spread of numerical features. Two widely used
visualization techniques for analyzing numerical data are histograms and box plots. These
plots help identify patterns, trends, and potential anomalies in datasets, making them
valuable tools for data preprocessing and feature engineering.
Distribution
In statistics, distribution refers to how data values are spread across a range. Understanding
the distribution of numerical features in a dataset helps in identifying patterns, detecting
outliers, and making informed decisions. The two primary ways to visualize distribution are
histograms and box plots.
1. Histograms
A histogram is a graphical representation of the distribution of a numerical feature. It divides
the data into bins (intervals) and counts the number of observations in each bin.
Importance of Histograms
Histograms reveal the overall shape of a distribution (symmetric, skewed, or multimodal), where values are concentrated, and whether there are gaps or unusually extreme values.
2. Box Plots
A box plot summarizes a numerical feature using five values: the minimum, the first quartile (Q1: 25th percentile), the median (Q2: 50th percentile), the third quartile (Q3: 75th percentile), and the maximum, with whiskers typically extending 1.5 times the IQR beyond the quartiles.
Importance of Box Plots
Identifying Outliers: Points lying outside the whiskers indicate potential outliers.
Comparing Distributions: Box plots allow easy comparison of multiple features or
groups.
Measuring Data Spread: The length of the box and whiskers provides insight into data
variability.
Understanding Skewness: If the median is closer to one end, the distribution may be
skewed.
Outlier
An outlier is an observation or data point that significantly differs from the rest of the data in
a dataset. Outliers can skew statistical analyses and distort the interpretation of results,
making it important to identify and understand them.
Plotting the data using graphs like box plots, scatter plots, or histograms can reveal
observations that stand out from the majority.
Statistical Methods:
Z-Score: Identifying data points with z-scores beyond a certain threshold (e.g., |z| >
3) as potential outliers.
Z = (x-µ)/σ
Interquartile Range (IQR): Using the IQR to identify observations outside a defined
range.
IQR = Q3 - Q1
LF (Lower Fence) = Q1 - (1.5*IQR)
UF (Upper Fence) = Q3 + (1.5*IQR)
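As a quick illustration, both rules can be applied with a few lines of pandas. The Series below is made-up toy data, and a z threshold of 2 is used only because the sample is tiny; on real data the usual |z| > 3 cut-off applies.
import numpy as np
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])   # toy data; 95 is an extreme value

# Z-score rule: flag points whose |z| exceeds a chosen threshold
z = (s - s.mean()) / s.std()
print(s[np.abs(z) > 2])                    # flags the value 95

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lf, uf = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(s[(s < lf) | (s > uf)])              # also flags 95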
Removing Outliers:
Removing outliers involves excluding extreme values from the dataset before analysis.
Common methods include using statistical criteria (e.g., Z-scores, IQR) to identify and
exclude observations beyond a certain threshold.
Advantage: Reduces the impact of extreme values on summary statistics and model results.
Disadvantage (loss of information): Excluding outliers may discard meaningful data points.
Transformation:
Applying a mathematical transformation (e.g., log or square root) compresses extreme values, reducing their influence without discarding any data.
About Dataset
Context
This is the dataset used in the second chapter of Aurélien Géron's book 'Hands-On Machine Learning with Scikit-Learn and TensorFlow'. It serves as an excellent introduction to implementing machine learning algorithms because it requires rudimentary data cleaning, has an easily understandable list of variables, and sits at an optimal size between being too toyish and too cumbersome.
The data contains information from the 1990 California census. So although it may not help
you with predicting current housing prices like the Zillow Zestimate dataset, it does provide
an accessible introductory dataset for teaching people about the basics of machine learning.
Content
The data pertains to the houses found in a given California district and some summary stats
about them based on the 1990 census data. Be warned the data aren't cleaned so there are
some preprocessing steps required! The columns are as follows; their names are fairly self-explanatory:
longitude
latitude
housing_median_age
total_rooms
total_bedrooms
population
households
median_income
median_house_value (Target)
ocean_proximity
Pandas and NumPy are used for data manipulation and numerical calculations.
import warnings
warnings.filterwarnings('ignore')
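The imports and the cell that actually loads the data are not visible in the extracted notebook; a minimal sketch, assuming the Kaggle housing.csv export of the 1990 census data (the file name is an assumption):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the 1990 California census housing data
df = pd.read_csv('housing.csv')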
In [20]: df.head()
In [21]: df.shape
In [22]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 20640 non-null float64
1 latitude 20640 non-null float64
2 housing_median_age 20640 non-null float64
3 total_rooms 20640 non-null float64
4 total_bedrooms 20433 non-null float64
5 population 20640 non-null float64
6 households 20640 non-null float64
7 median_income 20640 non-null float64
8 median_house_value 20640 non-null float64
9 ocean_proximity 20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
In [23]: df.nunique()
Out[23]: longitude 844
latitude 862
housing_median_age 52
total_rooms 5926
total_bedrooms 1923
population 3888
households 1815
median_income 12928
median_house_value 3842
ocean_proximity 5
dtype: int64
Data Cleaning
In [24]: df.isnull().sum()
Out[24]: longitude 0
latitude 0
housing_median_age 0
total_rooms 0
total_bedrooms 207
population 0
households 0
median_income 0
median_house_value 0
ocean_proximity 0
dtype: int64
In [25]: df.duplicated().sum()
Out[25]: 0
In [26]: df['total_bedrooms'].median()
Out[26]: 435.0
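The imputation cell itself is not shown in the extraction; presumably the 207 missing total_bedrooms values are filled with this median before the integer conversion below, for example:
# Fill the 207 missing total_bedrooms values with the column median (435.0)
df['total_bedrooms'] = df['total_bedrooms'].fillna(df['total_bedrooms'].median())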
Feature Engineering
In [28]: for i in df.iloc[:,2:7]:
df[i] = df[i].astype('int')
In [29]: df.head()
Out[29]: longitude latitude housing_median_age total_rooms total_bedrooms population hou
Descriptive Statistics
In [30]: df.describe().T
Uni-Variate Analysis
In [32]: for col in Numerical:
    plt.figure(figsize=(10, 6))
    sns.histplot(df[col], kde=True)   # histogram of each numerical feature
    plt.title(col)
    plt.show()
1. Longitude:
The dataset contains houses located in specific regions (possibly coastal areas or urban
zones) as indicated by the bimodal peaks. Houses are not uniformly distributed across
all longitudes.
3. Housing Median Age:
Most houses are relatively older, with the majority concentrated in a specific range of
median ages. This might imply that housing development peaked during certain
decades.
4. Total Rooms:
The highly skewed distribution shows most houses have a lower total number of rooms.
A few properties with a very high number of rooms could represent outliers (e.g.,
mansions or multi-unit buildings).
5. Median Income:
Most households fall within a low-to-mid income bracket. The steep decline after the
peak suggests a small proportion of high-income households in the dataset.
6. Population:
Most areas in the dataset have a relatively low population. However, there are some
highly populated areas, as evidenced by the long tail. These may represent urban
centers.
7. Median House Value:
The sharp peak at the end of the histogram suggests that house prices in the dataset are capped at a maximum value, which could limit the variability in predictions.
for col in Numerical:
    plt.figure(figsize=(10, 6))
    sns.boxplot(df[col], color='blue')
    plt.title(col)
    plt.ylabel(col)
    plt.show()
Outlier Analysis for Each Feature:
1. Total Rooms: There are numerous data points above the upper whisker, indicating a
significant number of outliers.
2. Total Bedrooms: Numerous data points above the upper whisker indicate a significant
presence of outliers with very high total_bedrooms values.
3. Population: There are numerous outliers above the upper whisker, with extreme
population values reaching beyond 35,000.
4. Households: There is a significant number of outliers above the upper whisker. These
values represent areas with an unusually high number of households.
5. Median Income: There are numerous data points above the upper whisker, marked as
circles. These are considered potential outliers.
6. Median House Value: A small cluster of outliers is visible near the maximum
value of 500,000.
Experiment 2
Develop a program to Compute the correlation matrix to
understand the relationships between pairs of features.
Visualize the correlation matrix using a heatmap to know
which variables have strong positive/negative correlations.
Create a pair plot to visualize pairwise relationships
between features. Use California Housing dataset.
Introduction
In data analysis and machine learning, understanding the relationships between features is
crucial for feature selection, multicollinearity detection, and data interpretation. Correlation
and pair plots are two essential techniques to analyze these relationships.
1. Correlation Matrix
A correlation matrix is a table showing correlation coefficients between variables. It helps in
understanding how strongly features are related to each other.
Types of Correlation
Positive Correlation (0 to +1): As one feature increases, the other also increases.
Negative Correlation (0 to -1): As one feature increases, the other decreases.
No Correlation (≈0): No linear relationship between the variables.
2. Correlation Heatmap
A heatmap displays the correlation matrix as a color-coded grid, making it easy to see which pairs of features have strong positive or negative correlations.
3. Pair Plot
A pair plot (also known as a scatterplot matrix) is a collection of scatter plots for every pair
of numerical variables in the dataset. It helps in visualizing relationships between variables.
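The cell that loads the dataset is not visible in the extraction; presumably scikit-learn's built-in California Housing loader, which matches the feature names (MedInc, HouseAge, ...) shown in the outputs below:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing

# Load the California Housing dataset bundled with scikit-learn
data = fetch_california_housing()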
# Convert to DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['Target'] = data.target # Adding the target variable (median house value)
In [4]: # Basic Data Exploration
print("\nBasic Information about Dataset:")
print(df.info()) # Overview of dataset
print("\nFirst Five Rows of Dataset:")
print(df.head()) # Display first few rows
Longitude Target
0 -122.23 4.526
1 -122.22 3.585
2 -122.24 3.521
3 -122.25 3.413
4 -122.25 3.422
Summary Statistics:
MedInc HouseAge AveRooms AveBedrms Population \
count 20640.000000 20640.000000 20640.000000 20640.000000 20640.000000
mean 3.870671 28.639486 5.429000 1.096675 1425.476744
std 1.899822 12.585558 2.474173 0.473911 1132.462122
min 0.499900 1.000000 0.846154 0.333333 3.000000
25% 2.563400 18.000000 4.440716 1.006079 787.000000
50% 3.534800 29.000000 5.229129 1.048780 1166.000000
75% 4.743250 37.000000 6.052381 1.099526 1725.000000
max 15.000100 52.000000 141.909091 34.066667 35682.000000
The summary statistics table provides key percentiles and other descriptive metrics for each numerical feature:
- **25% (First Quartile - Q1):** The value below which 25% of the data falls. It helps in understanding the lower bound of typical data values.
- **50% (Median - Q2):** The middle value when the data is sorted. It provides the central tendency of the dataset.
- **75% (Third Quartile - Q3):** The value below which 75% of the data falls. It helps in identifying the upper bound of typical values in the dataset.
- These percentiles are useful for detecting skewness, understanding the data distribution, and identifying potential outliers (values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR).
Missing Values in Each Column:
MedInc 0
HouseAge 0
AveRooms 0
AveBedrms 0
Population 0
AveOccup 0
Latitude 0
Longitude 0
Target 0
dtype: int64
In [9]: # Correlation Matrix
plt.figure(figsize=(10, 6))
corr_matrix = df.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title("Feature Correlation Heatmap")
plt.show()
plt.show()
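The pair plot cell called for in the experiment statement is truncated in the extraction; a minimal sketch, assuming the df built above (the subset of columns is chosen here only to keep the plot readable):
# Pairwise scatter plots for a few features plus the target
sns.pairplot(df[['MedInc', 'AveRooms', 'HouseAge', 'Target']], diag_kind='kde')
plt.show()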
Key Insights:
1. The dataset has 20640 rows and 9 columns.
2. No missing values were found in the dataset.
3. Histograms show skewed distributions in some features like 'MedInc'.
4. Boxplots indicate potential outliers in 'AveRooms' and 'AveOccup'.
5. Correlation heatmap shows 'MedInc' has the highest correlation with house prices.
Experiment 3
Develop a program to implement Principal Component
Analysis (PCA) for reducing the dimensionality of the Iris
dataset from 4 features to 2.
Importance of PCA
Reduces computational complexity by lowering the number of features.
Helps in visualizing high-dimensional data.
Removes redundant or correlated features, improving model performance.
Reduces overfitting by eliminating noise in the data.
1. Standardization: The data is normalized so that all features have a mean of zero and a
standard deviation of one.
2. Compute the Covariance Matrix: This step helps in understanding how different
features relate to each other.
3. Eigenvalue & Eigenvector Calculation: Eigenvectors represent the direction of the
new feature axes, and eigenvalues determine the importance of these axes.
4. Selecting Principal Components: The eigenvectors corresponding to the highest
eigenvalues are chosen to form the new feature space.
5. Transforming Data: The original dataset is projected onto the new feature space with
reduced dimensions.
The Iris dataset consists of 4 numerical features (sepal length, sepal width, petal length,
petal width) used to classify flowers into 3 species (Setosa, Versicolor, and Virginica).
If PC1 explains 70% and PC2 explains 20%, then the first two principal components
capture 90% of the variance in the dataset.
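A minimal scikit-learn sketch of the reduction described above; for the standardized Iris data the printed ratios come out to roughly 0.73 and 0.23, matching the output reported at the end of this experiment:
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

iris = load_iris()
X, y = iris.data, iris.target                  # 150 samples, 4 features

X_scaled = StandardScaler().fit_transform(X)   # mean 0, std 1 per feature
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)            # project onto the first 2 PCs

print(pca.explained_variance_ratio_)           # approximately [0.73, 0.23]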
Benefits of PCA
Feature Reduction: Reduces the number of variables without significant loss of
information.
Noise Reduction: Removes redundant or less informative features.
Improved Visualization: Enables easier interpretation of high-dimensional data.
Better Model Performance: Enhances efficiency in training machine learning models.
#
# The goal of using PCA in this exercise is to reduce these four features into two principal components.
# This will help in visualizing the data better and understanding its underlying structure.
#
# Since humans struggle to visualize data in more than three dimensions, reducing the data to two dimensions lets us
# retain the most important patterns while making it easier to interpret. PCA helps with this by
# preserving as much variance as possible.
The Iris dataset consists of 4 features, which represent different physical characteristics of iris flowers: sepal length, sepal width, petal length, and petal width.
These features were chosen because they effectively differentiate between the three iris species (Setosa, Versicolor, and Virginica).
For the 3D plot below, three features are chosen arbitrarily for visualization, but all four features are used in the PCA computation.
Why is the Iris Dataset Important?
Since the dataset contains three classes (Setosa, Versicolor, and Virginica), PCA helps
visualize how well the classes can be separated in a lower-dimensional space.
# Step 2: Standardizing the Data
# PCA works best when data is standardized (mean = 0, variance = 1)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Step 9: Visualizing Eigenvectors Superimposed on 3D Data
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
for i in range(len(colors)):
    ax.scatter(X_scaled[y == i, 0], X_scaled[y == i, 1], X_scaled[y == i, 2], color=colors[i], label=f'Class {i}')
for i in range(3):  # Plot first three eigenvectors
    ax.quiver(0, 0, 0, eigenvectors[i, 0], eigenvectors[i, 1], eigenvectors[i, 2], color='k', length=2)
ax.set_xlabel('Sepal Length')
ax.set_ylabel('Sepal Width')
ax.set_zlabel('Petal Length')
ax.set_title('3D Data with Eigenvectors')
plt.legend()
plt.show()
# Recap:
# - The Iris dataset is historically important for testing classification models.
# - We standardized the data to ensure fair comparison across features.
# - We calculated the covariance matrix, eigenvalues, and eigenvectors.
# - PCA is built on SVD, which decomposes data into important components.
# - We visualized the original 3D data and superimposed eigenvectors.
# - We applied PCA to reduce the dimensionality from 4D to 2D.
# - Finally, we visualized the transformed data in 2D space.
Singular Values: [20.92306556 11.7091661 4.69185798 1.76273239]
Explained Variance by PC1: 0.73
Explained Variance by PC2: 0.23
Experiment 4
For a given set of training data examples stored in a .CSV
file, implement and demonstrate the Find-S algorithm to
output a description of the set of all hypotheses consistent
with the training examples.
1. Initialize the Hypothesis:
Start with the most specific hypothesis (i.e., all attributes set to the most restrictive value).
2. Iterate Through Each Training Example:
For every positive example, generalize any attribute in the hypothesis that does not match the example by replacing it with '?'; negative examples are ignored.
3. Output the Final Hypothesis:
After processing all positive examples, the final hypothesis represents the most specific generalization of the training data.
Training Dataset
The dataset contains five attributes:
No Bachelors Java 28 No
No Masters Python 35 No
3. Final Hypothesis
The final hypothesis is the most specific generalization covering all positive examples.
It represents a logical rule derived from the dataset.
Limitations of Find-S
Only considers positive examples: It ignores negative examples, which may lead to an
incomplete hypothesis.
Cannot handle noise or missing data: Works only when training data is perfect.
Finds only one hypothesis: Does not provide alternative consistent hypotheses.
The Find-S algorithm is a simple machine-learning algorithm used in concept learning. It finds the most specific hypothesis that is consistent with all positive examples in a given training dataset. The algorithm assumes that the target concept can be expressed as a conjunction of attribute values and that the training data is noise-free.
In [5]: print(data)
return hypothesis
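Only fragments of the notebook's implementation (print(data) and return hypothesis) survive the extraction; a minimal sketch of Find-S, assuming a CSV whose last column holds the Yes/No class label (the file name is hypothetical):
import pandas as pd

def find_s(csv_path):
    data = pd.read_csv(csv_path)
    attributes = data.iloc[:, :-1].values    # all columns except the label
    labels = data.iloc[:, -1].values         # last column: 'Yes' / 'No'

    hypothesis = None
    for row, label in zip(attributes, labels):
        if label != 'Yes':                   # Find-S ignores negative examples
            continue
        if hypothesis is None:               # first positive example: copy it
            hypothesis = list(row)
        else:                                # generalize attributes that differ
            hypothesis = [h if h == v else '?' for h, v in zip(hypothesis, row)]
    return hypothesis

print(find_s('training_data.csv'))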
Experiment 5
Develop a program to implement k-Nearest Neighbour
algorithm to classify the randomly generated 100 values of
x in the range of [0,1]. Perform the following based on
dataset generated.
Importance of k-NN
Simple and effective for classification tasks.
Non-parametric (makes no assumptions about the data distribution).
Handles multi-class classification with ease.
Working of the k-NN Algorithm
1. Choose a Value for k:
A small k (e.g., k=1) makes the model sensitive to noise and results in high variance.
A large k (e.g., k=30) smooths the decision boundary but may lead to high bias.
The optimal k is usually found by cross-validation.
2. Compute Distance Between Data Points: The algorithm relies on a distance metric to
determine similarity between data points. Common distance measures include:
Euclidean Distance (most commonly used)
Manhattan Distance
Minkowski Distance
3. Vote Among the k Nearest Neighbors: The predicted class is the majority class among the k closest training points.
Weighted Voting: Closer neighbors have a higher influence on the prediction than farther neighbors.
Classification is performed for multiple values of k :
k = 1, 2, 3, 4, 5, 20, 30
Observing how different values of k affect classification accuracy and decision
boundaries.
Advantages of k-NN
✔ Simple and easy to implement.
✔ No training phase—all computation happens during prediction.
✔ Works well for multi-class classification problems.
✔ Can model complex decision boundaries when k is appropriately chosen.
Limitations of k-NN
❌ Computationally expensive for large datasets.
❌ Performance depends on the choice of k.
❌ Sensitive to irrelevant or redundant features.
❌ Memory-intensive since all training data needs to be stored.
Problem Explanation
The goal is to implement the k-Nearest Neighbors (KNN) algorithm to classify 100 randomly
generated values in the range [0,1]. The classification process involves the following steps:
1. Generating the Data: Create 100 random values in the range [0, 1].
2. Labeling the First 50 Values: The first 50 values are labeled using the rule: if xi ≤ 0.5, assign it to Class1; otherwise, assign it to Class2.
3. Classifying the Remaining 50 Values Using KNN: The next 50 values (x51 to x100) are unlabeled. We use the KNN algorithm to classify these values based on their nearest neighbors among the first 50 labeled points.
import warnings
warnings.filterwarnings('ignore')
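The cell that generates the 100 random values is not shown in the extraction; with NumPy seeded at 42, the first values match those displayed in df.head() further below (0.374540, 0.950714, ...):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Generate 100 random values in the range [0, 1]
np.random.seed(42)
values = np.random.rand(100)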
In [3]: labels = []
for i in values[:50]:
if i <=0.5:
labels.append('Class1')
else:
labels.append('Class2')
In [5]: print(labels)
In [6]: data = {
"Point": [f"x{i+1}" for i in range(100)],
"Value": values,
"Label": labels
}
In [11]: df = pd.DataFrame(data)
df.head()
Out[11]: Point Value Label
0 x1 0.374540 Class1
1 x2 0.950714 Class2
2 x3 0.731994 Class2
3 x4 0.598658 Class2
4 x5 0.156019 Class1
In [12]: df.nunique()
In [13]: df.shape
Out[13]: (100, 3)
In [15]: print("\nSummary Statistics:")
df.describe().T
Summary Statistics:
Out[15]: count mean std min 25% 50% 75% max
In [25]: Summary_Statistics = """
- The 'Value' column has a mean of approximately 0.47, indicating that the values are uniformly distributed.
- The standard deviation of the 'Value' column is approximately 0.29, showing a moderate spread around the mean.
- The minimum value in the 'Value' column is approximately 0.0055, and the maximum value is approximately 0.9869.
- The first quartile (25th percentile) is approximately 0.19, the median (50th percentile) is approximately 0.47, and the third quartile (75th percentile) is approximately 0.73.
"""
print(Summary_Statistics)
- The 'Value' column has a mean of approximately 0.47, indicating that the values are uniformly distributed.
- The standard deviation of the 'Value' column is approximately 0.29, showing a moderate spread around the mean.
- The minimum value in the 'Value' column is approximately 0.0055, and the maximum value is approximately 0.9869.
- The first quartile (25th percentile) is approximately 0.19, the median (50th percentile) is approximately 0.47, and the third quartile (75th percentile) is approximately 0.73.
In [19]: # Inference for the above graph
inference = """
- The histograms for the distribution of features show that the values are uniformly distributed across the range [0, 1].
- This is expected as the values were generated using a uniform random distribution.
- There are no significant outliers or skewness in the data, indicating that the dataset is well-balanced.
"""
print(inference)
- The histograms for the distribution of features show that the values are uniformly distributed across the range [0, 1].
- This is expected as the values were generated using a uniform random distribution.
- There are no significant outliers or skewness in the data, indicating that the dataset is well-balanced.
In [94]: # Step 2: Perform KNN classification for different values of k
k_values = [1, 2, 3, 4, 5, 20, 30]
results = {}
accuracies = {}
# Calculate accuracy
accuracy = accuracy_score(true_labels, predictions) * 100
accuracies[k] = accuracy
print(f"Accuracy for k={k}: {accuracy:.2f}%")
In [97]: print(predictions)
['Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class1' 'Class1'
'Class1' 'Class1' 'Class1' 'Class1' 'Class2' 'Class1' 'Class1' 'Class2'
'Class1' 'Class2' 'Class1' 'Class2' 'Class2' 'Class1' 'Class1' 'Class2'
'Class2' 'Class2' 'Class2' 'Class1' 'Class1' 'Class1' 'Class2' 'Class2'
'Class1' 'Class1' 'Class1' 'Class1' 'Class2' 'Class2' 'Class2' 'Class1'
'Class1' 'Class2' 'Class2' 'Class2' 'Class2' 'Class1' 'Class2' 'Class1'
'Class1' 'Class1']
Out[98]: Point Value Label_k1 Label_k2 Label_k3 Label_k4 Label_k5 Label_k20 Label_k30
Key Insights:
The KNN classification was performed for different values of k: 1, 2, 3, 4, 5, 20, and 30.
The accuracy of the classification varied with the value of k.
For smaller values of k (1, 2, 3, 4, 5), the accuracy was relatively high, indicating that the
model was able to classify the points correctly.
As the value of k increased to 20 and 30, the accuracy decreased, suggesting that the
model's performance deteriorated with higher values of k.
This is expected as higher values of k can lead to over-smoothing, where the model
becomes less sensitive to the local structure of the data.
Overall, the KNN classifier performed well for smaller values of k, with the highest
accuracy observed for k=1.
Experiment 6
Implement the non-parametric Locally Weighted Regression
algorithm in order to fit data points. Select appropriate data
set for your experiment and draw graphs
A kernel function (e.g., the Gaussian kernel) is used to assign weights to data points:
w_i = exp( -(x_i - x)^2 / (2τ^2) )
Here, τ (tau) is the bandwidth parameter that controls the locality of weighting.
For a given query point x , assign weights to training points based on proximity.
3. Fit a Local Model
Solve a weighted least squares problem using the locally weighted dataset.
4. Make Predictions
Compute the predicted value at x using the locally trained model.
Dataset Selection
For this experiment, we need a dataset with a clear non-linear relationship between
independent and dependent variables. Some possible datasets include:
Limitations of Locally Weighted Regression
❌ Computationally expensive: Must compute a separate model for each query point.
❌ Sensitive to bandwidth parameter (τ): Choosing the wrong value can lead to
overfitting or underfitting.
❌ Not suitable for large datasets: As the dataset size increases, the algorithm becomes
impractical due to high computation time.
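The definition of the locally_weighted_regression helper called in the next cell is not visible in the extraction; a minimal sketch consistent with the weighted normal equation shown later in this experiment (theta = (X^T W X)^-1 X^T W y), together with the imports the cells below rely on:
import numpy as np
import matplotlib.pyplot as plt

def locally_weighted_regression(X, y, x_query, tau):
    # Add a bias (intercept) column to the inputs
    X_b = np.c_[np.ones(len(X)), X]
    x_q = np.array([1, x_query])

    # Gaussian kernel weights: points near x_query get weights close to 1
    weights = np.exp(-((X - x_query) ** 2) / (2 * tau ** 2))
    W = np.diag(weights)

    # Weighted least squares: theta = (X^T W X)^-1 X^T W y
    theta = np.linalg.inv(X_b.T @ W @ X_b) @ X_b.T @ W @ y
    return x_q @ theta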
# Dataset
X = np.array([1, 2, 3, 4, 5])
y = np.array([1, 2, 1.3, 3.75, 2.25])
# Query point
x_query = 3 # Point at which we perform LWR
# Bandwidth parameter
tau = 1.0
# Compute prediction
y_pred = locally_weighted_regression(X, y, x_query, tau)
# Visualizing
plt.figure(figsize=(8, 6))
plt.scatter(X, y, color='blue', label='Data Points')
plt.scatter(x_query, y_pred, color='red', label=f'Prediction at x={x_query}')
plt.legend()
plt.show()
theta = np.linalg.inv(X_b.T @ W @ X_b) @ X_b.T @ W @ y
# Complex Dataset
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([1, 3, 2, 4, 3.5, 5, 6, 7, 6.5, 8])
# Visualizing
plt.figure(figsize=(10, 6))
plt.scatter(X, y, color='blue', label='Data Points')
plt.plot(X_query, y_lin, color='black', linestyle='dashed', label='Simple Linear Regression')
plt.plot(X_query, y_lwr, color='red', label='Locally Weighted Regression')
plt.title("Comparison: Simple Linear Regression vs. Locally Weighted Regression")
plt.xlabel("X")
plt.ylabel("Y")
plt.legend()
plt.show()
In [5]: import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Complex Dataset
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([1, 3, 2, 4, 3.5, 5, 6, 7, 6.5, 8])
# Visualizing
plt.figure(figsize=(12, 8))
plt.scatter(X, y, color='blue', label='Data Points')
plt.plot(X_query, y_lin, color='black', linestyle='dashed', label='Simple Linear Regression')
The tau (τ) parameter in this code is the bandwidth of the Gaussian kernel, which controls how much influence nearby points have in Locally Weighted Regression (LWR): a small τ gives a very local, wiggly fit (risking overfitting), while a large τ weights distant points almost equally and approaches ordinary linear regression (risking underfitting).
Experiment 6 A
Comparison of Linear Regression, Polynomial Regression, and Locally Weighted Regression (LWR)
Introduction
Regression analysis is a fundamental technique in machine learning and statistics used for
modeling the relationship between a dependent variable and one or more independent
variables. Different types of regression models are used depending on the nature of the data
and the complexity of the relationship. The three primary regression techniques discussed
here are:
Linear Regression
Polynomial Regression
Locally Weighted Regression (LWR)
Each method has its own advantages and is suited for specific types of data and problem
domains.
Computational Cost: Low for Linear Regression, Moderate for Polynomial Regression, High for Locally Weighted Regression.
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from scipy.spatial.distance import cdist
# Load datasets
df_linear = pd.read_csv("linear_dataset.csv")
df_lwr = pd.read_csv("lwr_dataset.csv")
df_poly = pd.read_csv("polynomial_dataset.csv")
# Linear Regression
def linear_regression(df):
X, y = df[['X']], df['Y']
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)
plt.scatter(X, y, label='Data')
plt.plot(X, y_pred, color='red', label='Linear Regression')
plt.legend()
plt.title("Linear Regression")
plt.show()
linear_regression(df_linear)
for x in X_range:
x_vec = np.array([1, x]) # Intercept term
weights = gaussian_kernel(x, X_train[:, 1:], tau).flatten()
W = np.diag(weights)
plt.scatter(X_train[:, 1], y_train, label='Data')
plt.plot(X_range, y_pred, color='red', label='LWR')
plt.legend()
plt.title("Locally Weighted Regression")
plt.show()
locally_weighted_regression(df_lwr[['X']].values, df_lwr['Y'].values)
polynomial_regression(df_poly, degree=3)
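polynomial_regression is called above but its definition is lost in the extraction; a minimal sketch using the PolynomialFeatures / make_pipeline imports shown at the start of this experiment:
import numpy as np

def polynomial_regression(df, degree=3):
    X, y = df[['X']], df['Y']
    # Expand X into polynomial terms, then fit ordinary least squares
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    y_pred = model.predict(X)

    # Sort by X so the fitted curve is drawn left to right
    order = np.argsort(X.values.ravel())
    plt.scatter(X, y, label='Data')
    plt.plot(X.values.ravel()[order], y_pred[order], color='red',
             label=f'Polynomial Regression (degree={degree})')
    plt.legend()
    plt.title("Polynomial Regression")
    plt.show()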
Experiment 7 A:
Develop a program to demonstrate the working of Linear
Regression and Polynomial Regression. Use Boston Housing
Dataset for Linear Regression and Auto MPG Dataset (for
vehicle fuel efficiency prediction) for Polynomial
Regression.
Linear Regression
Definition
Linear Regression models the relationship between an independent variable ( x ) and a
dependent variable ( y ) using a straight-line equation:
y = mx + c
where:
2. Compute the cost function: Measures how well the model fits the data using the Mean Squared Error (MSE), MSE = (1/n) * Σ (y_i - ŷ_i)^2.
3. Optimize the model parameters: Uses Gradient Descent or other optimization
techniques to find the best m and c .
import warnings
warnings.filterwarnings('ignore')
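The import and data-loading cells are not part of the extracted text; a minimal sketch, assuming a Boston Housing CSV that still contains missing values, as the info() output below shows (the file name is hypothetical):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Boston Housing data; this particular CSV contains some missing values
data = pd.read_csv('HousingData.csv')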
In [4]: data.head()
Out[4]: CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B L
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
In [5]: data.shape
In [6]: data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CRIM 486 non-null float64
1 ZN 486 non-null float64
2 INDUS 486 non-null float64
3 CHAS 486 non-null float64
4 NOX 506 non-null float64
5 RM 506 non-null float64
6 AGE 486 non-null float64
7 DIS 506 non-null float64
8 RAD 506 non-null int64
9 TAX 506 non-null int64
10 PTRATIO 506 non-null float64
11 B 506 non-null float64
12 LSTAT 486 non-null float64
13 MEDV 506 non-null float64
dtypes: float64(12), int64(2)
memory usage: 55.5 KB
The dataset contains 506 entries and 14 columns, with 6 columns (CRIM, ZN, INDUS,
CHAS, AGE, LSTAT) having 20 missing values each.
Most columns are continuous (float64), while RAD and TAX are discrete (int64).
MEDV (median home value) is the target variable, likely influenced by features like RM
(average rooms) and LSTAT (lower-status population).
Missing values need to be addressed through imputation or by dropping rows with
missing data.
Exploratory analysis and modeling can help understand feature relationships and
predict MEDV.
In [7]: data.nunique()
Out[7]: CRIM 484
ZN 26
INDUS 76
CHAS 2
NOX 81
RM 446
AGE 348
DIS 412
RAD 9
TAX 66
PTRATIO 46
B 357
LSTAT 438
MEDV 229
dtype: int64
In [8]: data.CHAS.unique()
In [9]: data.ZN.unique()
Out[9]: array([ 18. , 0. , 12.5, 75. , 21. , 90. , 85. , 100. , 25. ,
17.5, 80. , nan, 28. , 45. , 60. , 95. , 82.5, 30. ,
22. , 20. , 40. , 55. , 52.5, 70. , 34. , 33. , 35. ])
Data Cleaning
Checking Null values
data.isnull() - Returns a DataFrame of the same shape as data, where each element is True if
it's NaN and False otherwise.
.sum() - Sums up the True values (which are treated as 1 in Python) column-wise, giving the
total count of missing values for each column.
In [10]: data.isnull().sum()
Out[10]: CRIM 20
ZN 20
INDUS 20
CHAS 20
NOX 0
RM 0
AGE 20
DIS 0
RAD 0
TAX 0
PTRATIO 0
B 0
LSTAT 20
MEDV 0
dtype: int64
In [11]: data.duplicated().sum()
Out[11]: 0
In [12]: df = data.copy()
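The cell that actually treats the missing values (In [13]) is not visible; presumably a simple median imputation, for example:
# Fill missing values in each affected column with that column's median
df = df.fillna(df.median())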
In [14]: df.isnull().sum()
Out[14]: CRIM 0
ZN 0
INDUS 0
CHAS 0
NOX 0
RM 0
AGE 0
DIS 0
RAD 0
TAX 0
PTRATIO 0
B 0
LSTAT 0
MEDV 0
dtype: int64
In [15]: df.head()
Out[15]: CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B L
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
In [17]: df.describe().T
Out[17]: count mean std min 25% 50% 75%
In [18]: for i in df.columns:
    plt.figure(figsize=(10, 4))
    plt.subplot(1, 2, 1)
    df[i].hist(bins=20, alpha=0.5, color='b', edgecolor='black')
    plt.title(f'Histogram of {i}')
    plt.xlabel(i)
    plt.ylabel('Frequency')
    plt.subplot(1, 2, 2)
    plt.boxplot(df[i], vert=False)
    plt.title(f'Boxplot of {i}')
    plt.show()
In [19]: corr = df.corr(method='pearson')
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.xticks(rotation=90, ha='right')
plt.yticks(rotation=0)
plt.title("Correlation Matrix Heatmap")
plt.show()
In [20]: X = df.drop('MEDV', axis=1) # All columns except 'MEDV'
y = df['MEDV'] # Target variable
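The scaling cell (In [21]) is missing from the extraction; the X_scaled used in the split below is presumably produced with StandardScaler:
from sklearn.preprocessing import StandardScaler

# Standardize the features to mean 0 and standard deviation 1
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)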
In [22]: # Split the data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
In [23]: # Initialize the linear regression model
model = LinearRegression()
Out[23]: LinearRegression()
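The training and evaluation cells that follow model initialization are not visible in the extraction; a minimal sketch of the usual next steps:
from sklearn.metrics import mean_squared_error, r2_score

# Train on the 80% split and evaluate on the held-out 20%
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("MSE:", mean_squared_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))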
Experiment 7 B
Develop a program to demonstrate the working of Linear
Regression and Polynomial Regression. Use Boston Housing
Dataset for Linear Regression and Auto MPG Dataset (for
vehicle fuel efficiency prediction) for Polynomial
Regression.
Polynomial Regression
Definition
Polynomial regression is a type of regression analysis used in statistics and machine learning
when the relationship between the independent variable (input) and the dependent variable
(output) is not linear. While simple linear regression models the relationship as a straight
line, polynomial regression allows for more flexibility by fitting a polynomial equation to the
data.
Applications of Polynomial Regression
Predicting fuel efficiency based on vehicle characteristics.
Modeling economic growth trends over time.
Analyzing the effect of temperature on crop yields.
Relationship Type: Linear regression assumes a straight-line relationship, while polynomial regression captures curved relationships.
Complexity: Linear regression is simple and easy to interpret, while polynomial regression is more flexible but may overfit.
import warnings
warnings.filterwarnings("ignore")
In [2]: sns.get_dataset_names()
Out[2]: ['anagrams',
'anscombe',
'attention',
'brain_networks',
'car_crashes',
'diamonds',
'dots',
'dowjones',
'exercise',
'flights',
'fmri',
'geyser',
'glue',
'healthexp',
'iris',
'mpg',
'penguins',
'planets',
'seaice',
'taxis',
'tips',
'titanic']
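The loading cell (In [3]) is not shown; the shape (398, 9) and the columns reported below match seaborn's built-in mpg dataset:
import seaborn as sns

# Load the Auto MPG data bundled with seaborn
data = sns.load_dataset('mpg')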
In [4]: data.head()
mpg cylinders displacement horsepower weight acceleration model_year origin
0 18.0 8 307.0 130.0 3504 12.0 70 usa
2 18.0 8 318.0 150.0 3436 11.0 70 usa
In [5]: data.shape
Out[5]: (398, 9)
In [6]: data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 mpg 398 non-null float64
1 cylinders 398 non-null int64
2 displacement 398 non-null float64
3 horsepower 392 non-null float64
4 weight 398 non-null int64
5 acceleration 398 non-null float64
6 model_year 398 non-null int64
7 origin 398 non-null object
8 name 398 non-null object
dtypes: float64(4), int64(3), object(2)
memory usage: 28.1+ KB
In [7]: data.nunique()
In [8]: data.horsepower.unique()
Out[8]: array([130., 165., 150., 140., 198., 220., 215., 225., 190., 170., 160.,
95., 97., 85., 88., 46., 87., 90., 113., 200., 210., 193.,
nan, 100., 105., 175., 153., 180., 110., 72., 86., 70., 76.,
65., 69., 60., 80., 54., 208., 155., 112., 92., 145., 137.,
158., 167., 94., 107., 230., 49., 75., 91., 122., 67., 83.,
78., 52., 61., 93., 148., 129., 96., 71., 98., 115., 53.,
81., 79., 120., 152., 102., 108., 68., 58., 149., 89., 63.,
48., 66., 139., 103., 125., 133., 138., 135., 142., 77., 62.,
132., 84., 64., 74., 116., 82.])
Data Cleaning
In [9]: data.isnull().sum()
Out[9]: mpg 0
cylinders 0
displacement 0
horsepower 6
weight 0
acceleration 0
model_year 0
origin 0
name 0
dtype: int64
In [10]: data.duplicated().sum()
Out[10]: 0
Data Handling
In [11]: df = data.copy()
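The cell that treats the 6 missing horsepower values (In [12]) is not visible in the extraction; presumably a median fill, for example:
# Replace the 6 missing horsepower values with the column median
df['horsepower'] = df['horsepower'].fillna(df['horsepower'].median())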
Descriptive Statistics
In [13]: df.describe().T
EDA
In [14]: numerical = df.select_dtypes(include=['int','float']).columns
categorical = df.select_dtypes(include=['object']).columns
print(numerical)
print(categorical)
In [15]: for i in numerical:
plt.figure(figsize=(10,4))
plt.subplot(1, 2, 1)
df[i].hist(bins=20, alpha=0.5, color='b',edgecolor='black')
plt.title(f'Histogram of {i}')
plt.xlabel(i)
plt.ylabel('Frequency')
plt.subplot(1, 2, 2)
plt.boxplot(df[i], vert=False)
plt.title(f'Boxplot of {i}')
plt.show()
In [16]: import seaborn as sns
for col in categorical:
plt.figure(figsize=(6, 6))
sns.countplot(x=col, data=df, order=df[col].value_counts().sort_values().head(10).index)
plt.title(f'Countplot of {col}')
plt.xticks(rotation=90)
plt.show()
In [17]: corr_data = df[numerical].corr(method='pearson')
plt.figure(figsize=(10, 8))
sns.heatmap(corr_data, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.xticks(rotation=90, ha='right')
plt.yticks(rotation=0)
plt.title("Correlation Matrix Heatmap")
plt.show()
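The cells that select the features, split the data, and build the polynomial features are not part of the extracted text; a minimal sketch, assuming horsepower is used to predict mpg with a degree-2 expansion (both choices are assumptions):
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

X = df[['horsepower']]   # assumed predictor
y = df['mpg']            # fuel-efficiency target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Expand the predictor into polynomial terms
poly = PolynomialFeatures(degree=2)
X_poly_train = poly.fit_transform(X_train)
X_poly_test = poly.transform(X_test)

model = LinearRegression()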
model.fit(X_poly_train, y_train)
Out[21]: LinearRegression()
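The evaluation cells are likewise missing; a minimal sketch, assuming the X_poly_test matrix built above:
from sklearn.metrics import mean_squared_error, r2_score

# Evaluate the polynomial model on the held-out test split
y_pred = model.predict(X_poly_test)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))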
Experiment 8
Develop a program to demonstrate the working of the
decision tree algorithm. Use Breast Cancer Data set for
building the decision tree and applying this knowledge to
classify a new sample.
Decision trees work by recursively splitting data into subsets based on the most significant
feature, ensuring maximum information gain at each step.
Gini Impurity
Gini = 1 - Σ (p_i)^2
Measures how often a randomly chosen sample would be misclassified; splits are chosen to keep the resulting subsets as pure as possible.
Entropy / Information Gain
Entropy = -Σ p_i log2(p_i)
Measures the uncertainty in a dataset; splits are selected to maximize information gain.
Chi-Square Test
Evaluates the statistical significance of the feature split.
The dataset is divided into subsets based on the selected feature.
The process continues recursively until:
A stopping condition is met (e.g., pure classification, max depth).
The tree reaches a predefined depth.
3. Making Predictions
For a new sample, traverse the tree from the root to a leaf node.
The leaf node contains the predicted class label.
Pre-Pruning: Stop the tree early using conditions (e.g., min samples per split).
Post-Pruning: Remove unnecessary branches after the tree is built.
2. Setting Tree Depth
In [40]: # Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
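The cell that reads the data is not visible in the extraction; presumably the Breast Cancer Wisconsin (Diagnostic) CSV (the file name is hypothetical):
# Breast Cancer Wisconsin (Diagnostic) data
data = pd.read_csv('breast_cancer.csv')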
In [11]: data.head()
In [7]: data.shape
In [12]: data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 569 non-null int64
1 diagnosis 569 non-null object
2 radius_mean 569 non-null float64
3 texture_mean 569 non-null float64
4 perimeter_mean 569 non-null float64
5 area_mean 569 non-null float64
6 smoothness_mean 569 non-null float64
7 compactness_mean 569 non-null float64
8 concavity_mean 569 non-null float64
9 concave_points_mean 569 non-null float64
10 symmetry_mean 569 non-null float64
11 fractal_dimension_mean 569 non-null float64
12 radius_se 569 non-null float64
13 texture_se 569 non-null float64
14 perimeter_se 569 non-null float64
15 area_se 569 non-null float64
16 smoothness_se 569 non-null float64
17 compactness_se 569 non-null float64
18 concavity_se 569 non-null float64
19 concave_points_se 569 non-null float64
20 symmetry_se 569 non-null float64
21 fractal_dimension_se 569 non-null float64
22 radius_worst 569 non-null float64
23 texture_worst 569 non-null float64
24 perimeter_worst 569 non-null float64
25 area_worst 569 non-null float64
26 smoothness_worst 569 non-null float64
27 compactness_worst 569 non-null float64
28 concavity_worst 569 non-null float64
29 concave_points_worst 569 non-null float64
30 symmetry_worst 569 non-null float64
31 fractal_dimension_worst 569 non-null float64
dtypes: float64(30), int64(1), object(1)
memory usage: 142.4+ KB
In [13]: data.diagnosis.unique()
Data Preprocessing
Data Cleaning
In [14]: data.isnull().sum()
Out[14]: id 0
diagnosis 0
radius_mean 0
texture_mean 0
perimeter_mean 0
area_mean 0
smoothness_mean 0
compactness_mean 0
concavity_mean 0
concave_points_mean 0
symmetry_mean 0
fractal_dimension_mean 0
radius_se 0
texture_se 0
perimeter_se 0
area_se 0
smoothness_se 0
compactness_se 0
concavity_se 0
concave_points_se 0
symmetry_se 0
fractal_dimension_se 0
radius_worst 0
texture_worst 0
perimeter_worst 0
area_worst 0
smoothness_worst 0
compactness_worst 0
concavity_worst 0
concave_points_worst 0
symmetry_worst 0
fractal_dimension_worst 0
dtype: int64
In [15]: data.duplicated().sum()
Out[15]: np.int64(0)
Descriptive Statistics
In [18]: df.describe().T
Out[18]: count mean std min 25% 50% 75% max
Model Building
In [29]: # Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Out[30]: DecisionTreeClassifier(criterion='entropy')
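The cells that encode the diagnosis labels, define the X and y used in the split above, and fit the classifier are not visible in the extraction; a minimal sketch consistent with the entropy criterion and the 0/1 predictions reported below (the label mapping is an assumption):
from sklearn.tree import DecisionTreeClassifier

# Encode the diagnosis labels: malignant -> 1, benign -> 0 (assumed mapping)
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})

X = df.drop(['id', 'diagnosis'], axis=1)   # the 30 numeric features
y = df['diagnosis']

# Entropy-based decision tree, matching the Out[30] repr above
model = DecisionTreeClassifier(criterion='entropy')
model.fit(X_train, y_train)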
for feature in X:
ig = information_gain(df,feature,'diagnosis')
print(f"Information Gain for {feature}: {ig}")
Information Gain for radius_mean: 0.8607815854835991
Information Gain for texture_mean: 0.8357118798482908
Information Gain for perimeter_mean: 0.9267038614138748
Information Gain for area_mean: 0.9280305529818247
Information Gain for smoothness_mean: 0.7761788341876101
Information Gain for compactness_mean: 0.9091291689709926
Information Gain for concavity_mean: 0.9350604299589776
Information Gain for concave_points_mean: 0.9420903069361305
Information Gain for symmetry_mean: 0.735036638169654
Information Gain for fractal_dimension_mean: 0.8361770160635639
Information Gain for radius_se: 0.9337337383910278
Information Gain for texture_se: 0.8642965239721755
Information Gain for perimeter_se: 0.9315454914704012
Information Gain for area_se: 0.925377169845925
Information Gain for smoothness_se: 0.9350604299589776
Information Gain for compactness_se: 0.9231889229252984
Information Gain for concavity_se: 0.9280305529818247
Information Gain for concave_points_se: 0.8585933385629725
Information Gain for symmetry_se: 0.8181371874054084
Information Gain for fractal_dimension_se: 0.9174857375160954
Information Gain for radius_worst: 0.9003074642106167
Information Gain for texture_worst: 0.8634349686194988
Information Gain for perimeter_worst: 0.8985843535052632
Information Gain for area_worst: 0.9350604299589776
Information Gain for smoothness_worst: 0.7197189097252679
Information Gain for compactness_worst: 0.9183472928687721
Information Gain for concavity_worst: 0.9302187999024514
Information Gain for concave_points_worst: 0.9148323543801957
Information Gain for symmetry_worst: 0.8453951399613433
Information Gain for fractal_dimension_worst: 0.8915544765281104
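The information_gain helper used above is not part of the extracted text; a minimal entropy-based sketch that treats every distinct feature value as its own category, which is consistent with the near-maximal gains printed for these continuous features:
import numpy as np

def entropy(series):
    # Shannon entropy of the label distribution in a pandas Series
    probs = series.value_counts(normalize=True)
    return -np.sum(probs * np.log2(probs))

def information_gain(data, feature, target):
    # IG = H(target) - sum over feature values v of p(v) * H(target | v)
    total_entropy = entropy(data[target])
    weighted_entropy = 0.0
    for value, subset in data.groupby(feature):
        weighted_entropy += (len(subset) / len(data)) * entropy(subset[target])
    return total_entropy - weighted_entropy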
Out[36]: array([0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0,
1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1,
0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0,
1, 0, 0, 1])
Accuracy: 94.73684210526315
Classification Report:
precision recall f1-score support
In [45]: df.head(1)
In [44]: new = [[12.5, 19.2, 80.0, 500.0, 0.085, 0.1, 0.05, 0.02, 0.17, 0.06,
0.4, 1.0, 2.5, 40.0, 0.006, 0.02, 0.03, 0.01, 0.02, 0.003,
16.0, 25.0, 105.0, 900.0, 0.13, 0.25, 0.28, 0.12, 0.29, 0.08]]
y_pred = model.predict(new)
Prediction: Benign
Experiment 9
Develop a program to implement the Naive Bayesian
classifier, considering the Olivetti Face Data set for training.
Compute the accuracy of the classifier, considering a few
test data set.
The Olivetti Face Dataset is a collection of images of faces, used primarily for face
recognition tasks. The dataset contains 400 images of 40 different individuals, with 10
images per person. The dataset was created for research in machine learning and pattern
recognition, especially in the context of facial recognition.
40 People: The dataset contains 40 different individuals.
Images per Person: Each individual has 10 different images.
Image Size: Each image is 64x64 pixels, resulting in 4096 features (flattened vector) per image.
Target Labels: Each image is associated with a label representing the individual (0 to 39).
The Naive Bayes classifier is widely used for text classification, spam detection, medical diagnosis, and facial recognition.
Bayes' Theorem
The core idea of the Naïve Bayes classifier is based on Bayes' Theorem, which states:
P(A|B) = P(B|A) * P(A) / P(B)
where:
P(A|B) → Probability of hypothesis A (class) given evidence B (features).
P(B|A) → Probability of evidence B given hypothesis A.
P(A) → Prior probability of class A .
P(B) → Prior probability of feature B.
2. Prediction Phase
For a new test sample, calculate posterior probabilities for each class.
Assign the class with the highest probability to the test sample.
2. Multinomial Naïve Bayes
Used for discrete feature values, especially in text classification (e.g., spam filtering).
Works well with word frequency counts.
3. Bernoulli Naïve Bayes
Performance Evaluation
To assess the classifier's accuracy, metrics such as the accuracy score, the confusion matrix, and per-class ROC AUC (reported below) are used.
In [59]: data.keys()
def print_faces(images, target, top_n):
    # Set up figure size based on the number of images
    grid_size = int(np.ceil(np.sqrt(top_n)))
    fig, axes = plt.subplots(grid_size, grid_size, figsize=(15, 15))
    fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.2, wspace=0.2)
for i, ax in enumerate(axes.ravel()):
if i < top_n:
ax.imshow(images[i], cmap='bone')
ax.axis('off')
ax.text(2, 12, str(target[i]), fontsize=9, color='red')
ax.text(2, 55, f"face: {i}", fontsize=9, color='blue')
else:
ax.axis('off')
plt.show()
In [62]: print_faces(data.images,data.target,400)
In [63]: #let us extract unique charaters present in dataset
def display_unique_faces(pics):
fig = plt.figure(figsize=(24, 10)) # Set figure size
columns, rows = 10, 4 # Define grid dimensions
In [64]: display_unique_faces(data.images)
print("x_train: ",x_train.shape)
print("x_test: ",x_test.shape)
# Predict the test set results
y_pred = nb.predict(x_test)
# Calculate accuracy
nb_accuracy = round(accuracy_score(y_test, y_pred) * 100, 2)
Confusion Matrix:
[[3 0 0 ... 0 0 0]
[0 1 0 ... 0 0 0]
[0 0 1 ... 0 0 0]
...
[0 0 0 ... 2 0 0]
[0 0 0 ... 0 3 0]
[1 0 0 ... 0 0 1]]
Naive Bayes Accuracy: 73.33%
# Calculate accuracy
accuracy = round(accuracy_score(y_test, y_pred) * 100, 2)
print(f"Multinomial Naive Bayes Accuracy: {accuracy}%")
plt.axis('off')
plt.show()
Class 0 AUC: 0.92
Class 1 AUC: 1.00
Class 2 AUC: 1.00
Class 3 AUC: 1.00
Class 4 AUC: 1.00
Class 5 AUC: 1.00
Class 6 AUC: 1.00
Class 7 AUC: 1.00
Class 8 AUC: 1.00
Class 9 AUC: 1.00
Class 10 AUC: 1.00
Class 11 AUC: 1.00
Class 12 AUC: 0.87
Class 13 AUC: 1.00
Class 14 AUC: 1.00
Class 15 AUC: 1.00
Class 16 AUC: 0.65
Class 17 AUC: 0.16
Class 18 AUC: 0.36
Class 19 AUC: 0.89
Class 20 AUC: 0.52
Class 21 AUC: 0.81
Class 22 AUC: 0.13
Class 23 AUC: 0.34
Class 24 AUC: 0.64
Class 25 AUC: 0.55
Class 26 AUC: 0.48
Class 27 AUC: 0.38
Class 28 AUC: 0.62
Class 29 AUC: 0.73
Class 30 AUC: 0.55
Class 31 AUC: 0.17
Class 32 AUC: 0.47
Class 33 AUC: 0.67
Class 34 AUC: 0.31
Class 35 AUC: 0.03
Class 36 AUC: 0.91
Class 37 AUC: 0.87
Class 38 AUC: 0.47
Experiment 10
Develop a program to implement k-means clustering using
Wisconsin Breast Cancer data set and visualize the
clustering result.
One of the most widely used clustering algorithms is K-Means Clustering, which divides the
dataset into K clusters, where each data point belongs to the nearest cluster center.
Mathematical Representation
The objective is to minimize the sum of squared distances (SSD) between data points and their assigned cluster centroid:
J = Σ_{i=1..K} Σ_{xj ∈ Ci} || xj - μi ||^2
where:
K = Number of clusters
xj = Data point
μi = Centroid of cluster Ci
1. Elbow Method: Plot the within-cluster sum of squares (WCSS) for increasing values of k and choose the k at the 'elbow', where adding more clusters stops reducing WCSS substantially.
2. Distance Measures: Besides the default Euclidean distance, alternatives include:
Manhattan Distance
Cosine Similarity
Mahalanobis Distance
Challenges of K-Means Clustering
❌ Sensitive to Initial Centroid Selection – Different initializations may lead to different
results.
❌ Not Suitable for Non-Spherical Clusters – Assumes clusters are circular and evenly
sized.
❌ Outliers Affect Centroids – Presence of outliers can distort clustering results.
Visualization of Clusters
After applying K-Means Clustering, the results can be visualized using:
import warnings
warnings.filterwarnings('ignore')
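The imports and the cell that reads the CSV are not part of the extracted text; a minimal sketch, assuming the Kaggle Breast Cancer Wisconsin (Diagnostic) file (the file name is an assumption):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Wisconsin Breast Cancer (Diagnostic) data
data = pd.read_csv('data.csv')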
In [18]: data.head()
Out[18]:
id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean ...
5 rows × 33 columns
In [19]: data.shape
In [20]: data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 569 non-null int64
1 diagnosis 569 non-null object
2 radius_mean 569 non-null float64
3 texture_mean 569 non-null float64
4 perimeter_mean 569 non-null float64
5 area_mean 569 non-null float64
6 smoothness_mean 569 non-null float64
7 compactness_mean 569 non-null float64
8 concavity_mean 569 non-null float64
9 concave points_mean 569 non-null float64
10 symmetry_mean 569 non-null float64
11 fractal_dimension_mean 569 non-null float64
12 radius_se 569 non-null float64
13 texture_se 569 non-null float64
14 perimeter_se 569 non-null float64
15 area_se 569 non-null float64
16 smoothness_se 569 non-null float64
17 compactness_se 569 non-null float64
18 concavity_se 569 non-null float64
19 concave points_se 569 non-null float64
20 symmetry_se 569 non-null float64
21 fractal_dimension_se 569 non-null float64
22 radius_worst 569 non-null float64
23 texture_worst 569 non-null float64
24 perimeter_worst 569 non-null float64
25 area_worst 569 non-null float64
26 smoothness_worst 569 non-null float64
27 compactness_worst 569 non-null float64
28 concavity_worst 569 non-null float64
29 concave points_worst 569 non-null float64
30 symmetry_worst 569 non-null float64
31 fractal_dimension_worst 569 non-null float64
32 Unnamed: 32 0 non-null float64
dtypes: float64(31), int64(1), object(1)
memory usage: 146.8+ KB
In [21]: data.diagnosis.unique()
Data Preprocessing
Data Cleaning
In [22]: data.isnull().sum()
Out[22]: id 0
diagnosis 0
radius_mean 0
texture_mean 0
perimeter_mean 0
area_mean 0
smoothness_mean 0
compactness_mean 0
concavity_mean 0
concave points_mean 0
symmetry_mean 0
fractal_dimension_mean 0
radius_se 0
texture_se 0
perimeter_se 0
area_se 0
smoothness_se 0
compactness_se 0
concavity_se 0
concave points_se 0
symmetry_se 0
fractal_dimension_se 0
radius_worst 0
texture_worst 0
perimeter_worst 0
area_worst 0
smoothness_worst 0
compactness_worst 0
concavity_worst 0
concave points_worst 0
symmetry_worst 0
fractal_dimension_worst 0
Unnamed: 32 569
dtype: int64
In [23]: data.duplicated().sum()
Out[23]: np.int64(0)
Descriptive Statistics
In [26]: df.describe().T
Out[26]: count mean std min 25% 50% 75% max
Standardized the features to have mean 0 and standard deviation 1 (important for PCA and
K-Means).
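The scaling and PCA cells themselves are not visible in the extraction; a minimal sketch, assuming the id, diagnosis, and empty Unnamed: 32 columns are dropped first (the exact column handling is an assumption):
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Keep only the numeric feature columns
X = df.drop(columns=['id', 'diagnosis', 'Unnamed: 32'], errors='ignore')

# Standardize: mean 0, standard deviation 1 for every feature
X_scaled = StandardScaler().fit_transform(X)

# Reduce to 2 principal components for clustering and visualization
X_pca = PCA(n_components=2).fit_transform(X_scaled)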
In [48]: #Use the Elbow Method to determine the optimal number of clusters
wcss = [] # Within-Cluster Sum of Squares
K_range = range(1, 11)
for k in K_range:
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
kmeans.fit(X_pca)
wcss.append(kmeans.inertia_) # Append the inertia (sum of squared distances)
In [50]: #Apply K-Means Clustering with the optimal k (usually where elbow occurs, k=2)
optimal_k = 2
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_pca)
plt.legend()
plt.show()
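The body of the final visualization cell is truncated above; a minimal sketch of the cluster scatter plot, assuming X_pca, clusters, and kmeans from the cells above:
plt.figure(figsize=(8, 6))
# Plot the two PCA components, colored by K-Means cluster assignment
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=clusters, cmap='coolwarm', alpha=0.6)
# Mark the cluster centroids
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='black', marker='X', s=200, label='Centroids')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('K-Means Clusters (k=2) on PCA-Reduced Data')
plt.legend()
plt.show()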
The Elbow Method should show a bend at k=2, confirming that two clusters are optimal.
The final scatter plot should show two distinct clusters corresponding to malignant and
benign tumors.