ML Lab - BCSL606
LAB MANUAL
USN: ______________________________________________
1 Develop a program to create histograms for all numerical features and analyze the distribution of each feature. Generate box plots for all numerical features and identify any outliers. Use California Housing dataset.
Book 1: Chapter 2
2 Develop a program to compute the correlation matrix to understand the relationships between pairs of
features. Visualize the correlation matrix using a heatmap to know which variables have strong
positive/negative correlations. Create a pair plot to visualize pairwise relationships between features. Use
California Housing dataset.
Book 1: Chapter 2
3 Develop a program to implement Principal Component Analysis (PCA) for reducing the dimensionality of
the Iris dataset from 4 features to 2.
Book 1: Chapter 2
4 For a given set of training data examples stored in a .CSV file, implement and demonstrate the Find-S
algorithm to output a description of the set of all hypotheses consistent with the training examples.
Book 1: Chapter 3
5 Develop a program to implement k-Nearest Neighbour algorithm to classify the randomly generated 100
values of x in the range of [0,1]. Perform the following based on dataset generated.
a. Label the first 50 points {x1,……,x50} as follows: if (xi ≤ 0.5), then xi ∊ Class1, else xi ∊ Class2
b. Classify the remaining points, x51,……,x100 using KNN. Perform this for k=1,2,3,4,5,20,30
Book 2: Chapter – 2
6 Implement the non-parametric Locally Weighted Regression algorithm in order to fit data points. Select
appropriate data set for your experiment and draw graphs
Book 1: Chapter – 4
7 Develop a program to demonstrate the working of Linear Regression and Polynomial Regression. Use
Boston Housing Dataset for Linear Regression and Auto MPG Dataset (for vehicle fuel efficiency prediction)
for Polynomial Regression.
Book 1: Chapter – 5
8 Develop a program to demonstrate the working of the decision tree algorithm. Use Breast Cancer Data set
for building the decision tree and apply this knowledge to classify a new sample.
Book 2: Chapter – 3
9 Develop a program to implement the Naive Bayesian classifier considering Olivetti Face Data set for training.
Compute the accuracy of the classifier, considering a few test data sets.
Book 2: Chapter – 4
10 Develop a program to implement k-means clustering using Wisconsin Breast Cancer data set and visualize
the clustering result.
Book 2: Chapter – 4
Course outcomes (Course Skill Set):
At the end of the course the student will be able to:
● Illustrate the principles of multivariate data and apply dimensionality reduction techniques.
● Demonstrate similarity-based learning methods and perform regression analysis.
● Develop decision trees for classification and regression problems, and Bayesian models for probabilistic
learning.
● Implement clustering algorithms to group unlabeled data and interpret the resulting clusters.
Assessment Details (both CIE and SEE)
The weightage of Continuous Internal Evaluation (CIE) is 50% and for Semester End Exam (SEE) is 50%.
The minimum passing mark for the CIE is 40% of the maximum marks (20 marks out of 50) and for the
SEE minimum passing mark is 35% of the maximum marks (18 out of 50 marks). A student shall be
deemed to have satisfied the academic requirements and earned the credits allotted to each subject/
course if the student secures a minimum of 40% (40 marks out of 100) in the sum total of the CIE
(Continuous Internal Evaluation) and SEE (Semester End Examination) taken together.
Machine Learning Lab (BCSL606)
Program 1
Develop a program to create histograms for all numerical features and analyze the
distribution of each feature. Generate box plots for all numerical features and identify any
outliers. Use California Housing dataset.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
# Load the California Housing data (CSV filename assumed; a copy of the dataset with 10 columns including ocean_proximity)
df = pd.read_csv('housing.csv')
df.head()
df.shape
(20640, 10)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 20640 non-null float64
1 latitude 20640 non-null float64
2 housing_median_age 20640 non-null float64
3 total_rooms 20640 non-null float64
4 total_bedrooms 20433 non-null float64
5 population 20640 non-null float64
6 households 20640 non-null float64
df.nunique()
longitude 844
latitude 862
housing_median_age 52
total_rooms 5926
total_bedrooms 1923
population 3888
households 1815
median_income 12928
median_house_value 3842
ocean_proximity 5
dtype: int64
Data Cleaning
df.isnull().sum()
longitude 0
latitude 0
housing_median_age 0
total_rooms 0
total_bedrooms 207
population 0
households 0
median_income 0
median_house_value 0
ocean_proximity 0
dtype: int64
df.duplicated().sum()
df['total_bedrooms'].median()
435.0
Feature Engineering
# Fill the missing total_bedrooms values with the median (computed above) before casting to integers
df['total_bedrooms'] = df['total_bedrooms'].fillna(df['total_bedrooms'].median())
for i in df.iloc[:, 2:7]:
    df[i] = df[i].astype('int')
df.head()
Descriptive Statistics
df.describe().T
Numerical = df.select_dtypes(include=[np.number]).columns
print(Numerical)
Uni-Variate Analysis
for col in Numerical:
    plt.figure(figsize=(10, 6))
    sns.histplot(df[col], kde=True)   # plotting call reconstructed; the original shows a histogram per feature
    plt.title(col)
    plt.show()
[Histograms of each numerical feature]
Observations from the histograms:
1. Longitude: The dataset contains houses located in specific regions (possibly coastal areas or urban zones), as indicated by the bimodal peaks. Houses are not uniformly distributed across all longitudes.
2. Latitude: Shows a similar bimodal pattern, with houses clustered around a few latitude bands rather than spread evenly across the state.
3. Total Rooms: The highly skewed distribution shows most houses have a lower total number of rooms. A few properties with a very high number of rooms could represent outliers (e.g., mansions or multi-unit buildings).
4. Median Income: Most households fall within a low-to-mid income bracket. The steep decline after the peak suggests a small proportion of high-income households in the dataset.
5. Population: Most areas in the dataset have a relatively low population. However, there are some highly populated areas, as evidenced by the long tail. These may represent urban centers.
6. Median House Value: The sharp peak at the end of the histogram suggests that house prices in the dataset are capped at a maximum value, which could limit the variability in predictions (a quick check of this cap follows below).
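The cap on house prices can be checked directly from the data; a quick sketch, assuming the same df, is:
# Count how many rows sit exactly at the maximum recorded house value; a large count indicates a cap
max_value = df['median_house_value'].max()
print("Maximum house value:", max_value)
print("Rows at the maximum:", (df['median_house_value'] == max_value).sum())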
for col in Numerical:
    plt.figure(figsize=(6, 6))
    sns.boxplot(df[col], color='blue')
    plt.title(col)
    plt.ylabel(col)
    plt.show()
[Box plots of each numerical feature]
Observations from the box plots:
1. Total Bedrooms: Numerous data points above the upper whisker indicate a significant presence of outliers with very high total_bedrooms values.
2. Population: There are numerous outliers above the upper whisker, with extreme population values reaching beyond 35,000.
3. Households: There is a significant number of outliers above the upper whisker. These values represent areas with an unusually high number of households.
4. Median Income: There are numerous data points above the upper whisker, marked as circles. These are considered potential outliers.
5. Median House Value: A small cluster of outliers is visible near the maximum value of 500,000.
Program 2:
Develop a program to compute the correlation matrix to understand the relationships between pairs of features. Visualize the correlation matrix using a heatmap to know which variables have strong positive/negative correlations. Create a pair plot to visualize pairwise relationships between features. Use California Housing dataset.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing

# Load the dataset (imports and loader reconstructed; the original excerpt begins at the DataFrame conversion)
data = fetch_california_housing()

# Convert to DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['Target'] = data.target  # Adding the target variable (median house value)
# variable_meaning: a dict mapping each feature name to a short description (its definition is not shown in this excerpt)
variable_df = pd.DataFrame(list(variable_meaning.items()),
                           columns=["Feature", "Description"])
print("\nVariable Meaning Table:")
print(variable_df)
Longitude Target
0 -122.23 4.526
1 -122.22 3.585
2 -122.24 3.521
3 -122.25 3.413
4 -122.25 3.422
# Summary Statistics
print("\nSummary Statistics:")
print(df.describe()) # Summary statistics of dataset
Summary Statistics:
The summary statistics table provides key percentiles and other descriptive
metrics for each numerical feature:
- **25% (First Quartile - Q1):** This represents the value below which 25% of
the data falls. It helps in understanding the lower bound of typical data
values.
- **50% (Median - Q2):** This is the middle value when the data is sorted. It
provides the central tendency of the dataset.
- **75% (Third Quartile - Q3):** This represents the value below which 75% of
the data falls. It helps in identifying the upper bound of typical values in
the dataset.
- These percentiles are useful for detecting skewness, data distribution, and
identifying potential outliers (values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR).
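As a worked example of the IQR rule above (assuming the same df), the outlier bounds for MedInc can be computed as:
# IQR-based outlier bounds for median income
q1 = df['MedInc'].quantile(0.25)
q3 = df['MedInc'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df['MedInc'] < lower) | (df['MedInc'] > upper)]
print(f"MedInc outlier bounds: [{lower:.2f}, {upper:.2f}], potential outliers: {len(outliers)}")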
# Correlation Matrix
plt.figure(figsize=(10, 6))
corr_matrix = df.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title("Correlation Matrix Heatmap")
plt.show()
[Correlation heatmap of the California Housing features]
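The program statement also asks for a pair plot of pairwise feature relationships; a minimal sketch, assuming the same df and a few representative columns (the full 9x9 grid is slow to render), is:
# Pair plot of selected features against the target
sns.pairplot(df[['MedInc', 'AveRooms', 'AveOccup', 'HouseAge', 'Target']], diag_kind='kde')
plt.show()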
Key Insights:
1. The dataset has 20640 rows and 9 columns.
2. No missing values were found in the dataset.
3. Histograms show skewed distributions in some features like 'MedInc'.
4. Boxplots indicate potential outliers in 'AveRooms' and 'AveOccup'.
5. Correlation heatmap shows 'MedInc' has the highest correlation with house
prices.
Program 3
Develop a program to implement Principal Component Analysis (PCA) for reducing the
dimensionality of the Iris dataset from 4 features to 2.
# The goal of using PCA in this exercise is to reduce these four features
# into two principal components.
# This will help in visualizing the data better and understanding its
# underlying structure.
#
# Since humans struggle to visualize data in more than three dimensions,
# reducing the data to 2D allows us to retain the most important patterns
# while making it easier to interpret. PCA helps us achieve this while
# preserving as much variance as possible.
These features were chosen because they effectively differentiate between the three iris
species (Setosa, Versicolor, and Virginica).
In the 3D visualizations, we select three features for plotting, which are:
Feature 1 → Sepal Length
Feature 2 → Sepal Width
Feature 3 → Petal Length
These features are chosen arbitrarily for visualization, but all four features are used in the PCA computation.
Why is the Iris Dataset Important?
The Iris dataset is a benchmark dataset in machine learning. Since it contains three classes (Setosa, Versicolor, and Virginica), PCA helps visualize how well the classes can be separated in a lower-dimensional space.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
plt.legend()
plt.show()
plt.legend()
plt.show()
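Most of the PCA computation itself is not reproduced in this excerpt; a minimal end-to-end sketch of the 4-feature-to-2-component reduction summarised in the recap below, using only the imports listed above, is:
# Load and standardize the four Iris features
iris = datasets.load_iris()
X = StandardScaler().fit_transform(iris.data)

# Reduce from 4 dimensions to 2 principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print("Explained variance ratio:", pca.explained_variance_ratio_)

# Scatter plot of the 2D projection, coloured by species
for label, name in enumerate(iris.target_names):
    plt.scatter(X_pca[iris.target == label, 0], X_pca[iris.target == label, 1], label=name)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Iris data projected onto 2 principal components")
plt.legend()
plt.show()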
# Recap:
# - The Iris dataset is historically important for testing classification models.
# - We standardized the data to ensure fair comparison across features.
# - We calculated the covariance matrix, eigenvalues, and eigenvectors.
# - PCA is built on SVD, which decomposes data into important components.
# - We visualized the original 3D data and superimposed eigenvectors.
# - We applied PCA to reduce the dimensionality from 4D to 2D.
# - Finally, we visualized the transformed data in 2D space.
[3D scatter plots of the original features with superimposed eigenvectors, and the 2D scatter of the PCA-transformed data]
Program 4
For a given set of training data examples stored in a .CSV file, implement and demonstrate
the Find-S algorithm to output a description of the set of all hypotheses consistent with the
training examples.
import pandas as pd

# Load the training examples from a .CSV file (filename assumed; adjust to your data)
data = pd.read_csv('training_data.csv')
print(data)

def find_s_algorithm(data):
    """Implements the Find-S algorithm to find the most specific hypothesis."""
    # Extract feature columns and target column
    attributes = data.iloc[:, :-1].values  # All columns except last
    target = data.iloc[:, -1].values       # Last column (class labels)

    # Loop body reconstructed: start from the first positive example and generalise
    hypothesis = None
    for features, label in zip(attributes, target):
        if str(label).strip().lower() == 'yes':        # positive-example label assumed to be 'Yes'
            if hypothesis is None:
                hypothesis = list(features)
            else:
                hypothesis = [h if h == f else '?' for h, f in zip(hypothesis, features)]
    return hypothesis

print("Most specific hypothesis:", find_s_algorithm(data))
Program 5:
Develop a program to implement k-Nearest Neighbour algorithm to classify the randomly generated 100 values of x in the range of [0,1]. Perform the following based on the dataset generated.
a. Label the first 50 points {x1,......,x50} as follows: if (xi ≤ 0.5), then xi ∊ Class1, else xi ∊ Class2
b. Classify the remaining points, x51,......,x100 using KNN. Perform this for k=1,2,3,4,5,20,30
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings('ignore')
# Generate 100 random values in [0, 1] (generation lines reconstructed; seed 42 reproduces the values printed below)
np.random.seed(42)
values = np.random.rand(100)

labels = []
for i in values[:50]:
    if i <= 0.5:
        labels.append('Class1')
    else:
        labels.append('Class2')
labels += [None] * 50
print(labels)
data = {
"Point": [f"x{i+1}" for i in range(100)],
"Value": values,
"Label": labels
}
print(data)
type(data)
{'Point': ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10',
'x11', 'x12', 'x13', 'x14', 'x15', 'x16', 'x17', 'x18', 'x19', 'x20', 'x21',
'x22', 'x23', 'x24', 'x25', 'x26', 'x27', 'x28', 'x29', 'x30', 'x31', 'x32',
'x33', 'x34', 'x35', 'x36', 'x37', 'x38', 'x39', 'x40', 'x41', 'x42', 'x43',
'x44', 'x45', 'x46', 'x47', 'x48', 'x49', 'x50', 'x51', 'x52', 'x53', 'x54',
'x55', 'x56', 'x57', 'x58', 'x59', 'x60', 'x61', 'x62', 'x63', 'x64', 'x65',
'x66', 'x67', 'x68', 'x69', 'x70', 'x71', 'x72', 'x73', 'x74', 'x75', 'x76',
'x77', 'x78', 'x79', 'x80', 'x81', 'x82', 'x83', 'x84', 'x85', 'x86', 'x87',
'x88', 'x89', 'x90', 'x91', 'x92', 'x93', 'x94', 'x95', 'x96', 'x97', 'x98',
'x99', 'x100'], 'Value': array([0.37454012, 0.95071431, 0.73199394,
0.59865848, 0.15601864,
0.15599452, 0.05808361, 0.86617615, 0.60111501, 0.70807258,
0.02058449, 0.96990985, 0.83244264, 0.21233911, 0.18182497,
0.18340451, 0.30424224, 0.52475643, 0.43194502, 0.29122914,
0.61185289, 0.13949386, 0.29214465, 0.36636184, 0.45606998,
0.78517596, 0.19967378, 0.51423444, 0.59241457, 0.04645041,
0.60754485, 0.17052412, 0.06505159, 0.94888554, 0.96563203,
0.80839735, 0.30461377, 0.09767211, 0.68423303, 0.44015249,
0.12203823, 0.49517691, 0.03438852, 0.9093204 , 0.25877998,
0.66252228, 0.31171108, 0.52006802, 0.54671028, 0.18485446,
0.96958463, 0.77513282, 0.93949894, 0.89482735, 0.59789998,
0.92187424, 0.0884925 , 0.19598286, 0.04522729, 0.32533033,
0.38867729, 0.27134903, 0.82873751, 0.35675333, 0.28093451,
0.54269608, 0.14092422, 0.80219698, 0.07455064, 0.98688694,
0.77224477, 0.19871568, 0.00552212, 0.81546143, 0.70685734,
0.72900717, 0.77127035, 0.07404465, 0.35846573, 0.11586906,
0.86310343, 0.62329813, 0.33089802, 0.06355835, 0.31098232,
0.32518332, 0.72960618, 0.63755747, 0.88721274, 0.47221493,
0.11959425, 0.71324479, 0.76078505, 0.5612772 , 0.77096718,
0.4937956 , 0.52273283, 0.42754102, 0.02541913, 0.10789143]), 'Label':
['Class1', 'Class2', 'Class2', 'Class2', 'Class1', 'Class1', 'Class1',
'Class2', 'Class2', 'Class2', 'Class1', 'Class2', 'Class2', 'Class1',
'Class1', 'Class1', 'Class1', 'Class2', 'Class1', 'Class1', 'Class2',
'Class1', 'Class1', 'Class1', 'Class1', 'Class2', 'Class1', 'Class2',
'Class2', 'Class1', 'Class2', 'Class1', 'Class1', 'Class2', 'Class2',
'Class2', 'Class1', 'Class1', 'Class2', 'Class1', 'Class1', 'Class1',
'Class1', 'Class2', 'Class1', 'Class2', 'Class1', 'Class2', 'Class2',
'Class1', None, None, None, None, None, None, None, None, None, None, None,
None, None, None, None, None, None, None, None, None, None, None, None, None,
None, None, None, None, None, None, None, None, None, None, None, None, None,
None, None, None, None, None, None, None, None, None, None, None, None,
None]}
dict
df = pd.DataFrame(data)
df.head()
df.nunique()
Point 100
Value 100
Label 2
dtype: int64
df.shape
(100, 3)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Point 100 non-null object
1 Value 100 non-null float64
2 Label 50 non-null object
dtypes: float64(1), object(2)
memory usage: 2.5+ KB
df.describe().T
max
Value 0.986887
df.isnull().sum()
Point 0
Value 0
Label 50
dtype: int64
# Histogram of the generated values (loop header and plotting call reconstructed; only the axis labels survived in this excerpt)
for col in ['Value']:
    plt.figure(figsize=(8, 4))
    plt.hist(df[col], bins=20, edgecolor='black')
    plt.title(col)
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.show()
# Split into labelled (training) and unlabelled (test) points; setup lines reconstructed
labeled_df = df[df["Label"].notna()]
X_train, y_train = labeled_df[["Value"]], labeled_df["Label"]

unlabeled_df = df[df["Label"].isna()]
X_test = unlabeled_df[["Value"]]

# Ground-truth labels for the unlabelled points, using the same 0.5 threshold
true_labels = ["Class1" if v <= 0.5 else "Class2" for v in unlabeled_df["Value"]]

k_values = [1, 2, 3, 4, 5, 20, 30]
results, accuracies = {}, {}

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    predictions = knn.predict(X_test)
    results[k] = predictions

    # Calculate accuracy
    accuracy = accuracy_score(true_labels, predictions) * 100
    accuracies[k] = accuracy
print(predictions)
Label_k30
50 Class2
51 Class2
52 Class2
53 Class2
54 Class2
55 Class2
56 Class1
57 Class1
58 Class1
59 Class1
60 Class1
61 Class1
62 Class2
63 Class1
64 Class1
65 Class2
66 Class1
67 Class2
68 Class1
69 Class2
70 Class2
71 Class1
72 Class1
73 Class2
74 Class2
75 Class2
76 Class2
77 Class1
78 Class1
79 Class1
80 Class2
81 Class2
82 Class1
83 Class1
84 Class1
85 Class1
86 Class2
87 Class2
88 Class2
89 Class1
90 Class1
91 Class2
92 Class2
93 Class2
94 Class2
95 Class1
96 Class2
97 Class1
98 Class1
99 Class1
# Display accuracies
print("\nAccuracies for different k values:")
for k, acc in accuracies.items():
    print(f"k={k}: {acc:.2f}%")
Program 6
Implement the non-parametric Locally Weighted Regression algorithm in order to fit data points. Select an appropriate data set for your experiment and draw graphs.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from scipy.spatial.distance import cdist
# Load datasets
df_linear = pd.read_csv("linear_dataset.csv")
df_lwr = pd.read_csv("lwr_dataset.csv")
df_poly = pd.read_csv("polynomial_dataset.csv")
# Linear Regression
def linear_regression(df):
    X, y = df[['X']], df['Y']
    model = LinearRegression()
    model.fit(X, y)
    y_pred = model.predict(X)
    plt.scatter(X, y, label='Data')
    plt.plot(X, y_pred, color='red', label='Linear Regression')
    plt.legend()
    plt.title("Linear Regression")
    plt.show()
linear_regression(df_linear)
for x in X_range:
    x_vec = np.array([1, x])  # Intercept term
    weights = gaussian_kernel(x, X_train[:, 1:], tau).flatten()
    W = np.diag(weights)
locally_weighted_regression(df_lwr[['X']].values, df_lwr['Y'].values)
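The fragment above omits the kernel function and the weighted least-squares solve, so the call does not run as shown. A self-contained sketch of locally weighted regression on the same df_lwr data (gaussian_kernel, tau, and the plotting details are assumptions consistent with the fragment) is:
def gaussian_kernel(x_query, X, tau):
    # Gaussian weights: points near the query get weight close to 1, distant points near 0
    return np.exp(-((X - x_query) ** 2) / (2 * tau ** 2))

def locally_weighted_regression(X, y, tau=0.5):
    X = X.ravel()
    X_aug = np.c_[np.ones_like(X), X]            # design matrix with an intercept column
    X_range = np.linspace(X.min(), X.max(), 200)
    y_fit = []
    for x in X_range:
        weights = gaussian_kernel(x, X, tau)     # weight every training point around x
        W = np.diag(weights)
        # Weighted normal equations: theta = (X^T W X)^(-1) X^T W y
        theta = np.linalg.pinv(X_aug.T @ W @ X_aug) @ X_aug.T @ W @ y
        y_fit.append(np.array([1.0, x]) @ theta)
    plt.scatter(X, y, label='Data')
    plt.plot(X_range, y_fit, color='red', label=f'LWR (tau={tau})')
    plt.legend()
    plt.title("Locally Weighted Regression")
    plt.show()

locally_weighted_regression(df_lwr[['X']].values, df_lwr['Y'].values)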
# Polynomial Regression (function body reconstructed; the original excerpt omits everything after the first line)
def polynomial_regression(df, degree=3):
    X, y = df[['X']], df['Y']
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    y_pred = model.predict(X)
    # Plot the fitted curve over sorted X so it is drawn smoothly
    order = np.argsort(X['X'].values)
    plt.scatter(X, y, label='Data')
    plt.plot(X['X'].values[order], y_pred[order], color='red', label=f'Polynomial (degree {degree})')
    plt.legend()
    plt.title("Polynomial Regression")
    plt.show()

polynomial_regression(df_poly, degree=3)
Program 7:
Develop a program to demonstrate the working of Linear Regression and Polynomial
Regression. Use Boston Housing Dataset for Linear Regression and Auto MPG Dataset (for
vehicle fuel efficiency prediction) for Polynomial Regression.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

# Load the Boston Housing data (CSV filename assumed; the copy used here has missing values in CRIM, ZN, INDUS, CHAS, AGE and LSTAT)
data = pd.read_csv('HousingData.csv')
data.head()
B LSTAT MEDV
0 396.90 4.98 24.0
1 396.90 9.14 21.6
data.shape
(506, 14)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CRIM 486 non-null float64
1 ZN 486 non-null float64
2 INDUS 486 non-null float64
3 CHAS 486 non-null float64
4 NOX 506 non-null float64
5 RM 506 non-null float64
6 AGE 486 non-null float64
7 DIS 506 non-null float64
8 RAD 506 non-null int64
9 TAX 506 non-null int64
10 PTRATIO 506 non-null float64
11 B 506 non-null float64
12 LSTAT 486 non-null float64
13 MEDV 506 non-null float64
dtypes: float64(12), int64(2)
memory usage: 55.5 KB
1. The dataset contains 506 entries and 14 columns, with 6 columns (CRIM, ZN, INDUS, CHAS, AGE, LSTAT) having 20 missing values each.
2. Most columns are continuous (float64), while RAD and TAX are discrete (int64).
3. MEDV (median home value) is the target variable, likely influenced by features like RM (average rooms) and LSTAT (lower-status population).
4. Missing values need to be addressed through imputation or by dropping rows with missing data.
5. Exploratory analysis and modeling can help understand feature relationships and predict MEDV.
data.nunique()
CRIM 484
ZN 26
INDUS 76
CHAS 2
NOX 81
RM 446
AGE 348
DIS 412
RAD 9
TAX 66
PTRATIO 46
B 357
LSTAT 438
MEDV 229
dtype: int64
data.CHAS.unique()
data.ZN.unique()
Data Cleaning
Checking Null values
data.isnull() - Returns a DataFrame of the same shape as data, where each element is True
if it's NaN and False otherwise.
.sum() - Sums up the True values (which are treated as 1 in Python) column-wise, giving
the total count of missing values for each column.
data.isnull().sum()
CRIM 20
ZN 20
INDUS 20
CHAS 20
NOX 0
RM 0
AGE 20
DIS 0
RAD 0
TAX 0
PTRATIO 0
B 0
LSTAT 20
MEDV 0
dtype: int64
data.duplicated().sum()
np.int64(0)
df = data.copy()
df['CRIM'].fillna(df['CRIM'].mean(), inplace=True)
df['ZN'].fillna(df['ZN'].mean(), inplace=True)
df['CHAS'].fillna(df['CHAS'].mode()[0], inplace=True)
df['INDUS'].fillna(df['INDUS'].mean(), inplace=True)
df['AGE'].fillna(df['AGE'].median(), inplace=True)  # Median is often preferred for skewed distributions
df['LSTAT'].fillna(df['LSTAT'].median(), inplace=True)
df.isnull().sum()
CRIM 0
ZN 0
INDUS 0
CHAS 0
NOX 0
RM 0
AGE 0
DIS 0
RAD 0
TAX 0
PTRATIO 0
B 0
LSTAT 0
MEDV 0
dtype: int64
df.head()
B LSTAT MEDV
0 396.90 4.98 24.0
1 396.90 9.14 21.6
2 392.83 4.03 34.7
3 394.63 2.94 33.4
4 396.90 11.43 36.2
df['CHAS'] = df['CHAS'].astype('int')
df.describe().T
75% max
CRIM 3.611874 88.9762
ZN 11.211934 100.0000
INDUS 18.100000 27.7400
CHAS 0.000000 1.0000
NOX 0.624000 0.8710
RM 6.623500 8.7800
AGE 93.575000 100.0000
DIS 5.188425 12.1265
RAD 24.000000 24.0000
TAX 666.000000 711.0000
PTRATIO 20.200000 22.0000
B 396.225000 396.9000
LSTAT 16.570000 37.9700
MEDV 25.000000 50.0000
for i in df.columns:
    plt.figure(figsize=(6, 3))
    plt.subplot(1, 2, 1)
    df[i].hist(bins=20, alpha=0.5, color='b', edgecolor='black')
    plt.title(f'Histogram of {i}')
    plt.xlabel(i)
    plt.ylabel('Frequency')
    plt.subplot(1, 2, 2)
    plt.boxplot(df[i], vert=False)
    plt.title(f'Boxplot of {i}')
    plt.show()
corr = df.corr(method='pearson')
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.xticks(rotation=90, ha='right')
plt.yticks(rotation=0)
plt.title("Correlation Matrix Heatmap")
plt.show()
Feature scaling puts all features on a comparable scale, which is important for distance-based methods and regularized models. While standard linear regression may not be heavily affected, scaling ensures more consistent results.
# Separate the features and the target MEDV (these two lines reconstructed)
X = df.drop('MEDV', axis=1)
y = df['MEDV']

# Scale the features
scale = StandardScaler()
X_scaled = scale.fit_transform(X)
# Split the data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Fit the linear regression model (construction and fit reconstructed; the line below is the fitted estimator echoed by the notebook)
model = LinearRegression()
model.fit(X_train, y_train)
LinearRegression()
y_pred = model.predict(X_test)  # predictions on the test set (reconstructed)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
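The polynomial-regression half of this program (Auto MPG) is not reproduced above; a minimal sketch, assuming the UCI Auto MPG data saved as auto-mpg.csv with its usual mpg and horsepower columns, is:
# Polynomial Regression on vehicle fuel efficiency
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

auto = pd.read_csv('auto-mpg.csv')
# The UCI file marks missing horsepower values with '?'; coerce them to NaN and drop those rows
auto['horsepower'] = pd.to_numeric(auto['horsepower'], errors='coerce')
auto = auto.dropna(subset=['horsepower', 'mpg'])

X_auto = auto[['horsepower']]
y_auto = auto['mpg']
X_tr, X_te, y_tr, y_te = train_test_split(X_auto, y_auto, test_size=0.2, random_state=42)

poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X_tr, y_tr)
print("Polynomial Regression R^2:", r2_score(y_te, poly_model.predict(X_te)))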
Program 8:
Develop a program to demonstrate the working of the decision tree algorithm. Use Breast
Cancer Data set for building the decision tree and apply this knowledge to classify a new
sample.
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)

# Load the Breast Cancer data (CSV filename assumed; the Wisconsin diagnostic set with id and diagnosis columns)
data = pd.read_csv('data.csv')
data.head()
fractal_dimension_worst
0 0.11890
1 0.08902
2 0.08758
3 0.17300
4 0.07678
data.shape
(569, 32)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 569 non-null int64
1 diagnosis 569 non-null object
data.diagnosis.unique()
Data Preprocessing
Data Cleaning
data.isnull().sum()
id 0
diagnosis 0
radius_mean 0
texture_mean 0
perimeter_mean 0
area_mean 0
smoothness_mean 0
compactness_mean 0
concavity_mean 0
concave_points_mean 0
symmetry_mean 0
fractal_dimension_mean 0
radius_se 0
texture_se 0
perimeter_se 0
area_se 0
smoothness_se 0
compactness_se 0
concavity_se 0
concave_points_se 0
symmetry_se 0
fractal_dimension_se 0
radius_worst 0
texture_worst 0
perimeter_worst 0
area_worst 0
smoothness_worst 0
compactness_worst 0
concavity_worst 0
concave_points_worst 0
symmetry_worst 0
fractal_dimension_worst 0
dtype: int64
data.duplicated().sum()
np.int64(0)
df = data.drop(['id'], axis=1)
Descriptive Statistics
df.describe().T
# Encode the diagnosis column as numeric (M = 1, B = 0) so it can be used for correlation and modelling; encoding line reconstructed
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})

corr = df.corr(method='pearson')
plt.figure(figsize=(18, 10))
sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.xticks(rotation=90, ha='right')
plt.yticks(rotation=0)
plt.title("Correlation Matrix Heatmap")
plt.show()
# Separate the features and the target (these two lines reconstructed)
X = df.drop('diagnosis', axis=1)
y = df['diagnosis']

# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build and fit the decision tree (construction and fit reconstructed; the line below is the fitted estimator echoed by the notebook)
model = DecisionTreeClassifier(criterion='entropy')
model.fit(X_train, y_train)
DecisionTreeClassifier(criterion='entropy')
import math
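information_gain is not defined in this excerpt; one simple way to compute it for continuous features, a sketch that splits each feature at its median rather than searching every threshold the way a decision tree would, is:
def entropy(labels):
    # Shannon entropy of a pandas Series of class labels
    probs = labels.value_counts(normalize=True)
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(df, feature, target):
    # Information gain of a binary split of the continuous feature at its median
    parent = entropy(df[target])
    threshold = df[feature].median()
    left = df[df[feature] <= threshold][target]
    right = df[df[feature] > threshold][target]
    children = (len(left) * entropy(left) + len(right) * entropy(right)) / len(df)
    return parent - children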
for feature in X:
    ig = information_gain(df, feature, 'diagnosis')
    print(f"Information Gain for {feature}: {ig}")
y_pred = model.predict(X_test)
y_pred
array([0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0,
1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1,
0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0,
1, 0, 0, 1])
print("Accuracy:", accuracy_score(y_test, y_pred) * 100)   # print calls reconstructed
print("Classification Report:")
print(classification_report(y_test, y_pred))
Accuracy: 94.73684210526315
Classification Report:
              precision    recall  f1-score   support
df.head(1)
fractal_dimension_worst
0 0.1189
new = [[12.5, 19.2, 80.0, 500.0, 0.085, 0.1, 0.05, 0.02, 0.17, 0.06,
0.4, 1.0, 2.5, 40.0, 0.006, 0.02, 0.03, 0.01, 0.02, 0.003,
16.0, 25.0, 105.0, 900.0, 0.13, 0.25, 0.28, 0.12, 0.29, 0.08]]
y_pred = model.predict(new)
print("Prediction:", "Malignant" if y_pred[0] == 1 else "Benign")  # label decoding reconstructed (1 = M, 0 = B)
Prediction: Benign
Program 9:
Develop a program to implement the Naive Bayesian classifier considering Olivetti
Face Data set for training. Compute the accuracy of the classifier, considering a few
test data sets.
The Olivetti Face Dataset is a collection of images of faces, used primarily for face
recognition tasks. The dataset contains 400 images of 40 different individuals, with 10
images per person. The dataset was created for research in machine learning and pattern
recognition, especially in the context of facial recognition.
The Olivetti dataset provides the following key features:
* 400 Images: Each image is a grayscale photo of a person's face.
* 40 People: The dataset contains 40 different individuals, and each individual has 10 different images.
* Image Size: Each image is 64x64 pixels, resulting in 4096 features (flattened vector) per image.
* Target Labels: Each image is associated with a label representing the individual (0 to 39).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_olivetti_faces
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix

# Load the Olivetti faces dataset (loader line reconstructed; downloads on first use)
data = fetch_olivetti_faces()
data.keys()
def print_faces(images, target, top_n):
    # Display the first top_n faces in a 20x20 grid (function header and figure setup reconstructed; grid size assumed for 400 images)
    fig, axes = plt.subplots(20, 20, figsize=(15, 15))
    for i, ax in enumerate(axes.ravel()):
        if i < top_n:
            ax.imshow(images[i], cmap='bone')
            ax.axis('off')
            ax.text(2, 12, str(target[i]), fontsize=9, color='red')
            ax.text(2, 55, f"face: {i}", fontsize=9, color='blue')
        else:
            ax.axis('off')
    plt.show()
print_faces(data.images,data.target,400)
display_unique_faces(data.images)
print("x_train: ",x_train.shape)
print("x_test: ",x_test.shape)
# Train a Gaussian Naive Bayes classifier (construction and fit reconstructed)
nb = GaussianNB()
nb.fit(x_train, y_train)
y_pred = nb.predict(x_test)

# Calculate accuracy
nb_accuracy = round(accuracy_score(y_test, y_pred) * 100, 2)
Confusion Matrix:
[[3 0 0 ... 0 0 0]
[0 1 0 ... 0 0 0]
[0 0 1 ... 0 0 0]
...
[0 0 0 ... 2 0 0]
[0 0 0 ... 0 3 0]
[1 0 0 ... 0 0 1]]
Naive Bayes Accuracy: 73.33%
# Train a Multinomial Naive Bayes classifier for comparison (construction and fit reconstructed; pixel intensities lie in [0, 1], so MultinomialNB accepts them)
mnb = MultinomialNB()
mnb.fit(x_train, y_train)
y_pred = mnb.predict(x_test)

# Calculate accuracy
accuracy = round(accuracy_score(y_test, y_pred) * 100, 2)
print(f"Multinomial Naive Bayes Accuracy: {accuracy}%")
Program 10:
Develop a program to implement k-means clustering using Wisconsin Breast Cancer data
set and visualize the clustering result.
1. Load the dataset – Use sklearn.datasets to fetch the Wisconsin Breast Cancer
dataset.
2. Preprocess the data – Normalize features for better clustering.
3. Apply K-Means algorithm – Use KMeans from sklearn.cluster.
4. Evaluate clustering performance – Compare with actual labels using ARI or
silhouette score.
5. Visualize clusters – Use PCA for dimensionality reduction and plot clusters.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Load dataset
data = load_breast_cancer()
X = data.data
feature_names = data.feature_names
df = pd.DataFrame(X, columns=feature_names)
df.head()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # standardize features before clustering (this line reconstructed)

# K-Means with 2 clusters, matching the two diagnosis classes (constructor reconstructed; random_state assumed)
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
y_pred = kmeans.fit_predict(X_scaled)
df['cluster'] = y_pred
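Step 4 of the outline asks for a comparison with the actual labels; a short sketch using the Adjusted Rand Index and the silhouette score, assuming the variables defined above, is:
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Agreement between the K-Means clusters and the true benign/malignant labels
print("Adjusted Rand Index:", round(adjusted_rand_score(data.target, y_pred), 3))
# Internal cluster quality, independent of the true labels
print("Silhouette Score:", round(silhouette_score(X_scaled, y_pred), 3))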
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
plt.figure(figsize=(10, 6))
# Plot each cluster separately so the legend shows the cluster labels (scatter calls reconstructed)
for cluster in np.unique(y_pred):
    pts = X_pca[y_pred == cluster]
    plt.scatter(pts[:, 0], pts[:, 1], label=f"Cluster {cluster}", alpha=0.6)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("K-Means Clustering of the Wisconsin Breast Cancer Data (PCA projection)")
plt.legend(title="Cluster")
plt.show()