
MACHINE LEARNING

EXPERIMENT FILE

Submitted To: Dr. Anurag Jain

Submitted By:
Name: Dhruv Sharma
Course: BTech CSE (AI&ML Non-Honours)
SAP ID: 500107715
ROLL NO: R2142220916
Batch: 05
Lab 1: Exploratory Data Analysis
Objective:

The objective of this lab is to perform exploratory data analysis (EDA) on a given dataset.
EDA helps in understanding the structure of the data, identifying patterns, and gaining
insights that can be useful for further analysis.

Dataset:

The dataset for this assignment is a fictional dataset containing information about students. It
includes the following columns:

• Student ID

• Gender

• Age

• Study Hours

• Score

Student ID  Gender  Age  Study Hours  Score
1 Male 19 3 85
2 Female 20 4 77
3 Female 22 3 90
4 Male 21 1 86
5 Female 22 9 98
6 Female 24 9 75
7 Female 22 53
8 Female 25 9 89
9 Female 25 3
10 Female 22 9 100
11 Female 21 5 80
12 Male 22 4 79
13 Male 22 99
14 Female 22 5 83
15 Male 18 4 68
16 Male 24 7 67
17 Male 25 10 79
18 Male 22 9 70
19 Male 21 1 52
20 Female 20 9 55
21 Male 25 6 87
22 Female 25 10 62
23 Female 23 1 94
24 Male 23 10 52
25 Male 23 7 97
26 Female 18 6 77
27 Female 19 4 71
28 Female 23 2 89
29 Female 19 9 89
30 Male 21 1 61
31 Female 18 5 72
32 Male 25 10 80
33 Female 23 7 67
34 Male 24 6 56
35 Female 18 8 57
36 Female 19 9 68
37 Male 20 9 78
38 Female 22 10 93
39 Female 20 3 69
40 Male 18 9 99
41 Male 23 7 91
42 Female 21 7 79
43 Male 20 10 96
44 Female 20 2 71
45 Female 23 7 59
46 Female 18 9 75
47 Female 25 9 100
48 Female 23 4 82
49 Male 19 3 77
50 Female 25 4 59
51 Male 18 7 78
52 Female 20 4 93
53 Female 20 7 67
54 Female 21 6 91
55 Female 20 8 50
56 Male 25 1 72
57 Female 25 9 66
58 Male 21 5 92
59 Male 20 7 86
60 Female 19 6 80
61 Female 20 9 74
62 Male 24 3 53
63 Female 21 4 58
64 Male 24 10 77
65 Female 21 8 79
66 Male 21 6 96
67 Male 20 4 73
68 Male 24 5 82
69 Male 21 6 69
70 Male 22 4 58
71 Female 19 4 57
72 Female 20 8 73
73 Male 21 10 63
74 Male 24 10 67
75 Male 19 10 50
76 Female 20 8 61
77 Female 19 4 78
78 Male 22 3 86
79 Female 20 4 75
80 Male 24 10 82
81 Male 21 8 92
82 Female 18 8 64
83 Male 21 6 72
84 Female 20 2 78
85 Female 21 3 70
86 Female 18 3 68
87 Female 18 9 54
88 Female 24 2 72
89 Female 18 6 85
90 Male 24 9 69
91 Female 21 5 57
92 Female 24 1 58
93 Male 20 3 63
94 Male 21 6 55
95 Female 18 6 50
96 Male 22 1 58
97 Male 18 9 65
98 Female 23 2 65
99 Female 24 2 61
100 Male 18 1 54

Tasks:

1. Load the dataset and display the first few rows to understand its structure.

2. Check for missing values and handle them appropriately.

3. Summarize the statistical properties of numerical variables (mean, median, min, max,
etc.).

4. Visualize the distribution of numerical variables using histograms.

5. Explore the relationship between numerical variables using scatter plots.

6. Analyze the distribution of categorical variables using bar plots.

7. Calculate summary statistics for categorical variables.


8. Explore potential correlations between variables using correlation matrices and
heatmaps.

Algorithm, Code and Output Graphs:

ALGORITHM:

1. Load the Dataset:


Read the dataset containing information about students, including columns like
Student ID, Gender, Age, Study Hours, and Score.

2. Display Dataset Structure:


Display the first few rows of the dataset to understand its structure.

3. Check for Missing Values:


Identify any missing values in the dataset and handle them appropriately. This may
involve imputation techniques such as mean, median, or mode imputation, or
dropping rows/columns with missing values depending on the context.

4. Summarize Statistical Properties:


Calculate summary statistics for numerical variables such as mean, median,
minimum, maximum, and quartiles. This provides insights into the central tendency
and spread of the data.

5. Visualize Numerical Variables:


Create histograms to visualize the distribution of numerical variables such as Age,
Study Hours, and Score. Histograms help in understanding the frequency distribution
of data.

6. Explore Relationship between Numerical Variables:


Use scatter plots to explore the relationship between numerical variables. For
example, analyze how Study Hours relate to Score or Age.

7. Analyze Distribution of Categorical Variables:


Create bar plots to analyze the distribution of categorical variables such as Gender.
This helps in understanding the frequency of each category within the dataset.

8. Calculate Summary Statistics for Categorical Variables:


Calculate summary statistics (e.g., frequency counts) for categorical variables. This
provides insights into the distribution of categories.

9. Explore Correlations Between Variables:


Create correlation matrices and heatmaps to explore potential correlations between
variables. This helps in understanding the strength and direction of relationships
between variables.
10. Output Graphs and Results:
Generate visualizations and summary statistics as output. This could include
histograms, scatter plots, bar plots, correlation matrices, and heatmaps to provide
insights into the dataset.

Code and Output:

1. Load the dataset and display the first few rows to understand its structure.

INPUT:

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

# Load the dataset (raw string so the backslashes in the Windows path are not treated as escapes)

df = pd.read_csv(r"C:\ML Dataset(Umang)\fictional_student_data.csv")

#Display the first few rows

print(df.head())

OUTPUT:

2. Check for missing values and handle them appropriately.


INPUT:

#Check for missing values


print(df.isnull().sum())

OUTPUT:
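The check above only reports where values are missing. Since Study Hours and Score each contain blank entries in the table, a minimal sketch of actually filling them (illustrated on a toy frame mirroring those gaps, since the real file is not reproduced here) could be:

```python
import pandas as pd

# Toy frame mirroring the gaps seen in the student table (rows 7, 9 and 13).
toy = pd.DataFrame({
    "Study Hours": [3.0, None, 9.0, 4.0],
    "Score": [85.0, 53.0, None, 79.0],
})

# Fill numeric gaps with the column median (robust to outliers);
# dropping the affected rows instead would be toy.dropna().
for col in ["Study Hours", "Score"]:
    toy[col] = toy[col].fillna(toy[col].median())

print(toy.isnull().sum().sum())  # 0
```

Median imputation is one reasonable default here; mean or mode imputation, or dropping rows, may suit other contexts as the algorithm section notes.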

3. Summarize the statistical properties of numerical variables (mean, median, min, max,
etc.).

INPUT:

# Statistical properties
print(df.describe())

OUTPUT:

4. Visualize the distribution of numerical variables using histograms.

INPUT:

plt.figure(figsize=(12, 6))

plt.subplot(2, 2, 1)

sns.histplot(df['Age'], kde=True)

plt.title('Age Distribution')
plt.subplot(2, 2, 2)

sns.histplot(df['Study Hours'], kde=True)

plt.title('Study Hours Distribution')

plt.subplot(2, 2, 3)

sns.histplot(df['Score'], kde=True)

plt.title('Score Distribution')

plt.tight_layout()

plt.show()

OUTPUT:

5. Explore the relationship between numerical variables using scatter plots.

INPUT:
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)

sns.scatterplot(x='Study Hours', y='Score', data=df)

plt.title('Study Hours vs Score')

plt.subplot(1, 2, 2)

sns.scatterplot(x='Age', y='Score', data=df)

plt.title('Age vs Score')

plt.tight_layout()

plt.show()

OUTPUT:
6. Analyze the distribution of categorical variables using bar plots.

INPUT:

plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)

sns.countplot(x='Gender', data=df)

plt.title('Gender Distribution')

plt.tight_layout()

plt.show()

OUTPUT:

7. Calculate summary statistics for categorical variables.

INPUT:

print("Summary statistics for categorical variables:")


print(df['Gender'].value_counts())

OUTPUT:

8. Explore potential correlations between variables using correlation matrices and


heatmaps.

INPUT:

numeric_df = df.select_dtypes(include=['number'])

plt.figure(figsize=(8, 6))

sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm', linewidths=0.5)

plt.title('Correlation Matrix')

plt.show()

OUTPUT:
Outcome:

This lab provides a comprehensive exploration of the dataset through various EDA
techniques in Python. It includes handling missing values, summarizing statistical properties,
visualizing distributions, analyzing relationships between variables, and exploring
correlations.

Lab 2: Data preprocessing activities


Objective:

The objective of this lab is to perform various data preprocessing activities on a given dataset
using Python. Data preprocessing is essential for cleaning, transforming, and preparing the
data for machine learning models.

Dataset:

The dataset for this assignment is a fictional dataset containing information about customers.
It includes the following columns:

• Customer ID

• Age
• Gender

• Income

• Spending Score

Customer ID  Age  Gender  Income  Spending Score
1 1.178106 0 0.762368 1.345439
2 1.361421 1 -1.40632 1.038137
3 -1.51052 0 0.48104 -1.76173
4 -1.3272 1 -0.03079 0.525966
5 -1.3272 0 -1.08667
6 0.87258 0 1.209625 -0.32765
7 -0.96057 0 0.53396 0.423532
8 -0.34952 0 1.154609 1.27715
9 -0.22731 0 1.129617 1.072281
10 1.544736 1 -0.49837
11 0.689265 0 -0.91944 -0.49837
12 -0.1051 0 1.217006 0.730834
13 -1.14389 0 -1.01454 -1.07883
14 -0.044 0 -1.66144 -0.80568
15 -0.044 1 0.442462 -1.24956
16 -0.77726 0 -1.17854 0.560111
17 -1.44941 0 -1.1585 0.594256
18 0.811475 1 -0.2535 -1.21541
19 0.87258 1 -1.31864 1.311294
20 -0.1051 0 -1.08801 -0.3618
21 1.300316 1 -0.13164 1.003992
22 -0.044 1 0.780819 -0.15693
23 -0.47173 1 -0.13923 -0.43009
24 0.75037 1 -1.72199 1.345439
25 0.017109 0 0.073829 -0.25936
26 -0.71615 0 1.37199 -1.11298
27 -1.02168 1 -0.29547 0.662545
28 -0.96057 1 -0.14653 1.20886
29 -0.28842 0 -1.02985 1.447873
30 1.605841 1 -1.07442 1.27715
31 -0.53284 0 -0.56393 0.662545
32 1.605841 0 -1.10784 -0.87397
33 -1.20499 0 1.612266 0.935703
34 -0.59394 0 0.150375
35 1.361421 1 0.951402 -1.83002
36 -1.51052 1 -1.27822 1.106426
37 -0.41063 1 -1.08159 0.321098
38 0.62816 0 0.663952 -1.2837
39 -0.044 1 -0.60081
40 1.483631 0 -1.52767 1.379584
41 1.605841 0 -0.27191 -1.01055
42 0.26153 0 -1.66869 0.321098
43 -0.34952 1 0.575976 0.457677
44 -0.34952 0 -1.21418 -0.08864
45 -0.65505 1 1.2516 -1.55686
46 0.87258 1 0.262947 0.082085
47 0.444845 0 -0.89608 1.27715
48 -1.44941 1 -1.62529 1.345439
49 -0.96057 0 -0.45214 1.140571
50 0.444845 0 1.590502 -0.73739
51 0.38374 1 1.272986 -1.18127
52 -0.89947 1 -0.31887 0.6284
53 1.666946 0 0.219714 0.594256
54 -0.1051 1 -0.54024 1.140571
55 0.62816 0 0.895296 -1.38614
56 -0.83836 1 -1.07086 0.150375
57 1.544736 0 0.991532 0.935703
58 0.200425 1 1.382263 0.047941
59 0.567055 0 0.900244 0.355243
60 -1.51052 1 -0.31556 0.730834
61 -1.51052 0 0.089596 1.311294
62 0.689265 1 1.191091 -0.60081
63 -1.20499 1 -1.00439 -0.9764
64 0.811475 1 -0.49021 -0.73739
65 0.933685 0 -0.13022 -0.39594
66 1.666946 1 1.458791 -1.35199
67 -0.47173 0 0.808243 1.106426
68 -0.59394 1 -1.65272 -0.87397
69 -1.2661 0 0.160127 -1.14712
70 0.99479 1 -1.57871 0.969847
71 1.055895 1 0.006527 0.491822
72 0.38374 1 1.425245 -1.07883
73 -1.44941 1 -1.5244 1.550307
74 -1.44941 1 0.076094 1.003992
75 0.87258 0 1.469484 -1.591
76 0.99479 1 0.057853 0.6284
77 0.62816 1 1.224847 0.252809
78 0.811475 0 1.26548 -1.38614
79 -0.83836 1 0.93031 -1.65929
80 1.300316 1 1.21038 -1.83002
81 -0.41063 1 0.245712 -1.55686
82 0.13932 0 1.358236 0.867413
83 -1.51052 1 -1.70417 0.867413
84 -0.65505 1 0.917436 -0.02035
85 0.62816 0 0.279846 -1.45443
86 -0.77726 1 -0.47771 -1.69344
87 1.055895 0 -0.3619 -0.49837
88 -0.28842 1 -0.61731 1.311294
89 -0.83836 1 -0.10455 -0.29351
90 -1.2661 0 0.757588 -0.94226
91 -1.14389 0 1.580438 0.69669
92 -1.2661 1 1.149829 -0.05449
93 1.361421 1 -1.48989 -0.15693
94 1.666946 1 -0.24801 1.27715
95 -1.3272 0 0.041373 -0.08864
96 -0.77726 0 0.65653 1.550307
97 0.689265 1 0.684247 -1.21541
98 1.666946 1 -0.80647 -0.6691
99 0.933685 1 0.460367 -0.08864
100 -0.65505 1 0.545575 -0.80568

Tasks:

1. Load the dataset and display the first few rows to understand its structure.

2. Check for missing values and handle them appropriately.

3. Encode categorical variables using one-hot encoding or label encoding.

4. Normalize numerical variables to bring them to a similar scale.

5. Split the dataset into independent variables (features) and the dependent variable
(target).

6. Split the dataset into training and testing sets.

Algorithm, Code and Output graphs:

ALGORITHM:

1. Load the Dataset:


Read the dataset containing information about customers.

2. Display Dataset Structure:


Display the first few rows of the dataset to understand its structure, including
columns like Customer ID, Age, Gender, Income, and Spending Score.

3. Check for Missing Values:


Identify any missing values in the dataset and handle them appropriately. This
may involve imputation techniques such as mean, median, or mode
imputation, or dropping rows/columns with missing values depending on the
context.

4. Encode Categorical Variables:


Encode categorical variables using techniques like one-hot encoding or label
encoding. For example, if the Gender column contains categorical values like
'Male' and 'Female', encode them into numerical representations.

5. Normalize Numerical Variables:


Normalize numerical variables to bring them to a similar scale. This can help
improve the performance of machine learning models. Common normalization
techniques include Min-Max scaling or z-score scaling.

6. Split Dataset:
Separate the dataset into independent variables (features) and the dependent
variable (target). The features will include columns like Age, Gender, Income,
and Spending Score, while the target variable may be a specific column (e.g.,
Spending Score) that we want to predict or analyze.

7. Split Dataset into Training and Testing Sets:

Further split the dataset into training and testing sets. The training set will be
used to train machine learning models, while the testing set will be used to
evaluate their performance.

8. Output Graphs and Results:


Optionally, visualize the data or any preprocessing transformations performed.
This could include histograms of numerical variables, bar plots of categorical
variables, or scatter plots to explore relationships between variables.

CODE AND OUTPUT:

1. Load the dataset and display the first few rows to understand its structure.

CODE:

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder

# Load the dataset (filename assumed; adjust to the actual customer data file)
df = pd.read_csv(r"C:\ML Dataset(Umang)\fictional_customer_data.csv")

# Display the first few rows
print(df.head())
OUTPUT:
2. Check for missing values and handle them appropriately.

CODE:

# Check for missing values

print(df.isnull().sum())

OUTPUT:

3. Encode categorical variables using one-hot encoding or label encoding.


CODE:

label_encoder = LabelEncoder()
df['Gender'] = label_encoder.fit_transform(df['Gender'])
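The label-encoding step above maps the two categories to 0 and 1. As a sketch of the one-hot alternative mentioned in the task, here using pandas' get_dummies on a toy Gender column rather than scikit-learn's OneHotEncoder:

```python
import pandas as pd

toy = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

# Label encoding: one integer per category (alphabetical, as LabelEncoder would assign).
toy["Gender_label"] = toy["Gender"].map({"Female": 0, "Male": 1})

# One-hot encoding: one 0/1 indicator column per category.
one_hot = pd.get_dummies(toy["Gender"], prefix="Gender")

print(one_hot.columns.tolist())  # ['Gender_Female', 'Gender_Male']
```

One-hot encoding avoids implying an order between categories, which matters for models sensitive to numeric magnitude; for a binary column like Gender the two encodings are effectively equivalent.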

4. Normalize numerical variables to bring them to a similar scale.

CODE:

scaler = StandardScaler()

numerical_cols = ['Age', 'Income', 'Spending Score']

df[numerical_cols] = scaler.fit_transform(df[numerical_cols])
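StandardScaler applies z-score scaling (zero mean, unit variance). A small sketch contrasting it with the Min-Max alternative named in the algorithm, using a few hypothetical raw incomes:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative raw incomes (hypothetical values, a single feature column).
income = np.array([[24420.0], [82079.0], [136939.0], [147948.0]])

# Min-Max scaling: rescales the column onto [0, 1].
minmax = MinMaxScaler().fit_transform(income)

# z-score scaling: zero mean, unit variance (what StandardScaler applies).
zscore = StandardScaler().fit_transform(income)

print(minmax.min(), minmax.max())  # 0.0 1.0
```

Min-Max keeps values bounded, while z-score scaling is less distorted by outliers in the middle of the range; either brings the columns to a comparable scale.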
5. Split the dataset into independent variables (features) and the dependent variable
(target).

CODE:

# Features: everything except the identifier and the target column
X = df.drop(columns=['Customer ID', 'Spending Score'])

# Target: Spending Score, the variable to predict (as stated in the algorithm above)
y = df['Spending Score']

6. Split the dataset into training and testing sets.

CODE:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Outcome:

This lab demonstrates common data preprocessing activities such as handling missing values,
encoding categorical variables, normalizing numerical variables, and splitting the dataset into
training and testing sets using Python. It ensures that the data is ready for further analysis or
modeling tasks.

Lab 3: Linear and Multiple Linear


Regression
Objective:

The objective of this lab is to implement Linear and Multiple Linear Regression on a given
dataset using Python. Linear regression is a simple, yet powerful technique used for
predicting a continuous target variable based on one or more independent variables.
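In the single-feature case the model is simply Price ≈ b0 + b1·Area. A quick least-squares sketch on a few (Area, Price) pairs taken from the dataset table below:

```python
import numpy as np

# A few (area, price) pairs from the house table.
area = np.array([777.0, 1335.0, 2153.0, 2862.0])
price = np.array([90589.14, 154559.3, 222685.2, 304538.8])

# Least-squares fit of price = b0 + b1 * area (a degree-1 polynomial).
b1, b0 = np.polyfit(area, price, 1)

print(b1 > 0)  # True: price rises with area
```

The fitted slope b1 is roughly 100 dollars per square foot on these four points, which is the quantity the LinearRegression model below estimates from the full dataset.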

Dataset:

The dataset for this assignment is a fictional dataset containing information about houses. It
includes the following columns:
• Area (in square feet)

• Number of bedrooms

• Number of bathrooms

• Price (in dollars)

Area Bedrooms Bathrooms Price


2153 4 2 222685.2
1335 4 2 154559.3
1263 3 3 137714
2231 4 3 247419.8
1533 5 3 171936.3
777 2 1 90589.14
2278 3 2 232003.7
2328 4 3 254563.4
2862 2 2 304538.8
1205 3 2 145637.9
2635 2 2 267482.9
2722 5 2 282728.4
2201 3 1 235196.5
1037 4 3 128289.3
2620 1 3 261140.8
2440 4 1 264294
2663 3 2 283769.5
1476 4 2 152904.1
1255 1 1 142734.3
2546 1 1 258275.7
2371 1 1 237051
2996 4 3 312152.4
599 3 2 75671.36
2508 4 3 261428.9
1255 1 2 132452.5
1297 5 3 147189.6
1159 1 1 129346.3
923 1 3 112386.4
1139 3 1 132724.2
1044 4 2 106590.5
1214 3 2 134299.4
2792 4 2 286137.6
651 1 1 70622.25
1707 1 1 162284.2
2576 1 1 276429.4
1302 4 1 145069.1
2676 1 1 259275.5
2676 3 1 281638.2
2456 3 2 267929.2
2425 1 1 246277.8
1256 5 3 162926.4
773 4 3 93642.67
2883 5 2 322937.8
888 1 2 95563.09
2141 5 2 242304
1966 4 3 213534.1
1388 4 2 158737.5
757 5 2 85060.21
1845 2 1 185425.8
2605 4 2 258856.3
2839 1 3 277562.6
930 1 2 94844.36
591 1 2 57553.66
2420 2 1 246512
2946 1 1 299049.1
2089 5 2 227720.7
584 2 2 49456.82
2751 4 3 300483.1
824 2 2 88236.88
1274 1 3 151969.3
1571 1 3 156123.4
1139 5 2 113129.5
2536 4 3 280712.3
1655 4 3 175218.7
1472 2 1 152027.6
1704 1 2 183700.2
2524 1 2 241440.5
1667 2 2 179946.9
2184 3 3 230123.3
873 1 2 100330.7
2377 4 2 261755.5
1060 2 2 121746.9
1829 2 3 211466.8
2105 5 1 219310.8
2717 1 2 275602.8
2199 1 1 236063.7
1972 4 2 208690.2
756 3 2 83447.72
2405 4 3 251109.1
1816 3 1 215577.7
2454 5 3 234348.7
1316 4 1 133860
2935 4 3 283998.1
2134 1 1 229640.1
1473 4 3 162163.6
868 1 1 103118.3
2523 5 1 265672.3
701 3 1 92127.5
2931 4 1 306599
2036 5 2 213141.3
2430 1 3 258712
2918 3 1 296819
1055 4 1 106907.8
1454 4 2 164808.7
2571 2 3 269317.5
2223 4 2 254637.9
630 5 1 86246.93
2925 4 1 302683.3
2646 4 1 249324.6
1431 2 1 146535.4

Tasks:

1. Load the dataset and display the first few rows to understand its structure.

2. Split the dataset into independent variables (features) and the dependent variable
(target).

3. Implement Linear Regression to predict house prices based on the area.

4. Evaluate the model's performance using metrics like Mean Squared Error (MSE) or
R-squared.

5. Implement Multiple Linear Regression to predict house prices based on the area,
number of bedrooms, and number of bathrooms.

6. Evaluate the model's performance using the same metrics as in step 4.

Algorithm, Code and Output graph:

Algorithm:

1. Load Dataset:
Read the dataset containing details about houses.

2. Display Dataset Information:


Show the first few rows of the dataset to understand the structure including
column names and data types.

3. Split Dataset:


• Split the dataset into independent variables (features) and the
dependent variable (target).
• Independent variables: 'Area', 'Number of bedrooms', 'Number of
bathrooms'
• Dependent variable: 'Price'

4. Implement Linear Regression:


• Use the 'Area' feature as the independent variable to predict house
prices.
• Split the dataset into training and testing sets.
• Train a Linear Regression model on the training data.
• Predict house prices using the test data.

5. Evaluate Linear Regression Model:


Calculate metrics like Mean Squared Error (MSE) or R-squared to analyze the
performance of the Linear Regression model.

6. Implement Multiple Linear Regression:


• Use multiple features ('Area', 'Number of bedrooms', 'Number of
bathrooms') as independent variables to predict house prices.
• Separate the dataset into training and testing sets.
• Train a Multiple Linear Regression model on the training data.
• Predict house prices using the test data.

7. Evaluate Multiple Linear Regression Model:


Calculate metrics like Mean Squared Error (MSE) or R-squared to evaluate
the performance of the Multiple Linear Regression model.

8. Display Model Performance:


Display the performance metrics (MSE, R-squared) for both Linear
Regression and Multiple Linear Regression models.

Code and Output:

1. Load the dataset and display the first few rows to understand its structure.

CODE:

import pandas as pd

import numpy as np
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error, r2_score

# Load the dataset (raw string so the backslashes in the Windows path are not treated as escapes)

df = pd.read_csv(r"C:\ML Dataset(Umang)\fictional_house_data.csv")

print(df.head())

OUTPUT:

2. Split the dataset into independent variables (features) and the dependent variable
(target).

CODE:

X = df[['Area']]

y = df['Price']

3. Implement Linear Regression to predict house prices based on the area.

CODE:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lin_reg = LinearRegression()

lin_reg.fit(X_train, y_train)

OUTPUT:
CODE:

# Predict house prices


y_pred = lin_reg.predict(X_test)

4. Evaluate the model's performance using metrics like Mean Squared Error (MSE) or
R-squared.

CODE:

# Evaluate the model's performance

mse = mean_squared_error(y_test, y_pred)

r2 = r2_score(y_test, y_pred)

print("Linear Regression Performance:")

print("Mean Squared Error (MSE):", mse)

print("R-squared:", r2)

OUTPUT:
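For reference, both metrics can be written out by hand: MSE = mean((y − ŷ)²) and R² = 1 − SS_res / SS_tot. A sketch with a few actual prices from the table and hypothetical predictions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([222685.2, 154559.3, 137714.0, 247419.8])  # actual prices
y_pred = np.array([220000.0, 160000.0, 140000.0, 245000.0])  # hypothetical predictions

# MSE: the mean of the squared residuals.
mse_manual = np.mean((y_true - y_pred) ** 2)

# R^2: 1 - (residual sum of squares / total sum of squares).
r2_manual = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# The hand-rolled values match scikit-learn's metrics.
assert np.isclose(mse_manual, mean_squared_error(y_true, y_pred))
assert np.isclose(r2_manual, r2_score(y_true, y_pred))
```

An R² near 1 means the model explains most of the variance in price; MSE is in squared dollars, so its square root (RMSE) is often easier to interpret.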

5. Implement Multiple Linear Regression to predict house prices based on the area,
number of bedrooms, and number of bathrooms.

CODE:

X = df[['Area', 'Bedrooms', 'Bathrooms']]


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

multiple_lin_reg = LinearRegression()

multiple_lin_reg.fit(X_train, y_train)

OUTPUT:

CODE:

# Predict house prices

y_pred_multiple = multiple_lin_reg.predict(X_test)

6. Evaluate the model's performance using the same metrics as in step 4.

CODE:

mse_multiple = mean_squared_error(y_test, y_pred_multiple)

r2_multiple = r2_score(y_test, y_pred_multiple)

print("Multiple Linear Regression Performance:")

print("Mean Squared Error (MSE):", mse_multiple)

print("R-squared:", r2_multiple)

OUTPUT:
Outcome:

This lab demonstrates the implementation of Linear and Multiple Linear Regression on a
dataset containing house-related information using Python. It includes splitting the dataset,
fitting the regression models, and evaluating their performance using Mean Squared Error
(MSE) and R-squared metrics.

Lab 4: Logistic Regression Analysis


Objective:

The objective of this lab is to implement Logistic Regression Analysis on a given dataset
using Python. Logistic Regression is a powerful technique used for binary classification
tasks, where the target variable has two possible outcomes.
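Concretely, logistic regression models the probability of the positive class as σ(z) = 1 / (1 + e^(−z)), where z is a linear combination of the features. A minimal sketch of the squashing function:

```python
import numpy as np

def sigmoid(z):
    """Squash a linear score z into a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))            # 0.5  (the decision boundary)
print(round(sigmoid(4.0), 3))  # 0.982 (a strongly positive score)
```

Scores above 0 map to probabilities above 0.5, which is the default threshold LogisticRegression uses when predicting the approved/not-approved label.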

Dataset:

The dataset for this assignment is a fictional dataset containing information about bank
customers. It includes the following columns:

• Age

• Income

• Credit Score

• Loan Status (1 for approved, 0 for not approved)

Age  Income  Credit Score  Loan Status
62 136939 394 1
65 24420 526 1
71 131366 569 0
18 125412 596 0
21 102991 628 1
77 82079 319 1
21 99860 710 0
57 123555 750 1
27 147948 548 1
37 27012 480 1
39 29396 623 1
68 132616 560 1
54 23918 439 1
41 29359 761 0
24 136372 828 0
42 64259 452 1
42 43482 449 0
30 35127 410 1
76 129263 325 0
19 121261 764 0
56 57237 417 1
57 99701 745 1
41 28752 528 1
64 148305 807 1
42 128101 421 1
35 100041 626 0
55 91331 769 1
43 70624 587 1
31 109183 825 1
26 60133 452 0
27 113790 341 0
38 126752 574 1
69 75153 738 1
34 122066 507 0
69 82756 466 1
23 144834 411 1
33 110928 649 1
65 101757 429 0
18 104355 516 0
36 119938 681 0
53 68682 324 1
42 86509 367 0
67 106384 534 0
69 95751 591 0
47 96693 332 0
37 44777 311 0
37 145311 512 0
32 33824 438 1
57 148906 721 1
50 22418 788 1
19 32843 839 0
27 98778 517 0
75 56223 607 1
50 140855 474 1
49 140507 448 1
28 81570 841 0
70 26521 847 1
41 108162 693 0
53 76894 373 1
29 94659 597 0
68 116990 694 1
73 149147 839 1
46 128483 729 1
52 141426 499 0
18 34254 723 1
18 71939 361 0
54 109236 641 0
71 57073 344 1
23 137298 388 0
56 123835 333 1
58 43310 532 0
70 51785 336 0
35 84570 759 1
33 100577 590 0
22 76356 497 0
59 43306 554 0
60 56950 380 0
76 133707 746 0
49 98696 436 0
74 112171 489 0
19 133410 429 0
19 118611 509 0
57 98928 668 1
59 99904 591 0
75 90838 676 1
53 53920 647 1
56 108870 468 0
73 126771 476 1
29 123598 837 1
64 132170 623 1
36 131987 808 0
45 135896 414 0
18 120796 329 1
32 112648 589 1
53 134253 446 1
71 132926 573 1
30 99128 521 0
75 110749 640 1
60 53538 809 0
38 101411 814 1

Tasks:

1. Load the dataset and display the first few rows to understand its structure.

2. Split the dataset into independent variables (features) and the dependent variable
(target).

3. Implement Logistic Regression to predict loan approval status based on age, income,
and credit score.

4. Evaluate the model's performance using metrics like accuracy, precision, recall, and
F1-score.
5. Optionally, visualize the ROC curve.

Algorithm, Code and Output graphs:

Algorithm:

1. Load Dataset:
Read the dataset containing information about bank customers.

2. Display Dataset Information:


Display the first few rows of the dataset to understand its structure, including column
names and data types.

3. Split Dataset:
• Separate the dataset into independent variables (features) and the dependent
variable (target).
• Independent variables: 'Age', 'Income', 'Credit Score'
• Dependent variable: 'Loan Status'

4. Implement Logistic Regression:


• Use the independent variables (features) to predict the loan approval status
based on age, income, and credit score.
• Split the dataset into training and testing sets.
• Train a Logistic Regression model on the training data.
• Predict loan approval status using the test data.

5. Evaluate Model's Performance:


• Calculate metrics like accuracy, precision, recall, and F1-score to evaluate the
performance of the Logistic Regression model.
• Visualize the ROC curve to assess the model's performance.

Code and Output:

1. Load the dataset and display the first few rows to understand its structure.

CODE:

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_curve, roc_auc_score

import matplotlib.pyplot as plt

# Load the dataset

df = pd.read_csv(r"C:\ML Dataset(Umang)\fictional_bank_customer_data.csv")

#Display the first few rows

print(df.head())

OUTPUT:

2. Split the dataset into independent variables (features) and the dependent variable
(target).

CODE:

X = df[['Age', 'Income', 'Credit Score']]

y = df['Loan Status']

3. Implement Logistic Regression to predict loan approval status based on age, income,
and credit score.

CODE:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

logreg_model = LogisticRegression()

logreg_model.fit(X_train, y_train)
OUTPUT:

CODE:

# Predict loan approval status using the test data

y_pred = logreg_model.predict(X_test)

4. Evaluate the model's performance using metrics like accuracy, precision, recall, and
F1-score.

CODE:

accuracy = accuracy_score(y_test, y_pred)

precision = precision_score(y_test, y_pred)

recall = recall_score(y_test, y_pred)

f1 = f1_score(y_test, y_pred)

print("Model Performance Metrics:")

print("Accuracy:", accuracy)

print("Precision:", precision)

print("Recall:", recall)

print("F1-score:", f1)

OUTPUT:
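All four metrics derive from the confusion matrix: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 is their harmonic mean. A sketch with hypothetical true labels and predictions:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # hypothetical loan statuses
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # hypothetical model predictions

# confusion_matrix returns [[TN, FP], [FN, TP]] for binary labels.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # 3 / 4 = 0.75
recall = tp / (tp + fn)      # 3 / 4 = 0.75
f1 = 2 * precision * recall / (precision + recall)

# The hand-derived values match scikit-learn's implementations.
assert precision == precision_score(y_true, y_pred)
assert recall == recall_score(y_true, y_pred)
```

Precision penalizes approving loans that should be rejected (false positives), while recall penalizes rejecting loans that should be approved (false negatives); F1 balances the two.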
5. Optionally, visualize the ROC curve.

CODE:

y_prob = logreg_model.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, y_prob)

plt.figure(figsize=(8, 6))

plt.plot(fpr, tpr, color='blue', lw=2,
         label='ROC curve (AUC = {:.2f})'.format(roc_auc_score(y_test, y_prob)))

plt.plot([0, 1], [0, 1], color='red', linestyle='--', lw=2)

plt.xlim([0.0, 1.0])

plt.ylim([0.0, 1.05])

plt.xlabel('False Positive Rate')

plt.ylabel('True Positive Rate')

plt.title('Receiver Operating Characteristic (ROC) Curve')

plt.legend(loc="lower right")

plt.show()

OUTPUT:
Outcome:

This solution demonstrates the implementation of Logistic Regression Analysis on a dataset


containing bank customer information using Python. It includes splitting the dataset, fitting
the logistic regression model, evaluating its performance using accuracy, precision, recall,
and F1-score metrics, and optionally visualizing the ROC curve.

Lab 5: Decision Tree


Objective:

The objective of this lab is to implement Decision Tree algorithm on a given dataset using
Python. Decision Trees are powerful machine learning models used for both classification
and regression tasks.

Dataset:

The dataset for this assignment is a fictional dataset containing information about patients. It
includes the following columns:

• Age
• Blood Pressure

• Cholesterol Level

• Heart Disease (1 for presence, 0 for absence)

Age  Blood Pressure  Cholesterol Level  Heart Disease
62 155 337 1
65 148 239 0
71 86 186 1
18 148 209 0
21 127 284 1
77 83 116 1
21 156 252 0
57 132 249 0
27 158 210 1
37 95 125 1
39 100 217 0
68 179 183 1
54 138 261 1
41 103 328 1
24 159 351 1
42 93 221 1
42 165 387 1
30 128 113 0
76 129 284 0
19 149 252 1
56 121 179 0
57 115 141 1
41 144 374 0
64 175 140 0
42 149 307 1
35 174 367 1
55 80 266 1
43 130 211 1
31 116 229 1
26 114 323 0
27 128 316 1
38 173 124 1
69 83 167 1
34 178 359 0
69 122 334 0
23 157 304 1
33 101 391 1
65 153 314 0
18 80 289 1
36 90 297 1
53 123 315 1
42 138 143 0
67 103 132 0
69 139 111 0
47 82 204 1
37 178 312 0
37 142 238 0
32 115 282 0
57 174 225 0
50 147 256 0
19 162 211 0
27 126 358 1
75 179 127 1
50 100 317 1
49 161 251 0
28 130 274 0
70 107 248 1
41 94 129 1
53 121 167 1
29 138 135 0
68 145 395 1
73 116 173 0
46 90 397 1
52 166 318 0
18 123 359 1
18 91 387 1
54 82 365 1
71 131 127 0
23 160 299 1
56 112 161 0
58 134 144 0
70 80 390 1
35 118 188 0
33 99 133 1
22 126 233 0
59 122 332 0
60 136 355 1
76 140 136 0
49 157 356 0
74 110 390 0
19 104 297 0
19 82 354 0
57 83 180 0
59 174 236 0
75 178 289 0
53 93 229 0
56 120 309 1
73 152 391 0
29 99 268 1
64 175 392 1
36 152 276 0
45 106 125 1
18 146 391 1
32 132 214 1
53 147 386 0
71 141 129 0
30 94 341 1
75 176 389 1
60 84 246 1
38 147 373 1

Tasks:

1. Load the dataset and display the first few rows to understand its structure.

2. Split the dataset into independent variables (features) and the dependent variable
(target).

3. Implement Decision Tree Classifier to predict the presence of heart disease based on
age, blood pressure, and cholesterol level.

4. Evaluate the model's performance using metrics like accuracy, precision, recall, and
F1-score.

5. Optionally, visualize the decision tree.

Algorithm, Code and Output graphs:

Algorithm:

1. Load Dataset:
Read the dataset containing information about patients.

2. Display Dataset Information:


Display the first few rows of the dataset to understand its structure, including column
names and data types.

3. Split Dataset:
• Separate the dataset into independent variables (features) and the dependent
variable (target).
• Independent variables: 'Age', 'Blood Pressure', 'Cholesterol Level'
• Dependent variable: 'Heart Disease'

4. Implement Decision Tree Classifier:


• Use the independent variables (features) to predict the presence of heart
disease based on age, blood pressure, and cholesterol level.
• Split the dataset into training and testing sets.
• Train a Decision Tree Classifier on the training data.
• Predict the presence of heart disease using the test data.

5. Evaluate Model's Performance:


• Calculate metrics like accuracy, precision, recall, and F1-score to evaluate the
performance of the Decision Tree Classifier model.

6. Optionally, Visualize the Decision Tree:


• Visualize the decision tree to understand how the model makes decisions.
• Plot the decision tree using libraries like scikit-learn's plot_tree function.
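The metrics in step 5 follow directly from confusion-matrix counts and can be checked by hand. A minimal sketch with hypothetical counts (not results from this lab):

```python
# Hand-computing accuracy, precision, recall, and F1 from a toy
# confusion matrix (made-up counts, not from the patient dataset).
tp, fp, fn, tn = 8, 2, 1, 9                    # true/false positives/negatives

accuracy = (tp + tn) / (tp + fp + fn + tn)     # fraction of all correct predictions
precision = tp / (tp + fp)                     # of predicted positives, how many are real
recall = tp / (tp + fn)                        # of real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, precision, recall, f1)
```

The same formulas are what `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` compute from `y_test` and `y_pred` in the code below.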

Code and Output:

1. Load the dataset and display the first few rows to understand its structure.

CODE:

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.tree import DecisionTreeClassifier, plot_tree

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

import matplotlib.pyplot as plt

# Load the dataset

df = pd.read_csv("C:\ML Dataset(Umang)\fictional_patient_data.csv")

# Display the first few rows

print(df.head())

OUTPUT:
2. Split the dataset into independent variables (features) and the dependent variable
(target).

CODE:

X = df[['Age', 'Blood Pressure', 'Cholesterol Level']]


y = df['Heart Disease']

3. Implement Decision Tree Classifier to predict the presence of heart disease based on
age, blood pressure, and cholesterol level.

CODE:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

dt_classifier = DecisionTreeClassifier()

dt_classifier.fit(X_train, y_train)

OUTPUT:

CODE:

# Predict the presence of heart disease using the test data


y_pred = dt_classifier.predict(X_test)
4. Evaluate the model's performance using metrics like accuracy, precision, recall, and
F1-score.

CODE:

accuracy = accuracy_score(y_test, y_pred)

precision = precision_score(y_test, y_pred)

recall = recall_score(y_test, y_pred)

f1 = f1_score(y_test, y_pred)

print("Model Performance Metrics:")

print("Accuracy:", accuracy)

print("Precision:", precision)

print("Recall:", recall)

print("F1-score:", f1)

OUTPUT:

5. Optionally, visualize the decision tree.


CODE:

plt.figure(figsize=(12, 8))

plot_tree(dt_classifier, feature_names=X.columns.tolist(), class_names=['No Heart Disease', 'Heart Disease'], filled=True)

plt.show()

OUTPUT:
Outcome:

This lab demonstrates the implementation of Decision Tree Classifier on a dataset containing
patient information using Python. It includes splitting the dataset, fitting the decision tree
model, evaluating its performance using accuracy, precision, recall, and F1-score metrics, and
optionally visualizing the decision tree.

Lab 6: Naïve Bayes


Objective:

The objective of this lab is to implement the Naïve Bayes algorithm on a given dataset using
Python. Naïve Bayes is a probabilistic classifier based on Bayes' theorem with the
assumption of independence between features.
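Bayes' theorem underlying the classifier can be illustrated with a one-word hand calculation. The prior and likelihoods below are made-up numbers for illustration, not estimates from the lab's email dataset:

```python
# Toy Bayes'-theorem calculation: P(spam | word) from P(word | spam) * P(spam).
# All probabilities here are hypothetical, chosen only for illustration.
p_spam = 0.4                 # prior: 40% of emails are spam
p_word_given_spam = 0.7      # the word appears in 70% of spam emails
p_word_given_ham = 0.1       # ...and in 10% of non-spam emails

# Total probability of seeing the word at all (law of total probability)
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' theorem: posterior probability the email is spam given the word
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))
```

Naïve Bayes extends this to many words by multiplying the per-word likelihoods, which is where the independence assumption enters.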

Dataset:

The dataset for this assignment is a fictional dataset containing information about emails. It
includes the following columns:

• Email Text

• Spam (1 for spam, 0 for not spam)

Email Text Spam


spam word spam spam word spam spam word spam word 0
word spam spam word spam spam word spam spam word 1
word word word word spam word word spam word spam 1
spam word spam spam word word word spam word spam 1
word word word word spam word word spam spam spam 0
spam word word spam word spam word spam spam word 0
word word spam word word spam spam word word word 1
word word word spam word spam spam word spam word 0
spam spam word word word word spam word spam word 1
word word word spam word spam spam spam word spam 0
spam spam word word word word spam spam spam word 0
word word word word spam word word spam spam word 0
word spam word word spam spam spam spam spam spam 1
word word spam spam word spam word spam spam spam 1
word spam word word spam word spam spam word spam 1
spam word word word word spam spam word word word 0
spam word word spam spam spam spam word spam word 0
word word word word word word word spam spam word 1
word word word word spam word spam spam spam word 0
word word spam spam spam word spam word word spam 0
spam word word spam spam spam word word spam word 0
spam word spam spam spam spam word spam word word 1
word spam word spam spam word word spam word word 1
word word spam word word word spam word spam spam 0
spam word spam spam spam word word spam word word 0
word word spam word spam word spam spam spam spam 1
word word spam word word spam spam word spam word 0
spam spam spam word word spam spam word word spam 0
word word word word word spam spam word word word 0
spam word spam spam spam word word spam spam word 0
word spam word spam spam spam word spam word spam 1
word spam word spam spam spam word word word spam 1
spam word word word spam word word spam spam spam 0
word spam spam spam word word spam spam spam word 0
word word spam word word word spam word spam spam 0
word spam word word spam spam spam spam word spam 1
spam spam word spam spam spam spam spam word spam 0
spam spam spam spam word word spam spam word spam 0
spam spam word word spam spam word word spam word 1
word spam spam spam word word spam word spam spam 0
word word word spam word spam spam spam spam spam 0
spam word spam word spam word word spam spam word 0
word spam word spam word word spam word spam word 0
word spam word word word word word word word word 1
spam word spam spam word word word spam spam spam 1
word word word spam word word word word spam spam 0
spam word spam word spam spam spam word word word 1
word word word spam spam word spam word spam word 0
spam spam word spam spam word spam word word word 1
spam word word word spam spam spam spam word word 0
word word spam spam spam spam word spam spam word 0
word word spam word word word spam word spam word 1
word spam word word word spam spam spam spam spam 1
spam word spam spam word spam spam word word word 0
spam word spam word word word spam spam word spam 0
spam word word spam spam word word word spam word 1
word word spam spam spam spam spam word spam spam 0
word word spam word spam word spam word word spam 0
word word spam spam word spam spam spam word spam 1
word word word word spam spam word spam word spam 1
word word word spam word spam spam spam spam spam 1
word word word spam word spam word spam spam word 0
word spam spam spam spam spam word spam word spam 1
spam spam word spam spam spam word word spam word 0
spam word spam word spam word word spam spam spam 1
spam spam word spam word spam spam spam spam spam 0
spam word spam word word spam spam word word spam 0
spam spam spam word word word word spam spam word 0
spam word spam word spam word spam spam spam word 0
spam word word word word spam spam spam spam spam 1
spam word word word word word spam spam word spam 0
spam word spam spam spam word word spam word word 1
word spam word word spam word spam word spam word 0
spam spam word word word word word word word spam 0
word word word word spam spam spam spam word word 1
spam spam word spam spam spam word word spam spam 0
spam word spam spam spam spam word word word spam 0
spam word word spam spam spam word spam spam word 0
word word spam spam word spam spam word spam spam 1
spam word word spam word spam word spam spam spam 1
word word word word spam spam spam word word spam 1
word word word spam word spam spam word word spam 1
spam word spam word word spam word spam spam spam 1
spam word spam spam spam word spam spam spam word 0
spam word word spam word spam spam word spam word 1
word word word word spam word spam spam spam spam 0
spam spam spam word spam spam word word spam spam 0
word word word spam word word word word spam word 1
word spam spam spam word word spam spam spam spam 1
spam spam word spam word word word spam word word 0
spam spam word spam spam word spam spam word word 1
word spam spam spam word spam word spam word spam 0
word spam word word word spam spam spam word spam 1
spam word word word spam spam word word spam word 0
spam word word word word spam word word spam spam 1
word word word spam spam spam word spam word spam 1
spam word spam word spam spam word spam spam word 0
word spam word spam word spam spam spam spam word 1
spam spam word spam spam spam word word word spam 0

Tasks:

1. Load the dataset and display the first few rows to understand its structure.
2. Split the dataset into independent variables (features) and the dependent variable
(target).

3. Preprocess the email text data (e.g., remove punctuation, convert to lowercase,
tokenize).

4. Implement Naïve Bayes Classifier to classify emails as spam or not spam based on
their text content.

5. Evaluate the model's performance using metrics like accuracy, precision, recall, and
F1-score.

Algorithm, Code and Output graphs:

Algorithm:
1. Load Dataset:
Read the dataset containing information about emails.
2. Display Dataset Information:
Display the first few rows of the dataset to understand its structure, including column
names and data types.

3. Split Dataset:
• Separate the dataset into independent variables (features) and the dependent
variable (target).
• Independent variable: 'Email Text'
• Dependent variable: 'Spam'

4. Preprocess Email Text Data:


Preprocess the email text data by removing punctuation, converting to lowercase, and
tokenizing.

5. Implement Naïve Bayes Classifier:


• Use the preprocessed email text data to classify emails as spam or not spam.
• Split the dataset into training and testing sets.
• Train a Naïve Bayes Classifier on the training data.
• Predict spam or not spam using the test data.

6. Evaluate Model's Performance:


Calculate metrics like accuracy, precision, recall, and F1-score to evaluate the
performance of the Naïve Bayes Classifier model.

Code and Output:

1. Load the dataset and display the first few rows to understand its structure.

CODE:

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.naive_bayes import MultinomialNB

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the dataset

df = pd.read_csv("C:\ML Dataset(Umang)\fictional_email_data.csv")
# Display the first few rows

print(df.head())

OUTPUT:

2. Split the dataset into independent variables (features) and the dependent variable
(target).

CODE:

X = df['Email Text']

y = df['Spam']

3. Preprocess the email text data (e.g., remove punctuation, convert to lowercase,
tokenize).

CODE:

vectorizer = CountVectorizer()

X = vectorizer.fit_transform(X)

4. Implement Naïve Bayes Classifier to classify emails as spam or not spam based on
their text content.

CODE:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

nb_classifier = MultinomialNB()

nb_classifier.fit(X_train, y_train)
OUTPUT:

CODE:

# Predict spam or not spam using the test data


y_pred = nb_classifier.predict(X_test)

5. Evaluate the model's performance using metrics like accuracy, precision, recall, and
F1-score.

CODE:

accuracy = accuracy_score(y_test, y_pred)

precision = precision_score(y_test, y_pred)

recall = recall_score(y_test, y_pred)

f1 = f1_score(y_test, y_pred)

print("Model Performance Metrics:")

print("Accuracy:", accuracy)

print("Precision:", precision)

print("Recall:", recall)

print("F1-score:", f1)
OUTPUT:

Outcome:

This lab demonstrates the implementation of the Naïve Bayes Classifier on a dataset containing
email text information using Python. It includes preprocessing the text data, fitting the Naïve
Bayes model, and evaluating its performance using accuracy, precision, recall, and F1-score
metrics.

Lab 7: KNN (k-Nearest Neighbor) Algorithm

Objective:

The objective of this assignment is to implement the k-Nearest Neighbor (KNN) algorithm on
a given dataset using Python. KNN is a simple and intuitive classification algorithm that
classifies data points based on the majority class of their nearest neighbors.
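The majority-vote idea can be sketched from scratch in a few lines. The training points below are hypothetical one-dimensional fruit diameters, not the lab's dataset:

```python
# Minimal k-nearest-neighbour classifier in one dimension: rank training
# points by distance to the query, then take a majority vote among the
# k closest labels. Diameters and labels here are made up.
from collections import Counter

train = [(4.1, "Banana"), (5.0, "Apple"), (5.2, "Apple"),
         (6.8, "Orange"), (7.1, "Orange")]

def knn_predict(x, train, k=3):
    # Sort training points by distance to x and keep the k nearest
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    # Majority vote among their labels
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn_predict(5.1, train))   # the 3 nearest are Apple, Apple, Banana
```

scikit-learn's KNeighborsClassifier, used later in this lab, does the same with Euclidean distance over all feature columns.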

Dataset: The dataset for this assignment is a fictional dataset containing information about
fruits. It includes the following columns:

• Color

• Diameter

• Label (type of fruit)

Color Diameter Label


Red 6.764052 Apple
Green 5.400157 Apple
Green 5.978738 Banana
Yellow 7.240893 Orange
Green 6.867558 Banana
Yellow 4.022722 Banana
Yellow 5.950088 Apple
Green 4.848643 Apple
Yellow 4.896781 Banana
Yellow 5.410599 Orange
Red 5.144044 Apple
Green 6.454274 Orange
Green 5.761038 Orange
Green 5.121675 Banana
Yellow 5.443863 Banana
Green 5.333674 Banana
Green 6.494079 Orange
Green 4.794842 Apple
Yellow 5.313068 Apple
Red 4.145904 Banana
Green 2.44701 Apple
Red 5.653619 Orange
Green 5.864436 Orange
Green 4.257835 Apple
Yellow 7.269755 Orange
Red 3.545634 Orange
Yellow 5.045759 Orange
Red 4.812816 Banana
Yellow 6.532779 Banana
Red 6.469359 Apple
Yellow 5.154947 Apple
Red 5.378163 Apple
Red 4.112214 Orange
Red 3.019204 Orange
Red 4.652088 Banana
Green 5.156349 Banana
Yellow 6.230291 Apple
Red 6.20238 Apple
Red 4.612673 Orange
Green 4.697697 Banana
Yellow 3.951447 Banana
Green 3.579982 Orange
Red 3.29373 Apple
Red 6.950775 Banana
Red 4.490348 Banana
Red 4.561926 Banana
Red 3.747205 Banana
Green 5.77749 Banana
Red 3.386102 Orange
Yellow 4.78726 Orange
Yellow 4.104533 Orange
Yellow 5.386902 Apple
Red 4.489195 Banana
Red 3.819368 Orange
Yellow 4.971818 Banana
Red 5.428332 Banana
Yellow 5.066517 Banana
Yellow 5.302472 Banana
Red 4.365678 Apple
Green 4.637259 Orange
Green 4.32754 Orange
Green 4.640447 Apple
Red 4.186854 Banana
Red 3.273717 Banana
Green 5.177426 Apple
Red 4.598219 Apple
Red 3.369802 Apple
Green 5.462782 Orange
Yellow 4.092702 Banana
Yellow 5.051945 Orange
Red 5.729091 Banana
Yellow 5.128983 Orange
Green 6.139401 Apple
Green 3.765174 Orange
Yellow 5.402342 Apple
Green 4.31519 Banana
Yellow 4.129203 Banana
Yellow 4.42115 Banana
Green 4.688447 Apple
Yellow 5.056165 Apple
Red 3.83485 Apple
Red 5.900826 Apple
Yellow 5.465662 Apple
Green 3.463756 Apple
Green 6.488252 Banana
Yellow 6.895889 Apple
Yellow 6.17878 Orange
Green 4.820075 Orange
Red 3.929247 Banana
Yellow 6.054452 Banana
Red 4.596823 Banana
Yellow 6.222445 Orange
Red 5.208275 Banana
Yellow 5.976639 Banana
Green 5.356366 Apple
Yellow 5.706573 Banana
Green 5.0105 Orange
Red 6.78587 Banana
Green 5.126912 Banana
Green 5.401989 Apple
Tasks:

1. Load the dataset and display the first few rows to understand its structure.

2. Split the dataset into independent variables (features) and the dependent variable
(target).

3. Encode categorical variables using one-hot encoding or label encoding if necessary.

4. Implement KNN Classifier to classify fruits based on color and diameter.

5. Evaluate the model's performance using metrics like accuracy, precision, recall, and
F1-score.

Algorithm, Code and Output graphs:

Algorithm:

1. Load the Dataset:


Read the dataset containing information about fruits.

2. Display Dataset Structure:


Display the first few rows of the dataset to understand its structure, including columns
like Color, Diameter, and Label (type of fruit).

3. Split Dataset:
Separate the dataset into independent variables (features), which include Color and
Diameter, and the dependent variable (target), which is the Label (type of fruit).

4. Encode Categorical Variables:


If necessary, encode categorical variables using techniques like one-hot encoding or
label encoding. In this case, encode the Color column using label encoding to convert
categorical values into numerical representations.

5. Implement KNN Classifier:


Use the K-Nearest Neighbor (KNN) algorithm to classify fruits based on their Color
and Diameter features. Train the KNN classifier using the training data.

6. Evaluate Model's Performance:


Predict fruit labels using the test data. Evaluate the model's performance using
metrics like accuracy, precision, recall, and F1-score.

7. Display Model Performance Metrics:


Print out the calculated performance metrics such as accuracy, precision, recall, and
F1-score to assess the effectiveness of the KNN classifier.
Code and Output:

1. Load the dataset and display the first few rows to understand its structure.

CODE:

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import LabelEncoder

from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the dataset

df = pd.read_csv("C:\ML Dataset(Umang)\fictional_fruit_data.csv")

# Display the first few rows

print(df.head())

OUTPUT:

2. Split the dataset into independent variables (features) and the dependent variable
(target).

CODE:

X = df[['Color', 'Diameter']]

y = df['Label']
3. Encode categorical variables using one-hot encoding or label encoding if necessary.

CODE:

label_encoder = LabelEncoder()
X = X.copy()  # work on a copy to avoid pandas' SettingWithCopyWarning
X['Color'] = label_encoder.fit_transform(X['Color'])

4. Implement KNN Classifier to classify fruits based on color and diameter.

CODE:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

knn_classifier = KNeighborsClassifier()

knn_classifier.fit(X_train, y_train)

OUTPUT:

CODE:
# Predict fruit label using the test data
y_pred = knn_classifier.predict(X_test)

5. Evaluate the model's performance using metrics like accuracy, precision, recall, and
F1-score.

CODE:

accuracy = accuracy_score(y_test, y_pred)

precision = precision_score(y_test, y_pred, average='weighted')


recall = recall_score(y_test, y_pred, average='weighted')

f1 = f1_score(y_test, y_pred, average='weighted')

print("Model Performance Metrics:")

print("Accuracy:", accuracy)

print("Precision:", precision)

print("Recall:", recall)

print("F1-score:", f1)

OUTPUT:

Outcome:

This lab demonstrates the implementation of KNN Classifier on a dataset containing fruit
information using Python. It includes splitting the dataset, encoding categorical variables,
fitting the KNN model, and evaluating its performance using accuracy, precision, recall, and F1-
score metrics.

Lab 8: SVM (Support Vector Machine)

Objective:
The objective of this lab is to implement the Support Vector Machine (SVM) algorithm on a
given dataset using Python. SVM is a powerful supervised learning algorithm used for both
classification and regression tasks.
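The core idea, finding a maximum-margin separating hyperplane, can be seen on a tiny made-up 2-D example (the points below are not the lab's flower data):

```python
# A minimal linearly separable example of what an SVM learns: a
# hyperplane separating two clusters. Points are hypothetical.
from sklearn.svm import SVC

X = [[1, 1], [2, 1], [1, 2],      # class 0 cluster
     [5, 5], [6, 5], [5, 6]]      # class 1 cluster
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel='linear')
clf.fit(X, y)

print(clf.predict([[1.5, 1.5], [5.5, 5.5]]))   # one point deep in each cluster
print(clf.support_vectors_)                    # the points that define the margin
```

Only the support vectors (the boundary points printed last) determine the fitted hyperplane; moving any interior point would not change the model.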

Dataset: The dataset for this assignment is a fictional dataset containing information about
flowers. It includes the following columns:

• Sepal Length

• Sepal Width

• Petal Length

• Petal Width

• Species (type of flower)

Sepal Length Sepal Width Petal Length Petal Width Species
7.304163 3.859755 3.11024 0.20704 Versicolor
6.17213 2.470464 3.338693 2.460179 Setosa
6.652353 2.503691 5.695401 1.110195 Virginica
7.699941 3.466841 4.913264 0.683065 Versicolor
7.390073 2.545557 4.886631 1.706451 Setosa
5.028859 3.885757 0.914157 0.849853 Versicolor
6.628573 2.872144 3.717186 0.185964 Virginica
5.714374 2.728594 2.461066 0.176495 Versicolor
5.754328 3.876865 4.252667 1.727268 Virginica
6.180797 3.686621 3.587255 1.078724 Virginica
5.959556 3.85305 5.361915 1.098387 Versicolor
7.047047 3.439599 4.318304 2.019085 Virginica
6.471661 2.679673 5.143937 0.343612 Virginica
5.94099 3.871328 2.939102 0.644685 Virginica
6.208406 2.934759 2.097775 0.907491 Virginica
6.11695 3.395056 3.038313 1.271707 Versicolor
7.080086 3.457318 3.730044 1.16795 Virginica
5.669719 2.983346 4.427307 0.981966 Virginica
6.099846 3.314054 7.736384 1.153164 Virginica
5.131101 3.446549 3.685627 1.118448 Virginica
3.721018 3.211863 2.077537 0.653101 Setosa
6.382503 2.577258 3.151072 0.582125 Versicolor
6.557482 3.178242 2.944071 1.408632 Versicolor
5.224003 3.620346 4.607407 0.522905 Setosa
7.723896 2.751336 1.048197 0.32041 Versicolor
4.632876 2.985657 3.871341 0.962658 Setosa
5.87798 2.862884 4.035452 1.080173 Versicolor
5.684637 3.845183 4.168639 2.91511 Versicolor
7.112207 3.339087 2.708724 0.664428 Virginica
7.059568 3.225209 3.341258 1.916878 Setosa
5.968606 2.718936 1.253653 1.767863 Setosa
6.153875 3.281877 2.891757 0.296402 Versicolor
5.103138 2.760037 2.804564 1.787672 Setosa
4.195939 3.063687 4.492248 0.300251 Virginica
5.551233 2.776586 1.725119 -0.82097 Versicolor
5.96977 3.340866 5.134909 1.660803 Versicolor
6.861141 3.297934 6.390293 -0.13448 Virginica
6.837975 2.960432 0.116826 1.54271 Setosa
5.518519 3.220283 4.510215 0.680152 Setosa
5.589089 2.579984 4.951358 2.461259 Versicolor
4.969701 2.408759 2.638111 2.012067 Versicolor
4.661385 3.238938 3.060802 0.855427 Virginica
4.423796 3.12167 3.52613 0.677243 Setosa
7.459144 3.323064 3.235888 0.277301 Setosa
5.416989 4.074752 3.216137 0.864899 Setosa
5.476398 3.456126 0.810233 0.98693 Virginica
4.80018 2.657486 5.788104 0.922833 Virginica
6.485317 3.530317 5.660129 1.319095 Versicolor
4.500465 2.48416 2.328479 1.639676 Versicolor
5.663426 2.851519 1.179093 1.465737 Versicolor
5.096763 3.020656 4.677074 0.619251 Versicolor
6.161129 3.786737 2.746613 0.107278 Setosa
5.416032 2.729755 4.009838 2.237044 Setosa
4.860075 2.694631 3.197982 0.676019 Setosa
5.816609 3.007665 4.977108 0.704257 Setosa
6.195515 2.764704 4.982758 0.803896 Virginica
5.895209 3.534453 2.482949 -0.20073 Virginica
6.091052 2.585629 1.325279 0.83674 Setosa
5.313513 2.556588 0.974028 0.835462 Versicolor
5.538925 2.861737 4.834268 1.671472 Setosa
5.281858 2.835846 1.667608 1.730827 Versicolor
5.541571 3.879699 2.868003 1.202866 Virginica
5.165089 3.458251 2.710487 1.908205 Virginica
4.407185 3.087647 3.667482 1.458373 Versicolor
5.987264 2.523063 0.352148 1.188082 Setosa
5.506522 3.413076 4.09225 1.322305 Setosa
4.486935 2.619907 4.682048 1.055103 Versicolor
6.224109 2.385748 3.915623 0.899914 Setosa
5.086942 3.560853 3.21284 0.996523 Setosa
5.883115 3.186285 3.931424 0.342711 Virginica
6.445145 3.445969 4.462322 1.413136 Versicolor
5.947056 3.187053 -1.11976 0.445226 Setosa
6.785703 3.418437 7.202406 1.83964 Virginica
4.815095 2.770059 4.446564 1.010411 Setosa
6.173944 2.605276 2.611761 1.237616 Setosa
5.271608 3.343086 3.071922 1.575316 Versicolor
5.117238 2.704534 4.628986 1.688919 Setosa
5.359555 2.753494 3.555657 0.006326 Virginica
5.581411 2.854121 0.185995 1.042753 Setosa
5.886617 3.057516 7.393507 1.868936 Versicolor
4.872926 2.897783 3.565448 -0.09056 Setosa
6.587686 2.458771 5.555504 1.494333 Setosa
6.2265 2.773244 2.541992 -0.51423 Versicolor
4.564918 2.093937 6.464024 0.422895 Setosa
7.075249 3.31885 4.263965 1.229359 Setosa
7.413588 2.361115 4.831565 -0.0591 Versicolor
6.818387 2.575115 1.920354 0.451012 Setosa
5.690662 3.072431 5.891616 0.081405 Virginica
4.951275 2.731988 4.97408 2.452583 Setosa
6.715195 3.713496 6.051249 1.324813 Setosa
5.505363 2.494072 2.654566 1.631141 Versicolor
6.854629 3.164832 2.913392 1.030767 Versicolor
6.012868 3.033108 7.814893 0.931392 Versicolor
6.65061 2.54772 1.894372 -0.02852 Versicolor
6.135784 3.275009 3.520729 0.978204 Setosa
6.426456 2.976235 5.760929 0.621266 Virginica
5.848715 3.38187 3.931996 1.852022 Virginica
7.322273 3.404107 4.785998 2.067237 Virginica
5.945337 3.980191 3.05697 2.3146 Versicolor
6.173651 3.624707 4.411298 1.847939 Setosa

Tasks:

1. Load the dataset and display the first few rows to understand its structure.

2. Split the dataset into independent variables (features) and the dependent variable
(target).

3. Encode categorical variables using one-hot encoding or label encoding if necessary.

4. Implement SVM Classifier to classify flowers based on sepal and petal characteristics.

5. Evaluate the model's performance using metrics like accuracy, precision, recall, and
F1-score.

Algorithm, Code and Output graphs:

Algorithm:

1. Load Dataset:
Read the dataset containing information about flowers.
2. Display Dataset Information:
Display the first few rows of the dataset to understand its structure, including column
names and data types.
3. Split Dataset:
• Separate the dataset into independent variables (features) and the dependent
variable (target).
• Independent variables: 'Sepal Length', 'Sepal Width', 'Petal Length', 'Petal
Width'
• Dependent variable: 'Species'
4. Encode Categorical Variables (if necessary):
• If the 'Species' column is categorical and not numerical, encode it using one-
hot encoding or label encoding.
5. Implement SVM Classifier:
• Use the independent variables (features) to predict the flower species using
Support Vector Machine Classifier.
• Split the dataset into training and testing sets.
• Train the SVM Classifier on the training data.
• Predict flower species using the test data.
6. Evaluate Model's Performance:
• Calculate metrics like accuracy, precision, recall, and F1-score to evaluate the
performance of the SVM Classifier.
• Display the evaluation metrics to assess how well the model performs in
classifying flowers based on sepal and petal characteristics.

Code and Output:

1. Load the dataset and display the first few rows to understand its structure.

CODE:

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import LabelEncoder

from sklearn.svm import SVC

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the dataset

df = pd.read_csv("C:\ML Dataset(Umang)\fictional_flower_data.csv")
# Display the first few rows

print(df.head())

OUTPUT:

2. Split the dataset into independent variables (features) and the dependent variable
(target).

CODE:

X = df.drop(columns=['Species'])
y = df['Species']

3. Encode categorical variables using one-hot encoding or label encoding if necessary.

CODE:

label_encoder = LabelEncoder()

y = label_encoder.fit_transform(y)

4. Implement SVM Classifier to classify flowers based on sepal and petal characteristics.

CODE:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

svm_classifier = SVC(kernel='linear')

svm_classifier.fit(X_train, y_train)
OUTPUT:

CODE:

# Predict flower species using the test data
y_pred = svm_classifier.predict(X_test)

5. Evaluate the model's performance using metrics like accuracy, precision, recall, and
F1-score.

CODE:

accuracy = accuracy_score(y_test, y_pred)

precision = precision_score(y_test, y_pred, average='weighted')

recall = recall_score(y_test, y_pred, average='weighted')

f1 = f1_score(y_test, y_pred, average='weighted')

print("Model Performance Metrics:")

print("Accuracy:", accuracy)

print("Precision:", precision)

print("Recall:", recall)

print("F1-score:", f1)

OUTPUT:
Outcome:

This lab demonstrates the implementation of SVM Classifier on a dataset containing flower
information using Python. It includes splitting the dataset, encoding categorical variables,
fitting the SVM model, and evaluating its performance using accuracy, precision, recall, and
F1-score metrics.

Lab 9: Neural Network and Multi-layer Perceptron

Objective:

The objective of this lab is to implement a Neural Network (NN) and Multi-layer Perceptron
(MLP) on a given dataset using Python. NN and MLP are powerful deep learning models
used for various machine learning tasks.
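The forward pass of an MLP is just alternating weighted sums and non-linear activations. A minimal sketch with hypothetical fixed weights (no training involved):

```python
# One forward pass through a tiny multi-layer perceptron. All weights,
# biases, and inputs here are made-up numbers for illustration.
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

x = [0.5, -1.0]                        # two input features

# Hidden layer: two neurons, each a weighted sum + bias, then sigmoid
w_hidden = [[0.4, 0.3], [-0.6, 0.9]]   # hypothetical weights
b_hidden = [0.1, -0.2]
h = [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
     for w, b in zip(w_hidden, b_hidden)]

# Output neuron: another weighted sum + sigmoid -> value in (0, 1),
# interpretable as the probability of loan approval
w_out, b_out = [0.7, -0.5], 0.05
y = sigmoid(sum(wi * hi for wi, hi in zip(w_out, h)) + b_out)
print(round(y, 3))
```

Training (as done by MLPClassifier below) means adjusting these weights by backpropagation so the output matches the known labels.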

Dataset: The dataset for this assignment is a fictional dataset containing information about
bank customers. It includes the following columns:

• Age

• Income

• Credit Score

• Loan Status (1 for approved, 0 for not approved)

Age Income Credit Score Loan Status
62 57237 721 1
65 99701 788 1
18 28752 839 1
21 91331 517 1
21 70624 607 0
57 60133 474 1
27 75153 448 0
37 82756 841 1
39 68682 847 0
68 86509 693 0
54 95751 373 0
41 96693 597 0
24 44777 694 1
42 33824 839 1
42 22418 729 1
30 32843 499 1
19 98778 723 1
56 56223 361 0
57 81570 641 1
41 26521 344 0
64 76894 388 1
42 94659 333 0
35 34254 532 1
55 71939 336 0
43 57073 759 1
31 43310 590 1
26 51785 497 0
27 84570 554 0
38 76356 380 0
69 43306 746 1
34 56950 436 1
69 98696 489 1
23 98928 429 0
33 99904 509 0
65 90838 668 1
18 53920 591 0
36 99128 676 0
53 53538 647 1
42 28286 468 0
67 38728 476 0
69 33640 837 0
47 88627 623 1
37 65663 808 0
37 23912 414 1
32 70586 329 0
57 92130 589 1
50 41752 446 0
19 79715 573 1
27 27997 521 1
50 73006 640 0
49 60800 809 0
28 35620 814 1
41 90381 369 0
53 74268 657 1
29 89069 622 0
68 88473 391 1
46 62512 641 0
52 39608 723 0
18 90557 714 0
18 37340 574 0
54 21913 343 1
23 72086 383 1
56 33429 733 0
58 33907 393 1
35 55489 474 0
33 30088 705 1
22 35588 501 1
59 58395 798 0
60 80155 780 1
49 58214 629 0
19 20469 840 0
19 63295 509 0
57 94253 316 0
59 54488 406 0
53 58040 464 1
56 36975 836 1
29 35912 491 0
64 48086 495 0
36 35115 607 0
45 99983 436 0
18 72573 393 1
32 84223 387 1
53 31052 460 0
30 35741 447 0
60 77368 755 1
38 63986 387 0
29 25103 313 1
22 55050 614 1
24 21740 420 1
22 75270 581 1
65 32579 726 1
21 91382 770 1
30 67805 584 1
54 40165 576 1
58 22775 624 1
32 50752 834 1
33 99464 527 1
38 91892 817 0
53 53930 556 0
41 66774 436 1

Tasks:

1. Load the dataset and display the first few rows to understand its structure.

2. Split the dataset into independent variables (features) and the dependent variable
(target).
3. Preprocess the data if necessary (e.g., scaling).

4. Implement a Neural Network model to predict loan approval status based on customer
information.

5. Evaluate the model's performance using metrics like accuracy, precision, recall, and
F1-score.

Algorithm:

1. Load Dataset:
Read the dataset containing information about bank customers.

2. Display Dataset Information:


Display the first few rows of the dataset to understand its structure, including column
names and data types.

3. Split Dataset:
• Separate the dataset into independent variables (features) and the dependent
variable (target).
• Independent variables: 'Age', 'Income', 'Credit Score'
• Dependent variable: 'Loan Status'

4. Preprocess the Data (if necessary):

• Check if data preprocessing is required, such as scaling numerical features.


5. Implement Neural Network Model:
• Use the independent variables (features) to predict the loan approval status based on
customer information.
• Split the dataset into training and testing sets.
• Define and compile a Neural Network model using libraries like TensorFlow or
Keras.
• Train the Neural Network model on the training data.
• Predict loan approval status using the test data.

6. Evaluate Model's Performance:


• Calculate metrics like accuracy, precision, recall, and F1-score to evaluate the
performance of the Neural Network model.
• Display the evaluation metrics to assess how well the model predicts loan approval
status based on customer information.

Code and Output graphs:

1. Load the dataset and display the first few rows to understand its structure.

CODE:
import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler

from sklearn.neural_network import MLPClassifier

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

#Load the dataset

df = pd.read_csv("C:\ML Dataset(Umang)\fictional_bank_customer_data.csv")

# Display the first few rows

print(df.head())

OUTPUT:

2. Split the dataset into independent variables (features) and the dependent variable
(target).

CODE:

X = df.drop(columns=['Loan Status'])

y = df['Loan Status']

3. Preprocess the data if necessary (e.g., scaling).

CODE:

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

4. Implement a Neural Network model to predict loan approval status based on customer
information.

CODE:

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

mlp_classifier = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000)

mlp_classifier.fit(X_train, y_train)

OUTPUT:

CODE:
y_pred = mlp_classifier.predict(X_test)

5. Evaluate the model's performance using metrics like accuracy, precision, recall, and
F1-score.

CODE:

accuracy = accuracy_score(y_test, y_pred)

precision = precision_score(y_test, y_pred)

recall = recall_score(y_test, y_pred)

f1 = f1_score(y_test, y_pred)

print("\nModel Performance Metrics:")

print("Accuracy:", accuracy)

print("Precision:", precision)
print("Recall:", recall)

print("F1-score:", f1)

OUTPUT:

Outcome:

This lab demonstrates the implementation of a Neural Network (Multi-layer Perceptron) on a
dataset containing bank customer information using Python. It includes preprocessing the
data, fitting the MLP model, and evaluating its performance using accuracy, precision, recall,
and F1-score metrics.

Lab 10: k-Means Clustering


Objective:

The objective of this assignment is to implement the k-Means Clustering algorithm on a given dataset using Python. k-Means is a popular unsupervised learning algorithm used for clustering data into k distinct groups.

Dataset: The dataset for this assignment is a fictional dataset containing information about
customers. It includes the following columns:

• Customer ID

• Age

• Income

Customer ID Age Income
1 62 136939
2 65 24420
3 71 131366
4 18 125412
5 21 102991
6 77 82079
7 21 99860
8 57 123555
9 27 147948
10 37 27012
11 39 29396
12 68 132616
13 54 23918
14 41 29359
15 24 136372
16 42 64259
17 42 43482
18 30 35127
19 76 129263
20 19 121261
21 56 57237
22 57 99701
23 41 28752
24 64 148305
25 42 128101
26 35 100041
27 55 91331
28 43 70624
29 31 109183
30 26 60133
31 27 113790
32 38 126752
33 69 75153
34 34 122066
35 69 82756
36 23 144834
37 33 110928
38 65 101757
39 18 104355
40 36 119938
41 53 68682
42 42 86509
43 67 106384
44 69 95751
45 47 96693
46 37 44777
47 37 145311
48 32 33824
49 57 148906
50 50 22418
51 19 32843
52 27 98778
53 75 56223
54 50 140855
55 49 140507
56 28 81570
57 70 26521
58 41 108162
59 53 76894
60 29 94659
61 68 116990
62 73 149147
63 46 128483
64 52 141426
65 18 34254
66 18 71939
67 54 109236
68 71 57073
69 23 137298
70 56 123835
71 58 43310
72 70 51785
73 35 84570
74 33 100577
75 22 76356
76 59 43306
77 60 56950
78 76 133707
79 49 98696
80 74 112171
81 19 133410
82 19 118611
83 57 98928
84 59 99904
85 75 90838
86 53 53920
87 56 108870
88 73 126771
89 29 123598
90 64 132170
91 36 131987
92 45 135896
93 18 120796
94 32 112648
95 53 134253
96 71 132926
97 30 99128
98 75 110749
99 60 53538
100 38 101411

Tasks:

1. Load the dataset and display the first few rows to understand its structure.

2. Preprocess the data if necessary (e.g., scaling).

3. Implement the k-Means Clustering algorithm to cluster customers based on their age
and income.

4. Determine the optimal number of clusters using the Elbow Method or Silhouette
Score.

5. Visualize the clusters in a scatter plot.

Algorithm, Code and Output graphs:

Algorithm:

1. Load Dataset:
Read the dataset containing information about customers.

2. Display Dataset Information:
Display the first few rows of the dataset to understand its structure, including column names and data types.

3. Preprocess the Data:
Check if data preprocessing is required, such as scaling numerical features.

4. Implement k-Means Clustering:
• Use the 'Age' and 'Income' features to cluster customers based on their characteristics.
• Choose the number of clusters (k) based on the Elbow Method or Silhouette Score.
• Apply the k-Means Clustering algorithm to the data.

5. Determine the Optimal Number of Clusters:
• Use the Elbow Method or Silhouette Score to determine the optimal number of clusters.
• For the Elbow Method, plot the within-cluster sum of squares (WCSS) against the number of clusters and identify the elbow point.
• For the Silhouette Score, calculate the silhouette score for different values of k and choose the one with the highest score.
6. Visualize the Clusters:
• Visualize the clusters in a scatter plot using the 'Age' and 'Income' features.
• Plot each cluster with a different color to distinguish them.

Code and Output:

1. Load the dataset and display the first few rows to understand its structure.

CODE:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Load the dataset (a raw string avoids backslash-escape issues in Windows paths)
df = pd.read_csv(r"C:\ML Dataset(Umang)\clustering.csv")

# Display the first few rows
print(df.head())

OUTPUT:

2. Preprocess the data if necessary (e.g., scaling).


CODE:

scaler = StandardScaler()

scaled_data = scaler.fit_transform(df[['Age', 'Income']])

3. Implement the k-Means Clustering algorithm to cluster customers based on their age
and income.

4. Determine the optimal number of clusters using the Elbow Method or Silhouette
Score.

CODE:

# Implement the k-Means Clustering algorithm
# Determine the optimal number of clusters using the Silhouette Score
silhouette_scores = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=0)
    cluster_labels = kmeans.fit_predict(scaled_data)
    silhouette_avg = silhouette_score(scaled_data, cluster_labels)
    silhouette_scores.append(silhouette_avg)

# Choose the optimal number of clusters (+2 because k starts at 2)
optimal_k = np.argmax(silhouette_scores) + 2

# Fit k-Means with the optimal number of clusters
kmeans = KMeans(n_clusters=optimal_k, random_state=0)
cluster_labels = kmeans.fit_predict(scaled_data)
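The tasks also name the Elbow Method as an alternative to the Silhouette Score. A hedged, self-contained sketch of that method on synthetic data (in practice, the scaled customer features would be substituted for the blobs below): compute WCSS — KMeans exposes it as inertia_ — for each k and look for the bend in the curve.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs standing in for scaled (Age, Income) pairs
data = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in (-3, 0, 3)])

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    wcss.append(km.inertia_)  # within-cluster sum of squares for this k

# The "elbow" is where WCSS stops dropping sharply (here, around k = 3)
for k, w in zip(range(1, 11), wcss):
    print(k, round(w, 2))
```

Plotting wcss against k with matplotlib gives the usual elbow chart; WCSS always decreases as k grows, so the point of diminishing returns, not the minimum, is what identifies a good k.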

5. Visualize the clusters in a scatter plot.

CODE:
plt.figure(figsize=(10, 6))
plt.scatter(scaled_data[:, 0], scaled_data[:, 1], c=cluster_labels, cmap='viridis',
marker='o', edgecolors='k')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='red',
marker='*', label='Centroids')
plt.title('k-Means Clustering')
plt.xlabel('Scaled Age')
plt.ylabel('Scaled Income')
plt.legend()
plt.colorbar(label='Cluster')
plt.grid(True)
plt.show()

OUTPUT:

Outcome:
This solution demonstrates the implementation of k-Means Clustering on a dataset containing customer information using Python. It includes preprocessing the data, determining the optimal number of clusters using the Silhouette Score, and visualizing the clusters in a scatter plot.
