ML Lab FileDhruv
EXPERIMENT FILE
Submitted By:
Name: Dhruv Sharma
Course: BTech CSE (AI&ML Non-Honours)
SAP ID: 500107715
ROLL NO: R2142220916
Batch: 05
Lab 1: Exploratory Data Analysis
Objective:
The objective of this lab is to perform exploratory data analysis (EDA) on a given dataset.
EDA helps in understanding the structure of the data, identifying patterns, and gaining
insights that can be useful for further analysis.
Dataset:
The dataset for this assignment is a fictional dataset containing information about students. It
includes the following columns:
• Student ID
• Gender
• Age
• Study Hours
• Score
Tasks:
1. Load the dataset and display the first few rows to understand its structure.
3. Summarize the statistical properties of numerical variables (mean, median, min, max,
etc.).
ALGORITHM:
1. Load the dataset and display the first few rows to understand its structure.
INPUT:
import pandas as pd
import numpy as np
# Raw string keeps the backslashes literal (otherwise "\f" is read as a form feed)
df = pd.read_csv(r"C:\ML Dataset(Umang)\fictional_student_data.csv")
print(df.head())
OUTPUT:
3. Summarize the statistical properties of numerical variables (mean, median, min, max,
etc.).
INPUT:
# Statistical properties of the numerical columns
print(df.describe())
OUTPUT:
INPUT:
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(12, 6))
plt.subplot(2, 2, 1)
sns.histplot(df['Age'], kde=True)
plt.title('Age Distribution')
plt.subplot(2, 2, 2)
plt.subplot(2, 2, 3)
sns.histplot(df['Score'], kde=True)
plt.title('Score Distribution')
plt.tight_layout()
plt.show()
OUTPUT:
INPUT:
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.subplot(1, 2, 2)
sns.scatterplot(x='Age', y='Score', data=df)
plt.title('Age vs Score')
plt.tight_layout()
plt.show()
OUTPUT:
6. Analyze the distribution of categorical variables using bar plots.
INPUT:
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
sns.countplot(x='Gender', data=df)
plt.title('Gender Distribution')
plt.tight_layout()
plt.show()
OUTPUT:
INPUT:
OUTPUT:
INPUT:
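numeric_df = df.select_dtypes(include=['number'])
plt.figure(figsize=(8, 6))
# Draw the pairwise correlations of the numeric columns as an annotated heatmap
sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()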
OUTPUT:
Outcome:
This lab provides a comprehensive exploration of the dataset through various EDA
techniques in Python. It includes handling missing values, summarizing statistical properties,
visualizing distributions, analyzing relationships between variables, and exploring
correlations.
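The missing-value handling mentioned in the outcome can be sketched as follows. This is a minimal illustration, not the lab's original code: the small DataFrame is a hypothetical stand-in for the student dataset, and filling with the median and the mode is one assumed strategy among several.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the student dataset (values invented for illustration).
df = pd.DataFrame({
    "Student ID": [1, 2, 3, 4, 5],
    "Gender": ["F", "M", "F", None, "F"],
    "Age": [18, 19, np.nan, 21, 20],
    "Study Hours": [2.0, 3.5, 1.0, np.nan, 4.0],
    "Score": [55, 70, 48, 62, 81],
})

# Count missing values per column.
missing_counts = df.isnull().sum()
print(missing_counts)

# Assumed strategy: numeric gaps get the column median, categorical gaps get the mode.
df_clean = df.copy()
for col in ["Age", "Study Hours"]:
    df_clean[col] = df_clean[col].fillna(df_clean[col].median())
df_clean["Gender"] = df_clean["Gender"].fillna(df_clean["Gender"].mode()[0])

# Summary statistics of the cleaned numeric columns.
print(df_clean.describe())
```

Dropping incomplete rows with `dropna()` is an alternative when only a few records are affected.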
The objective of this lab is to perform various data preprocessing activities on a given dataset
using Python. Data preprocessing is essential for cleaning, transforming, and preparing the
data for machine learning models.
Dataset:
The dataset for this assignment is a fictional dataset containing information about customers.
It includes the following columns:
• Customer ID
• Age
• Gender
• Income
• Spending Score
Tasks:
1. Load the dataset and display the first few rows to understand its structure.
5. Split the dataset into independent variables (features) and the dependent variable
(target).
ALGORITHM:
6. Split Dataset:
Separate the dataset into independent variables (features) and the dependent
variable (target). The features are columns such as Age, Gender, and Income,
while the target is the column we want to predict or analyze (e.g.,
Spending Score).
Further split the dataset into training and testing sets. The training set will be
used to train machine learning models, while the testing set will be used to
evaluate their performance.
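The split described above can be sketched as follows. The customer records are hypothetical, and the `test_size` and `random_state` values are assumptions rather than settings taken from the lab.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical customer records (values invented for illustration).
df = pd.DataFrame({
    "Customer ID": range(1, 11),
    "Age": [23, 35, 41, 29, 52, 47, 31, 38, 26, 60],
    "Gender": [0, 1, 0, 1, 1, 0, 0, 1, 0, 1],  # assumed to be label-encoded already
    "Income": [30, 52, 61, 45, 80, 75, 48, 58, 33, 90],
    "Spending Score": [60, 45, 30, 70, 20, 25, 55, 40, 65, 15],
})

# Features exclude the identifier and the target itself.
X = df.drop(columns=["Customer ID", "Spending Score"])
y = df["Spending Score"]

# Hold out 20% of the rows for testing (assumed ratio); fix the seed for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

With ten rows and a 20% test share, eight rows land in the training set and two in the test set.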
1. Load the dataset and display the first few rows to understand its structure.
CODE:
OUTPUT:
2. Check for missing values and handle them appropriately.
CODE:
print(df.isnull().sum())
OUTPUT:
from sklearn.preprocessing import LabelEncoder
# Convert the Gender categories to integer codes
label_encoder = LabelEncoder()
df['Gender'] = label_encoder.fit_transform(df['Gender'])
CODE:
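from sklearn.preprocessing import StandardScaler
numerical_cols = ['Age', 'Income']  # assumed: the numeric feature columns to scale
scaler = StandardScaler()
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])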
5. Split the dataset into independent variables (features) and the dependent variable
(target).
CODE:
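# Customer ID is only an identifier, so it is neither a feature nor the target
X = df.drop(columns=['Customer ID', 'Spending Score'])
y = df['Spending Score']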
CODE:
Outcome:
This lab demonstrates common data preprocessing activities such as handling missing values,
encoding categorical variables, normalizing numerical variables, and splitting the dataset into
training and testing sets using Python. It ensures that the data is ready for further analysis or
modeling tasks.
The objective of this lab is to implement Linear and Multiple Linear Regression on a given
dataset using Python. Linear regression is a simple, yet powerful technique used for
predicting a continuous target variable based on one or more independent variables.
Dataset:
The dataset for this assignment is a fictional dataset containing information about houses. It
includes the following columns:
• Area (in square feet)
• Number of bedrooms
• Number of bathrooms
• Price
Tasks:
1. Load the dataset and display the first few rows to understand its structure.
2. Split the dataset into independent variables (features) and the dependent variable
(target).
4. Evaluate the model's performance using metrics like Mean Squared Error (MSE) or
R-squared.
5. Implement Multiple Linear Regression to predict house prices based on the area,
number of bedrooms, and number of bathrooms.
Algorithm:
1. Load Dataset:
Read the dataset containing information about houses.
1. Load the dataset and display the first few rows to understand its structure.
CODE:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
df = pd.read_csv(r"C:\ML Dataset(Umang)\fictional_house_data.csv")
print(df.head())
OUTPUT:
2. Split the dataset into independent variables (features) and the dependent variable
(target).
CODE:
X = df[['Area']]
y = df['Price']
CODE:
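from sklearn.linear_model import LinearRegression
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # assumed split settings
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)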
OUTPUT:
CODE:
4. Evaluate the model's performance using metrics like Mean Squared Error (MSE) or
R-squared.
CODE:
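from sklearn.metrics import r2_score
y_pred = lin_reg.predict(X_test)  # predictions for the held-out test set
r2 = r2_score(y_test, y_pred)
print("R-squared:", r2)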
OUTPUT:
5. Implement Multiple Linear Regression to predict house prices based on the area,
number of bedrooms, and number of bathrooms.
CODE:
multiple_lin_reg = LinearRegression()
multiple_lin_reg.fit(X_train, y_train)
OUTPUT:
CODE:
y_pred_multiple = multiple_lin_reg.predict(X_test)
CODE:
r2_multiple = r2_score(y_test, y_pred_multiple)
print("R-squared:", r2_multiple)
OUTPUT:
Outcome:
This lab demonstrates the implementation of Linear and Multiple Linear Regression on a
dataset containing house-related information using Python. It includes splitting the dataset,
fitting the regression models, and evaluating their performance using Mean Squared Error
(MSE) and R-squared metrics.
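As an end-to-end illustration of the two models and both metrics, here is a minimal sketch on synthetic data. The price rule, the noise level, and the helper name `fit_and_score` are assumptions made for this example and do not come from the lab's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic houses: the price rule below is an assumption, purely illustrative.
rng = np.random.default_rng(0)
n = 200
area = rng.uniform(500, 3000, n)
bedrooms = rng.integers(1, 6, n)
bathrooms = rng.integers(1, 4, n)
price = 100 * area + 10000 * bedrooms + 5000 * bathrooms + rng.normal(0, 10000, n)


def fit_and_score(X, y):
    """Fit a linear regression on a train split and return (MSE, R^2) on the test split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    model = LinearRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return mean_squared_error(y_test, y_pred), r2_score(y_test, y_pred)


# Simple linear regression: area only.
mse_simple, r2_simple = fit_and_score(area.reshape(-1, 1), price)

# Multiple linear regression: area, bedrooms, and bathrooms.
X_multi = np.column_stack([area, bedrooms, bathrooms])
mse_multi, r2_multi = fit_and_score(X_multi, price)

print("Simple:   MSE =", mse_simple, "R-squared =", r2_simple)
print("Multiple: MSE =", mse_multi, "R-squared =", r2_multi)
```

Because the synthetic price depends on all three features, the multiple regression is expected to leave a smaller test error than the area-only model.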
The objective of this lab is to implement Logistic Regression Analysis on a given dataset
using Python. Logistic Regression is a powerful technique used for binary classification
tasks, where the target variable has two possible outcomes.
Dataset:
The dataset for this assignment is a fictional dataset containing information about bank
customers. It includes the following columns:
• Age
• Income
• Credit Score
• Loan Status
Tasks:
1. Load the dataset and display the first few rows to understand its structure.
2. Split the dataset into independent variables (features) and the dependent variable
(target).
3. Implement Logistic Regression to predict loan approval status based on age, income,
and credit score.
4. Evaluate the model's performance using metrics like accuracy, precision, recall, and
F1-score.
5. Optionally, visualize the ROC curve.
Algorithm:
1. Load Dataset:
Read the dataset containing information about bank customers.
3. Split Dataset:
• Separate the dataset into independent variables (features) and the dependent
variable (target).
• Independent variables: 'Age', 'Income', 'Credit Score'
• Dependent variable: 'Loan Status'
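The steps above can be sketched on synthetic data as follows. The labelling rule is an assumption, and the `StandardScaler` step is an addition not shown in the lab; it is included only because the three features sit on very different scales.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic bank customers (values and labelling rule are assumptions).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "Age": rng.integers(21, 65, n),
    "Income": rng.normal(50000, 15000, n),
    "Credit Score": rng.integers(300, 850, n),
})
signal = (df["Credit Score"] - 575) / 150 + (df["Income"] - 50000) / 15000
df["Loan Status"] = (signal + rng.normal(0, 0.5, n) > 0).astype(int)

X = df[["Age", "Income", "Credit Score"]]
y = df["Loan Status"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale the features, then fit the logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```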
1. Load the dataset and display the first few rows to understand its structure.
CODE:
import pandas as pd
df = pd.read_csv(r"C:\ML Dataset(Umang)\fictional_bank_customer_data.csv")
print(df.head())
OUTPUT:
2. Split the dataset into independent variables (features) and the dependent variable
(target).
CODE:
X = df[['Age', 'Income', 'Credit Score']]
y = df['Loan Status']
3. Implement Logistic Regression to predict loan approval status based on age, income,
and credit score.
CODE:
logreg_model = LogisticRegression()
logreg_model.fit(X_train, y_train)
OUTPUT:
CODE:
y_pred = logreg_model.predict(X_test)
4. Evaluate the model's performance using metrics like accuracy, precision, recall, and
F1-score.
CODE:
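from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)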
OUTPUT:
5. Optionally, visualize the ROC curve.
CODE:
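import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve
y_prob = logreg_model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_prob)  # false/true positive rates at each threshold
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label="ROC curve")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend(loc="lower right")
plt.show()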
OUTPUT:
Outcome:
The objective of this lab is to implement Decision Tree algorithm on a given dataset using
Python. Decision Trees are powerful machine learning models used for both classification
and regression tasks.
Dataset:
The dataset for this assignment is a fictional dataset containing information about patients. It
includes the following columns:
• Age
• Blood Pressure
• Cholesterol Level
• Heart Disease
Tasks:
1. Load the dataset and display the first few rows to understand its structure.
2. Split the dataset into independent variables (features) and the dependent variable
(target).
3. Implement Decision Tree Classifier to predict the presence of heart disease based on
age, blood pressure, and cholesterol level.
4. Evaluate the model's performance using metrics like accuracy, precision, recall, and
F1-score.
Algorithm:
1. Load Dataset:
Read the dataset containing information about patients.
3. Split Dataset:
• Separate the dataset into independent variables (features) and the dependent
variable (target).
• Independent variables: 'Age', 'Blood Pressure', 'Cholesterol Level'
• Dependent variable: 'Heart Disease'
1. Load the dataset and display the first few rows to understand its structure.
CODE:
import pandas as pd
df = pd.read_csv(r"C:\ML Dataset(Umang)\fictional_patient_data.csv")
print(df.head())
OUTPUT:
2. Split the dataset into independent variables (features) and the dependent variable
(target).
CODE:
X = df[['Age', 'Blood Pressure', 'Cholesterol Level']]
y = df['Heart Disease']
3. Implement Decision Tree Classifier to predict the presence of heart disease based on
age, blood pressure, and cholesterol level.
CODE:
dt_classifier = DecisionTreeClassifier()
dt_classifier.fit(X_train, y_train)
OUTPUT:
CODE:
CODE:
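from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
y_pred = dt_classifier.predict(X_test)  # predictions for the held-out test set
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)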
OUTPUT:
from sklearn.tree import plot_tree
plt.figure(figsize=(12, 8))
plot_tree(dt_classifier, filled=True)  # draw the fitted tree
plt.show()
OUTPUT:
Outcome:
This lab demonstrates the implementation of Decision Tree Classifier on a dataset containing
patient information using Python. It includes splitting the dataset, fitting the decision tree
model, evaluating its performance using accuracy, precision, recall, and F1-score metrics, and
optionally visualizing the decision tree.
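When a plot is not convenient, the fitted tree can also be inspected as text. The sketch below swaps the plot for scikit-learn's `export_text`. The patient records and the rule used to label them are assumptions made for illustration, as are `max_depth=3` and the split settings.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic patients; the labelling rule below is an assumption, not medical guidance.
rng = np.random.default_rng(1)
n = 200
X = pd.DataFrame({
    "Age": rng.integers(30, 80, n),
    "Blood Pressure": rng.integers(90, 180, n),
    "Cholesterol Level": rng.integers(150, 300, n),
})
y = ((X["Blood Pressure"] > 140) & (X["Cholesterol Level"] > 220)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# A shallow tree keeps the printed rules short enough to read.
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)
test_accuracy = tree.score(X_test, y_test)

# Print the learned decision rules as indented text.
rules = export_text(tree, feature_names=list(X.columns))
print(rules)
print("Test accuracy:", test_accuracy)
```

Limiting the depth also reduces the risk of overfitting, which an unconstrained decision tree is prone to.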
The objective of this lab is to implement the Naïve Bayes algorithm on a given dataset using
Python. Naïve Bayes is a probabilistic classifier based on Bayes' theorem with the
assumption of independence between features.
Dataset:
The dataset for this assignment is a fictional dataset containing information about emails. It
includes the following columns:
• Email Text
• Spam
Tasks:
1. Load the dataset and display the first few rows to understand its structure.
2. Split the dataset into independent variables (features) and the dependent variable
(target).
3. Preprocess the email text data (e.g., remove punctuation, convert to lowercase,
tokenize).
4. Implement Naïve Bayes Classifier to classify emails as spam or not spam based on
their text content.
5. Evaluate the model's performance using metrics like accuracy, precision, recall, and
F1-score.
Algorithm:
1. Load Dataset:
Read the dataset containing information about emails.
2. Display Dataset Information:
Display the first few rows of the dataset to understand its structure, including column
names and data types.
3. Split Dataset:
• Separate the dataset into independent variables (features) and the dependent
variable (target).
• Independent variable: 'Email Text'
• Dependent variable: 'Spam'
1. Load the dataset and display the first few rows to understand its structure.
CODE:
import pandas as pd
df = pd.read_csv(r"C:\ML Dataset(Umang)\fictional_email_data.csv")
# Display the first few rows
print(df.head())
OUTPUT:
2. Split the dataset into independent variables (features) and the dependent variable
(target).
CODE:
X = df['Email Text']
y = df['Spam']
3. Preprocess the email text data (e.g., remove punctuation, convert to lowercase,
tokenize).
CODE:
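from sklearn.feature_extraction.text import CountVectorizer
# By default CountVectorizer lowercases the text and drops punctuation while tokenizing
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(X)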
4. Implement Naïve Bayes Classifier to classify emails as spam or not spam based on
their text content.
CODE:
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)
OUTPUT:
CODE:
5. Evaluate the model's performance using metrics like accuracy, precision, recall, and
F1-score.
CODE:
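from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
y_pred = nb_classifier.predict(X_test)  # predictions for the held-out test set
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)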
OUTPUT:
Outcome:
This lab demonstrates the implementation of Naïve Bayes Classifier on a dataset containing
email text information using Python. It includes preprocessing the text data, fitting the Naïve
Bayes model, and evaluating its performance using accuracy, precision, recall, and F1-score
metrics.
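The vectorize-then-classify flow can be sketched in a few lines with a pipeline. The six emails below are invented for illustration. Wrapping `CountVectorizer` and `MultinomialNB` in `make_pipeline` is a design choice of this sketch, so that the vocabulary is learned only from the data passed to `fit`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented corpus: 1 = spam, 0 = not spam.
emails = [
    "Win a free prize now",
    "Limited offer, claim your free money",
    "Free cash prize, click now",
    "Meeting agenda for Monday",
    "Please review the attached project report",
    "Lunch with the project team on Monday",
]
labels = [1, 1, 1, 0, 0, 0]

# The vectorizer turns each email into word counts; Naive Bayes models those counts.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

new_emails = ["Claim your free prize", "Project meeting on Monday"]
predictions = model.predict(new_emails)
print(predictions)  # → [1 0]
```

The first new email shares only spam vocabulary and the second only non-spam vocabulary, which is why the two predictions differ.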
The objective of this assignment is to implement the k-Nearest Neighbor (KNN) algorithm on
a given dataset using Python. KNN is a simple and intuitive classification algorithm that
classifies data points based on the majority class of their nearest neighbors.
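To make the majority-vote idea concrete, here is a minimal from-scratch sketch in plain Python. The function name `knn_predict`, the choice of Euclidean distance, and the two-feature fruit points are assumptions for illustration. The lab itself uses scikit-learn's `KNeighborsClassifier`.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points closest to query."""
    # Pair every training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Keep the labels of the k nearest points and take the most common one.
    k_nearest_labels = [label for _, label in distances[:k]]
    return Counter(k_nearest_labels).most_common(1)[0][0]


# Hypothetical fruits described by (encoded color, diameter in cm).
train_points = [(0, 7.0), (0, 7.5), (0, 6.8), (1, 2.0), (1, 2.4), (1, 1.8)]
train_labels = ["apple", "apple", "apple", "grape", "grape", "grape"]

print(knn_predict(train_points, train_labels, (0, 7.2)))  # → apple
print(knn_predict(train_points, train_labels, (1, 2.1)))  # → grape
```

Because the vote depends on raw distances, features on larger scales dominate unless the data is scaled first.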
Dataset: The dataset for this assignment is a fictional dataset containing information about
fruits. It includes the following columns:
• Color
• Diameter
• Label
Tasks:
1. Load the dataset and display the first few rows to understand its structure.
2. Split the dataset into independent variables (features) and the dependent variable
(target).
5. Evaluate the model's performance using metrics like accuracy, precision, recall, and
F1-score.
Algorithm:
3. Split Dataset:
Separate the dataset into independent variables (features), which include Color and
Diameter, and the dependent variable (target), which is the Label (type of fruit).
1. Load the dataset and display the first few rows to understand its structure.
CODE:
import pandas as pd
df = pd.read_csv(r"C:\ML Dataset(Umang)\fictional_fruit_data.csv")
print(df.head())
OUTPUT:
2. Split the dataset into independent variables (features) and the dependent variable
(target).
CODE:
X = df[['Color', 'Diameter']]
y = df['Label']
3. Encode categorical variables using one-hot encoding or label encoding if necessary.
CODE:
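from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
X = X.copy()  # work on a copy so pandas does not warn about writing to a slice of df
X['Color'] = label_encoder.fit_transform(X['Color'])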
CODE:
knn_classifier = KNeighborsClassifier()
knn_classifier.fit(X_train, y_train)
OUTPUT:
CODE:
# Predict fruit label using the test data
y_pred = knn_classifier.predict(X_test)
5. Evaluate the model's performance using metrics like accuracy, precision, recall, and
F1-score.
CODE:
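from sklearn.metrics import accuracy_score, precision_recall_fscore_support
accuracy = accuracy_score(y_test, y_pred)
# average='weighted' is an assumption so that more than two fruit labels are handled
precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred, average='weighted')
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)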
OUTPUT:
Outcome:
This lab demonstrates the implementation of KNN Classifier on a dataset containing fruit
information using Python. It includes splitting the dataset, encoding categorical variables,
fitting the KNN model, and evaluating its performance using accuracy, precision, recall, and
F1-score metrics.
Dataset: The dataset for this assignment is a fictional dataset containing information about
flowers. It includes the following columns:
• Sepal Length
• Sepal Width
• Petal Length
• Petal Width
• Species
Tasks:
1. Load the dataset and display the first few rows to understand its structure.
2. Split the dataset into independent variables (features) and the dependent variable
(target).
4. Implement SVM Classifier to classify flowers based on sepal and petal characteristics.
5. Evaluate the model's performance using metrics like accuracy, precision, recall, and
F1-score.
Algorithm:
1. Load Dataset:
Read the dataset containing information about flowers.
2. Display Dataset Information:
Display the first few rows of the dataset to understand its structure, including column
names and data types.
3. Split Dataset:
• Separate the dataset into independent variables (features) and the dependent
variable (target).
• Independent variables: 'Sepal Length', 'Sepal Width', 'Petal Length', 'Petal
Width'
• Dependent variable: 'Species'
4. Encode Categorical Variables (if necessary):
• If the 'Species' column is categorical and not numerical, encode it using
one-hot encoding or label encoding.
5. Implement SVM Classifier:
• Use the independent variables (features) to predict the flower species using
Support Vector Machine Classifier.
• Split the dataset into training and testing sets.
• Train the SVM Classifier on the training data.
• Predict flower species using the test data.
6. Evaluate Model's Performance:
• Calculate metrics like accuracy, precision, recall, and F1-score to evaluate the
performance of the SVM Classifier.
• Display the evaluation metrics to assess how well the model performs in
classifying flowers based on sepal and petal characteristics.
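The algorithm above can be sketched end to end as follows. As a stand-in for the lab's flower CSV, this example uses the Iris dataset bundled with scikit-learn, which has the same four measurements and integer-coded species. That substitution, the split settings, and macro averaging of the multi-class metrics are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in data: sepal/petal length and width, with species already encoded as 0, 1, 2.
X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train a linear-kernel SVM and predict the species of the test flowers.
svm_classifier = SVC(kernel="linear")
svm_classifier.fit(X_train, y_train)
y_pred = svm_classifier.predict(X_test)

# With three classes, precision/recall/F1 need an averaging mode; macro is assumed here.
accuracy = accuracy_score(y_test, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="macro"
)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)
```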
1. Load the dataset and display the first few rows to understand its structure.
CODE:
import pandas as pd
df = pd.read_csv(r"C:\ML Dataset(Umang)\fictional_flower_data.csv")
# Display the first few rows
print(df.head())
OUTPUT:
2. Split the dataset into independent variables (features) and the dependent variable
(target).
CODE:
X = df.drop(columns=['Species'])
y = df['Species']
CODE:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)
4. Implement SVM Classifier to classify flowers based on sepal and petal characteristics.
CODE:
svm_classifier = SVC(kernel='linear')
svm_classifier.fit(X_train, y_train)
OUTPUT:
5. Evaluate the model's performance using metrics like accuracy, precision, recall, and
F1-score.
CODE:
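from sklearn.metrics import accuracy_score, precision_recall_fscore_support
y_pred = svm_classifier.predict(X_test)  # predictions for the held-out test set
accuracy = accuracy_score(y_test, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred, average='weighted')  # averaging mode assumed
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)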
OUTPUT:
Outcome:
This lab demonstrates the implementation of SVM Classifier on a dataset containing flower
information using Python. It includes splitting the dataset, encoding categorical variables,
fitting the SVM model, and evaluating its performance using accuracy, precision, recall, and
F1-score metrics.
The objective of this lab is to implement a Neural Network (NN) and Multi-layer Perceptron
(MLP) on a given dataset using Python. NN and MLP are powerful deep learning models
used for various machine learning tasks.
Dataset: The dataset for this assignment is a fictional dataset containing information about
bank customers. It includes the following columns:
• Age
• Income
• Credit Score
• Loan Status
Tasks:
1. Load the dataset and display the first few rows to understand its structure.
2. Split the dataset into independent variables (features) and the dependent variable
(target).
3. Preprocess the data if necessary (e.g., scaling).
4. Implement a Neural Network model to predict loan approval status based on customer
information.
5. Evaluate the model's performance using metrics like accuracy, precision, recall, and
F1-score.
Algorithm:
1. Load Dataset:
Read the dataset containing information about bank customers.
3. Split Dataset:
• Separate the dataset into independent variables (features) and the dependent
variable (target).
• Independent variables: 'Age', 'Income', 'Credit Score'
• Dependent variable: 'Loan Status'
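A minimal sketch of these steps on synthetic data is shown below. The approval rule, the hidden-layer sizes `(16, 8)`, and `max_iter=2000` are assumptions chosen for illustration, not values taken from the lab. Scaling is placed inside a pipeline so that it is fitted on the training split only.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic bank customers with an assumed, non-linear approval rule.
rng = np.random.default_rng(0)
n = 400
age = rng.integers(21, 65, n)
income = rng.normal(50000, 15000, n)
credit = rng.integers(300, 850, n)
X = np.column_stack([age, income, credit])
y = ((credit > 600) & (income > 45000)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale the inputs, then train a small multi-layer perceptron.
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=2000, random_state=42),
)
mlp.fit(X_train, y_train)

accuracy = accuracy_score(y_test, mlp.predict(X_test))
print("Accuracy:", accuracy)
```

Neural networks are sensitive to feature scale, which is why the scaling step precedes the classifier here.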
1. Load the dataset and display the first few rows to understand its structure.
CODE:
import pandas as pd
df = pd.read_csv(r"C:\ML Dataset(Umang)\fictional_bank_customer_data.csv")
print(df.head())
OUTPUT:
2. Split the dataset into independent variables (features) and the dependent variable
(target).
CODE:
X = df.drop(columns=['Loan Status'])
y = df['Loan Status']
CODE:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
4. Implement a Neural Network model to predict loan approval status based on customer
information.
CODE:
mlp_classifier.fit(X_train, y_train)
OUTPUT:
CODE:
y_pred = mlp_classifier.predict(X_test)
5. Evaluate the model's performance using metrics like accuracy, precision, recall, and
F1-score.
CODE:
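from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)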
OUTPUT:
Outcome:
Dataset: The dataset for this assignment is a fictional dataset containing information about
customers. It includes the following columns:
• Customer ID
• Age
• Income
Tasks:
1. Load the dataset and display the first few rows to understand its structure.
3. Implement the k-Means Clustering algorithm to cluster customers based on their age
and income.
4. Determine the optimal number of clusters using the Elbow Method or Silhouette
Score.
Algorithm:
1. Load Dataset:
Read the dataset containing information about customers.
1. Load the dataset and display the first few rows to understand its structure.
CODE:
import pandas as pd
import numpy as np
df = pd.read_csv(r"C:\ML Dataset(Umang)\clustering.csv")
print(df.head())
OUTPUT:
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df[['Age', 'Income']])  # column 0 = Age, column 1 = Income
3. Implement the k-Means Clustering algorithm to cluster customers based on their age
and income.
4. Determine the optimal number of clusters using the Elbow Method or Silhouette
Score.
CODE:
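from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
silhouette_scores = []
for k in range(2, 11):  # the upper bound of 10 is an assumption; the "+ 2" below implies k starts at 2
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    cluster_labels = kmeans.fit_predict(scaled_data)
    silhouette_scores.append(silhouette_score(scaled_data, cluster_labels))
optimal_k = np.argmax(silhouette_scores) + 2
kmeans = KMeans(n_clusters=optimal_k, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(scaled_data)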
CODE:
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.scatter(scaled_data[:, 0], scaled_data[:, 1], c=cluster_labels, cmap='viridis',
marker='o', edgecolors='k')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='red',
marker='*', label='Centroids')
plt.title('k-Means Clustering')
plt.xlabel('Scaled Age')
plt.ylabel('Scaled Income')
plt.legend()
plt.colorbar(label='Cluster')
plt.grid(True)
plt.show()
OUTPUT:
Outcome:
This solution demonstrates the implementation of k-Means Clustering on a dataset containing
customer information using Python. It includes preprocessing the data, determining the
optimal number of clusters using the Elbow Method and Silhouette Score, and visualizing the
clusters in a scatter plot.
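The Elbow Method named above can be sketched as follows. The data comes from scikit-learn's `make_blobs` as a stand-in for the customer file, and the range of k values and the `n_init` setting are assumptions. The method records the inertia (within-cluster sum of squared distances) for each k and looks for the point where the decrease levels off.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Stand-in data: 300 two-feature points generated around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)
scaled_data = StandardScaler().fit_transform(X)

# Fit k-Means for each candidate k and record the inertia.
k_values = list(range(1, 9))
inertias = []
for k in k_values:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(scaled_data)
    inertias.append(kmeans.inertia_)

# The "elbow" is the k after which inertia stops dropping sharply.
for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 1))
```

Plotting `inertias` against `k_values` makes the bend easier to see. The silhouette score offers a second opinion when the elbow is ambiguous.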