Tanu Raman ML Lab File

The document is a lab file for a Machine Learning course at Indira Gandhi Delhi Technical University for Women, detailing various programming tasks related to data visualization, cleaning, preprocessing, prediction, reinforcement learning, and clustering. Each lab section includes aims and code examples for implementing different machine learning techniques such as histograms, linear regression, K-means clustering, and DBSCAN. The document serves as a practical guide for students to apply machine learning concepts in real-world scenarios.


Indira Gandhi Delhi Technical University for Women

(Established by Govt. of Delhi vide Act 09 of 2012)

Kashmere Gate, Delhi - 110006

LAB FILE
For
MACHINE LEARNING
(BAI-301)

B.Tech./CSE-AI

Department of Artificial Intelligence & Data Sciences

ODD Semester 2024

Submitted to: Ritika Kumari, Dept. of AI&DS
Submitted by: Tanu Raman (14401172022)
INDEX
S. No. — Topic
1. Program to perform visualization and interpretation of data: a) Histogram, b) Scatter Plot, c) Bar Chart and d) Pie Chart
2. Program to perform cleaning of the dataset: a) Drop a Variable, b) Remove Null Values and c) Remove Duplicates
3. Program to perform data preprocessing: a) Standardization, b) Normalization and c) SMOTE
4. Program to perform data prediction using any dataset, for example, iris dataset
5. Program to perform oversampling and undersampling of considered data using machine learning
6. Program to implement the reinforcement learning algorithm on a dataset
7. Program to perform the data sampling and estimation using density-based clustering method
8. Program to perform the model regularization, PCA and optimization using the feature selection methods
9. Program to perform the kernel based SVM modelling of the considered dataset
10. Program to perform the model regularization and optimization using ensemble methods
LAB-1
AIM: Program to perform visualisation and interpretation of data:
1.Histogram
2.Scatter Plot
3.Bar Chart
4.Pie Chart
CODE:
import matplotlib.pyplot as plt
import numpy as np

# Sample data: Monthly sales of different products


product_A_sales = [150, 180, 200, 220, 250, 270, 290, 300, 280, 270, 250, 230]
product_B_sales = [100, 130, 150, 180, 210, 250, 270, 290, 260, 240, 230, 210]
product_C_sales = [50, 80, 120, 150, 180, 210, 240, 270, 250, 230, 200, 170]

# Months of the year


months = np.arange(1, 13)

# 1. Histogram: Distribution of Product A's sales


plt.figure(figsize=(7, 4))
plt.hist(product_A_sales, bins=5, color='blue', edgecolor='black')
plt.title("Histogram of Product A's Monthly Sales")
plt.xlabel("Sales Volume")
plt.ylabel("Frequency")
plt.show()

print('\n****************************************************************************************\n')
# 2. Scatter Plot: Product A vs Product B sales
plt.figure(figsize=(7, 4))
plt.scatter(product_A_sales, product_B_sales, color='orange')
plt.title("Scatter Plot: Product A Sales vs Product B Sales")
plt.xlabel("Product A Sales")
plt.ylabel("Product B Sales")
plt.show()

print('\n****************************************************************************************\n')
# 3. Bar Chart: Monthly Sales of Product C
plt.figure(figsize=(7, 4))
plt.bar(months, product_C_sales, color='green')
plt.title("Bar Chart: Monthly Sales of Product C")
plt.xlabel("Month")
plt.ylabel("Sales Volume")
plt.xticks(months)
plt.show()

print('\n****************************************************************************************\n')
# 4. Pie Chart: Percentage of Total Sales by Product
total_sales = [sum(product_A_sales), sum(product_B_sales), sum(product_C_sales)]
products = ['Product A', 'Product B', 'Product C']

plt.figure(figsize=(5, 5))
plt.pie(total_sales, labels=products, autopct='%1.1f%%', startangle=140)
plt.title("Pie Chart: Total Sales Distribution by Product")
plt.show()
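Note: the four charts above are drawn as separate figures. If a single combined figure is preferred, matplotlib's subplot interface can be used; a minimal sketch reusing the variables defined above (not part of the lab output):

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
axes[0, 0].hist(product_A_sales, bins=5, color='blue', edgecolor='black')
axes[0, 1].scatter(product_A_sales, product_B_sales, color='orange')
axes[1, 0].bar(months, product_C_sales, color='green')
axes[1, 1].pie(total_sales, labels=products, autopct='%1.1f%%', startangle=140)
plt.tight_layout()
plt.show()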
Output:
LAB-2
AIM:
Program to perform cleaning of the dataset:
1. Drop a variable
2. Remove null values
3. Remove duplicate values
CODE:
import pandas as pd

# Sample dataset with some missing values and duplicates


data = {
'ProductID': [101, 102, 103, 104, 105, 106, 107, 108, 109, None, 110, 109],
'ProductName': ['Laptop', 'Tablet', 'Smartphone', 'Monitor', 'Keyboard', 'Mouse', 'Headphones', 'Speaker',
'Camera', 'Monitor', 'Mouse', 'Smartphone'],
'Price': [1200, 450, 700, 300, 100, 25, 150, 200, 350, 300, 25, 700],
'Stock': [50, 200, 100, 30, 250, 500, 150, 75, 60, 30, 500, 100]
}

df = pd.DataFrame(data)
print(df)
print('****************************************************************************************\n\nAfter dropping the "ProductName" column\n')

# 1. Drop a variable (e.g., 'ProductName')


df.drop(columns=['ProductName'], inplace=True)
print(df)
print('****************************************************************************************\n\nAfter dropping the null values\n')

# 2. Remove null values


df.dropna(inplace=True)
print(df)
print('****************************************************************************************\n\nAfter removing the duplicate values\n')

# 3. Remove duplicate values


df.drop_duplicates(inplace=True)
print(df)
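Note: drop_duplicates() with no arguments removes only rows that are identical in every remaining column. In this dataset the repeats are key-level duplicates (ProductID 109 appears twice with different Price and Stock), so the default call leaves them in place. A key-based variant is a reasonable alternative (a sketch, assuming ProductID is the intended unique key):

df.drop_duplicates(subset=['ProductID'], keep='first', inplace=True)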
OUTPUT:
LAB -3
AIM:
Program to perform data preprocessing
1.SMOTE
2.Standardization
3.Normalisation

Code:
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from imblearn.over_sampling import SMOTE

# Sample dataset ('column1' acts as an imbalanced binary target so that SMOTE
# has a minority class to oversample; each class must have more samples than k_neighbors)
data = {'column1': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
        'column2': [100, 200, 300, 400, 500, 100, 200, 300, 400, 500]}

df = pd.DataFrame(data)

# 1. SMOTE (assuming 'column1' is the target variable)


X = df.drop('column1', axis=1)
y = df['column1']

smote = SMOTE(random_state=42, k_neighbors=2)  # minority class has only 3 samples, so k_neighbors must be <= 2
X_resampled, y_resampled = smote.fit_resample(X, y)

# 2. Standardization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_resampled)

# 3. Normalization
scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X_resampled)

print("X_resampled:\n", X_resampled,'\n\n')
print("X_scaled:\n", X_scaled,'\n\n')
print("X_normalized:\n", X_normalized)
Output:
LAB-4
AIM:
Program to perform data prediction using any dataset, for example, iris dataset, through:
1. Supervised Learning (Linear Regression)
2. Unsupervised Learning (K-Means Clustering {k=5})

CODE:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes, make_blobs
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Supervised Learning: Linear Regression
def linear_regression_demo():
    # Load the Diabetes dataset
    diabetes = load_diabetes()
    X = diabetes.data[:, :1]  # Use only one feature for simplicity
    y = diabetes.target

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Train the model
    model = LinearRegression()
    model.fit(X_train, y_train)

    # Make predictions
    y_pred = model.predict(X_test)

    # Evaluate the model
    mse = mean_squared_error(y_test, y_pred)
    print("Linear Regression Mean Squared Error:", mse)

    # Plot the results
    plt.scatter(X_test, y_test, color="blue", label="Actual")
    plt.plot(X_test, y_pred, color="red", label="Predicted")
    plt.title("Linear Regression")
    plt.xlabel("Feature")
    plt.ylabel("Target")
    plt.legend()
    plt.show()

# Unsupervised Learning: K-Means Clustering
def k_means_clustering_demo():
    # Generate a synthetic dataset
    X, _ = make_blobs(n_samples=200, centers=5, cluster_std=1.0, random_state=42)

    # Apply K-Means Clustering
    kmeans = KMeans(n_clusters=5, random_state=42)
    kmeans.fit(X)

    # Get cluster labels and centroids
    labels = kmeans.labels_
    centroids = kmeans.cluster_centers_
    print("K-Means Cluster Centers:\n", centroids)

    # Visualize the clustering
    plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis", label="Clusters")
    plt.scatter(centroids[:, 0], centroids[:, 1], s=200, c="red", marker="X", label="Centroids")
    plt.title("K-Means Clustering (k=5)")
    plt.xlabel("Feature 1")
    plt.ylabel("Feature 2")
    plt.legend()
    plt.show()

# Run the demos
if __name__ == "__main__":
    print("Performing Linear Regression...")
    linear_regression_demo()
    print("\nPerforming K-Means Clustering...")
    k_means_clustering_demo()
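Note: the lab fixes k=5 because make_blobs generates five centres. When the number of clusters is not known in advance, a common heuristic (not part of the lab above) is the elbow method: fit K-Means for a range of k and look for the bend in the inertia curve. A minimal sketch on the same synthetic blobs:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=200, centers=5, cluster_std=1.0, random_state=42)
inertias = [KMeans(n_clusters=k, random_state=42, n_init=10).fit(X).inertia_ for k in range(1, 11)]
plt.plot(range(1, 11), inertias, marker="o")
plt.xlabel("k")
plt.ylabel("Inertia")
plt.title("Elbow Method")
plt.show()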
OUTPUT:

LAB-5
AIM:
Program to implement reinforcement learning on a dataset

CODE:
import numpy as np
import matplotlib.pyplot as plt
# Environment: Define a simple grid-world
class GridWorld:
    def __init__(self, rows, cols, start, goal, obstacles=[]):
        self.rows = rows
        self.cols = cols
        self.start = start
        self.goal = goal
        self.obstacles = obstacles
        self.state = start

    def reset(self):
        self.state = self.start
        return self.state

    def step(self, action):
        # Actions: 0=up, 1=right, 2=down, 3=left
        moves = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}
        next_state = (self.state[0] + moves[action][0], self.state[1] + moves[action][1])

        # Check if next state is valid
        if (0 <= next_state[0] < self.rows and
                0 <= next_state[1] < self.cols and
                next_state not in self.obstacles):
            self.state = next_state
        else:
            next_state = self.state  # Stay in the same state if invalid move

        # Check reward
        if next_state == self.goal:
            return next_state, 1, True  # Goal reached
        return next_state, -0.01, False  # Small penalty for each step

    def render(self):
        grid = np.zeros((self.rows, self.cols))
        grid[self.goal] = 2
        for obs in self.obstacles:
            grid[obs] = -1
        grid[self.state] = 1
        print(grid)

# Q-Learning implementation
def q_learning(env, episodes, alpha=0.1, gamma=0.99, epsilon=0.1):
    q_table = np.zeros((env.rows, env.cols, 4))  # Initialize Q-table
    for episode in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection
            if np.random.rand() < epsilon:
                action = np.random.randint(4)  # Explore
            else:
                action = np.argmax(q_table[state[0], state[1]])  # Exploit

            # Take action
            next_state, reward, done = env.step(action)

            # Q-Learning update rule
            q_table[state[0], state[1], action] = q_table[state[0], state[1], action] + alpha * (
                reward + gamma * np.max(q_table[next_state[0], next_state[1]]) -
                q_table[state[0], state[1], action]
            )
            state = next_state  # Move to next state

    return q_table

# Main function to execute the RL algorithm
def main():
    # Define the environment
    env = GridWorld(rows=5, cols=5, start=(0, 0), goal=(4, 4), obstacles=[(1, 1), (2, 2), (3, 3)])
    episodes = 1000
    q_table = q_learning(env, episodes)

    # Render optimal policy
    print("\nOptimal Policy:")
    for i in range(env.rows):
        for j in range(env.cols):
            if (i, j) == env.goal:
                print(" G ", end="")
            elif (i, j) in env.obstacles:
                print(" # ", end="")
            else:
                action = np.argmax(q_table[i, j])
                actions = ["↑", "→", "↓", "←"]
                print(f" {actions[action]} ", end="")
        print()

    # Simulate a run with optimal policy
    print("\nSimulation with Optimal Policy:")
    state = env.reset()
    done = False
    env.render()
    while not done:
        action = np.argmax(q_table[state[0], state[1]])
        state, _, done = env.step(action)
        env.render()

if __name__ == "__main__":
    main()
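Note: the update implemented in q_learning above is the standard temporal-difference rule Q(s, a) ← Q(s, a) + α [ r + γ · max_a' Q(s', a') − Q(s, a) ], with learning rate α = 0.1, discount factor γ = 0.99 and ε-greedy exploration (ε = 0.1), so the agent mostly exploits the best known action but takes a random action 10% of the time.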

OUTPUT:
LAB-6
AIM:
Program to perform data sampling and estimation using density based
clustering method.

CODE:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# Generate synthetic dataset (you can replace this with your dataset)
def generate_data():
    # For demonstration, let's create a simple moon-shaped dataset
    from sklearn.datasets import make_moons
    X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
    return X

# DBSCAN clustering and data processing
def dbscan_clustering(X, eps=0.3, min_samples=5):
    # Scale the data
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Apply DBSCAN clustering
    dbscan = DBSCAN(eps=eps, min_samples=min_samples)
    clusters = dbscan.fit_predict(X_scaled)

    # Add cluster labels to the dataset
    df = pd.DataFrame(X, columns=["Feature_1", "Feature_2"])
    df["Cluster"] = clusters

    # Sampling data from clusters (excluding noise)
    sampled_data = df[df["Cluster"] != -1].groupby("Cluster").apply(
        lambda x: x.sample(frac=0.2, random_state=42))
    sampled_data.reset_index(drop=True, inplace=True)

    return X_scaled, clusters, sampled_data

# Silhouette score for cluster quality estimation
def calculate_silhouette(X_scaled, clusters):
    if len(set(clusters)) > 1:
        silhouette_avg = silhouette_score(X_scaled, clusters)
        print(f"Silhouette Score for DBSCAN Clustering: {silhouette_avg:.4f}")
    else:
        silhouette_avg = None
        print("Silhouette Score could not be calculated (no clusters formed).")
    return silhouette_avg

# Visualization of clustering results
def plot_clusters(X_scaled, clusters, sampled_data):
    plt.figure(figsize=(10, 6))
    plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=clusters, cmap="viridis", s=50, label="All Points")
    plt.scatter(sampled_data["Feature_1"], sampled_data["Feature_2"], c="red",
                edgecolor="k", s=100, label="Sampled Points")
    plt.title("DBSCAN Clustering and Sampled Data Points", fontsize=14)
    plt.xlabel("Feature 1 (scaled)", fontsize=12)
    plt.ylabel("Feature 2 (scaled)", fontsize=12)
    plt.legend()
    plt.show()

# Summary of results
def print_summary(clusters, sampled_data):
    print("\n--- Clustering Summary ---")
    print(f"Number of clusters (excluding noise): {len(set(clusters)) - (1 if -1 in clusters else 0)}")
    print(f"Noise points: {np.sum(clusters == -1)}")

    print("\n--- Sampled Data ---")
    print(sampled_data.head())

# Main function to execute the DBSCAN clustering and data sampling
def main():
    # Generate synthetic dataset
    X = generate_data()

    # Perform DBSCAN clustering and data sampling
    X_scaled, clusters, sampled_data = dbscan_clustering(X, eps=0.3, min_samples=5)

    # Calculate Silhouette Score
    calculate_silhouette(X_scaled, clusters)

    # Visualize the clustering results
    plot_clusters(X_scaled, clusters, sampled_data)

    # Print clustering summary and sampled data
    print_summary(clusters, sampled_data)

if __name__ == "__main__":
    main()
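Note: DBSCAN's results depend strongly on eps. A common way to choose it (a sketch, not part of the lab above) is the k-distance plot: sort each point's distance to its k-th nearest neighbour (k ≈ min_samples) and read off the distance at the knee of the curve as a candidate eps for the scaled data.

from sklearn.neighbors import NearestNeighbors  # np and plt are already imported above

def k_distance_plot(X_scaled, k=5):
    nn = NearestNeighbors(n_neighbors=k).fit(X_scaled)
    distances, _ = nn.kneighbors(X_scaled)   # column 0 is each point itself (distance 0)
    plt.plot(np.sort(distances[:, -1]))      # sorted distance to the k-th neighbour
    plt.xlabel("Points (sorted by distance)")
    plt.ylabel(f"{k}-NN distance")
    plt.title("k-distance plot for choosing eps")
    plt.show()

Calling k_distance_plot(X_scaled) before fixing eps=0.3 makes that choice easier to justify.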

OUTPUT:

LAB-7
AIM:
Program to perform model regularization, PCA, and optimization using
feature selection methods.

CODE:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge, Lasso
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Create a synthetic dataset: 1000 samples, 20 random features, binary target
np.random.seed(42)
n_samples, n_features = 1000, 20
X = np.random.rand(n_samples, n_features)
y = np.random.choice([0, 1], size=n_samples)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# PCA: keep enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)
print(f"Original Features: {X_train.shape[1]}")
print(f"Reduced Features after PCA: {X_train_pca.shape[1]}")

# Regularized linear models: Ridge (L2 penalty) and Lasso (L1 penalty)
ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=0.01)
ridge.fit(X_train_pca, y_train)
lasso.fit(X_train_pca, y_train)

# Feature selection: keep the components with non-zero Lasso coefficients
selector = SelectFromModel(lasso, prefit=True)
X_train_selected = selector.transform(X_train_pca)
X_test_selected = selector.transform(X_test_pca)
print(f"Features selected by Lasso: {X_train_selected.shape[1]}")

# Hyperparameter optimization of a Random Forest with cross-validated grid search
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
}
rf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_selected, y_train)
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test_selected)
accuracy = accuracy_score(y_test, y_pred)
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Test Accuracy: {accuracy:.4f}")

OUTPUT:
LAB-8
AIM:
Program to perform the kernel based SVM modelling of the considered
dataset.

CODE:
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score

iris = datasets.load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
svm_model = SVC(kernel='rbf', gamma='scale', C=1.0)  # You can experiment with the 'kernel', 'C', and 'gamma' parameters
svm_model.fit(X_train, y_train)
y_pred = svm_model.predict(X_test)
print("Accuracy Score: ", accuracy_score(y_test, y_pred))
print("Classification Report: \n", classification_report(y_test, y_pred))

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_train_2d = pca.fit_transform(X_train)
X_test_2d = pca.transform(X_test)
plt.figure(figsize=(8, 6))
plt.scatter(X_test_2d[:, 0], X_test_2d[:, 1], c=y_test, cmap=plt.cm.Paired, s=30, edgecolors='k')
plt.title("SVM Decision Boundaries (RBF Kernel) with PCA-reduced Data")
plt.show()
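Note: the comment in the code invites experimenting with kernel, C and gamma. A small cross-validated grid search (a sketch, not part of the original lab) automates that choice on the already standardized training data:

from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.1, 1, 10, 100], 'gamma': ['scale', 0.01, 0.1, 1]}
grid = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)   # X_train was standardized by the scaler above
print("Best parameters:", grid.best_params_)
print("Best CV accuracy:", grid.best_score_)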

OUTPUT:

LAB-9
AIM:
Program to perform the model regularization and optimization using ensemble
methods.

CODE:
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import Ridge, Lasso
from sklearn.metrics import mean_squared_error

# Load the California Housing dataset and split it
data = fetch_california_housing()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Bagging: Random Forest
random_forest = RandomForestRegressor(n_estimators=100, random_state=42)
random_forest.fit(X_train, y_train)
y_pred_rf = random_forest.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)
print(f"Random Forest MSE: {mse_rf:.3f}")

# Boosting: Gradient Boosting
gradient_boosting = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
gradient_boosting.fit(X_train, y_train)
y_pred_gb = gradient_boosting.predict(X_test)
mse_gb = mean_squared_error(y_test, y_pred_gb)
print(f"Gradient Boosting MSE: {mse_gb:.3f}")

# Stacking: regularized linear models and a Random Forest as base learners
base_models = [
    ('ridge', Ridge(alpha=1.0)),
    ('lasso', Lasso(alpha=0.1)),
    ('rf', RandomForestRegressor(n_estimators=100, random_state=42))
]
meta_model = GradientBoostingRegressor(n_estimators=50, learning_rate=0.1, random_state=42)
stacking_regressor = StackingRegressor(estimators=base_models, final_estimator=meta_model, cv=5)
stacking_regressor.fit(X_train, y_train)
y_pred_stack = stacking_regressor.predict(X_test)
mse_stack = mean_squared_error(y_test, y_pred_stack)
print(f"Stacking Regressor MSE: {mse_stack:.3f}")

print("\nSummary of Model Performance:")
print(f"Bagging (Random Forest) MSE: {mse_rf:.3f}")
print(f"Boosting (Gradient Boosting) MSE: {mse_gb:.3f}")
print(f"Stacking Ensemble MSE: {mse_stack:.3f}")

OUTPUT:

LAB-10
AIM:
Program to perform the oversampling (SVM Smote) and undersampling (Random
sampling) of a considered dataset using Machine Learning

CODE:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter

def evaluate_model(X_train, X_test, y_train, y_test):
    model = SVC(kernel='linear', random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print("Classification Report:\n", classification_report(y_test, y_pred))
    print("Accuracy Score:", accuracy_score(y_test, y_pred))

def generate_data():
    X, y = make_classification(
        n_samples=1000, n_features=20, n_informative=15, n_redundant=5,
        n_clusters_per_class=1, weights=[0.9, 0.1], flip_y=0, random_state=42
    )
    print("Original class distribution:", Counter(y))
    return X, y

def oversample_smote(X, y):
    smote = SMOTE(random_state=42)
    X_resampled, y_resampled = smote.fit_resample(X, y)
    print("After SMOTE Oversampling:", Counter(y_resampled))
    return X_resampled, y_resampled

def undersample_random(X, y):
    undersampler = RandomUnderSampler(random_state=42)
    X_resampled, y_resampled = undersampler.fit_resample(X, y)
    print("After Random Undersampling:", Counter(y_resampled))
    return X_resampled, y_resampled

if __name__ == "__main__":
    X, y = generate_data()
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    print("\n--- Original Data ---")
    evaluate_model(X_train, X_test, y_train, y_test)

    X_smote, y_smote = oversample_smote(X_train, y_train)
    print("\n--- After SMOTE Oversampling ---")
    evaluate_model(X_smote, X_test, y_smote, y_test)

    X_under, y_under = undersample_random(X_train, y_train)
    print("\n--- After Random Undersampling ---")
    evaluate_model(X_under, X_test, y_under, y_test)
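Note: the aim mentions SVM-SMOTE specifically, while the code above uses plain SMOTE. imblearn also provides SVMSMOTE, which generates synthetic minority samples near the SVM decision boundary; it is a drop-in replacement for oversample_smote above (a sketch, same interface):

from imblearn.over_sampling import SVMSMOTE

def oversample_svm_smote(X, y):
    svm_smote = SVMSMOTE(random_state=42)
    X_resampled, y_resampled = svm_smote.fit_resample(X, y)
    print("After SVM-SMOTE Oversampling:", Counter(y_resampled))
    return X_resampled, y_resampled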

OUTPUT:
