
Experiment 1: Introduction to Python Programming

Aim
The goal of this experiment is to introduce the basics of Python programming, including syntax,
data types, control structures, and basic operations.

1. Basics of Python Programming


Hello, World!
The simplest Python program prints "Hello, World!" to the console.

print("Hello, World!")

Variables and Data Types

In Python, the type of a variable is inferred automatically from the value assigned to it.
# Integer
x = 5
print("x is of type:", type(x))

# Float
y = 3.14
print("y is of type:", type(y))

# String
name = "Alice"
print("name is of type:", type(name))

# Boolean
is_student = True
print("is_student is of type:", type(is_student))

Basic Operations
Python supports standard arithmetic operations.
a = 10
b = 3

print("Addition:", a + b)
print("Subtraction:", a - b)
print("Multiplication:", a * b)
print("Division:", a / b)
print("Integer Division:", a // b)
print("Modulus:", a % b)
print("Exponentiation:", a ** b)

2. Control Structures
If Statement
The if statement is used for conditional execution.
age = 20

if age >= 18:
    print("You are an adult.")
else:
    print("You are a minor.")

For Loop
The for loop in Python is used for iterating over a sequence.
# Iterating over a list
fruits = ["apple", "banana", "cherry"]
for fruit in fruits:
    print(fruit)

# Using the range() function
for i in range(5):
    print(i)

While Loop
The while loop continues executing as long as the condition is true.
count = 0
while count < 5:
    print(count)
    count += 1

3. Data Structures
Lists
Lists are ordered collections of items.
# Creating a list
numbers = [1, 2, 3, 4, 5]

# Accessing elements
print("First element:", numbers[0])
print("Last element:", numbers[-1])

# Modifying elements
numbers[0] = 10
print("Modified list:", numbers)

# Adding elements
numbers.append(6)
print("List after appending:", numbers)

# Removing elements
numbers.remove(3)
print("List after removing 3:", numbers)

Strings
Strings are sequences of characters.
# Creating a string
message = "Hello, World!"

# Accessing characters
print("First character:", message[0])
print("Last character:", message[-1])

# Slicing
print("First 5 characters:", message[:5])
print("Last 5 characters:", message[-5:])

# String methods
print("Uppercase:", message.upper())
print("Lowercase:", message.lower())
print("Count of 'o':", message.count('o'))
print("Replace 'World' with 'Python':", message.replace("World", "Python"))

# Concatenation
greeting = "Hello"
name = "Alice"
full_greeting = greeting + ", " + name + "!"
print(full_greeting)

4. Functions
Functions in Python are defined using the def keyword.
# Defining a function
def greet(name):
    return f"Hello, {name}!"

# Calling a function
message = greet("Alice")
print(message)

5. Examples Combining Multiple Concepts


Example 1: Check if a Number is Prime
This example demonstrates how to check if a number is a prime number without using functions.
num = 29
if num <= 1:
    print(num, "is not a prime number.")
else:
    is_prime = True
    for i in range(2, num):
        if num % i == 0:
            is_prime = False
            break
    if is_prime:
        print(num, "is a prime number.")
    else:
        print(num, "is not a prime number.")

Example 2: Sum of Elements in a List


This example calculates the sum of elements in a list without using functions.
numbers = [1, 2, 3, 4, 5]
total = 0
for num in numbers:
    total += num
print("Sum of elements in the list:", total)

Result
In this experiment, the basics of Python programming were introduced. Fundamental operations such as printing output and performing arithmetic operations were covered. Control structures, including if statements and loops (for and while), were explored to handle conditional and repetitive tasks. Essential data structures like lists and strings for storing and manipulating collections of items were discussed. Practical examples demonstrated how to combine these concepts to solve simple problems, such as checking for prime numbers and summing elements in a list.

Experiment 2: Familiarization with Basic Python Libraries

Aim
The aim of this experiment is to become familiar with the basic Python libraries used in data science and machine learning: NumPy, Pandas, Matplotlib, and Scikit-learn. This includes understanding their core functionalities, performing basic operations, and visualizing data.

NumPy
NumPy (Numerical Python) is a library used for working with arrays and provides mathematical
functions to operate on these arrays.

Basic Operations with NumPy


• Creating Arrays
• Array Operations
• Mathematical Functions
Example Code:
import numpy as np

# Creating Arrays
array1 = np.array([1, 2, 3, 4, 5])
array2 = np.array([[1, 2, 3], [4, 5, 6]])

print("Array 1:", array1)
print("Array 2:\n", array2)

# Array Operations
sum_array = array1 + 10
product_array = array1 * 2

print("Sum Array:", sum_array)
print("Product Array:", product_array)

# Mathematical Functions
sin_array = np.sin(array1)
log_array = np.log(array1)

print("Sine Array:", sin_array)
print("Log Array:", log_array)

Pandas
Pandas is a powerful library for data manipulation and analysis. It provides data structures like
DataFrame.


Basic Operations with Pandas


• Creating DataFrames

• DataFrame Operations

• Data Manipulation

Example Code:
import pandas as pd

# Creating DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [24, 27, 22],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print("DataFrame:\n", df)

# DataFrame Operations
print("Mean Age:", df['Age'].mean())
print("Data in Age column:\n", df['Age'])

# Data Manipulation
df['Age'] = df['Age'] + 1
print("Updated DataFrame:\n", df)

Matplotlib
Matplotlib is a plotting library for creating static, animated, and interactive visualizations in
Python.

Basic Plotting with Matplotlib


• Line Plot

• Bar Plot

• Scatter Plot

Example Code:
import matplotlib.pyplot as plt

# Line Plot
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]
plt.plot(x, y, label='Line Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Line Plot Example')
plt.legend()
plt.show()

# Bar Plot
categories = ['A', 'B', 'C']
values = [10, 15, 7]
plt.bar(categories, values, color='green')
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Bar Plot Example')
plt.show()

# Scatter Plot
x = [1, 2, 3, 4, 5]
y = [5, 7, 3, 8, 4]
plt.scatter(x, y, label='Scatter Plot', color='red')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Scatter Plot Example')
plt.legend()
plt.show()

Scikit-learn (Sklearn)
Scikit-learn is a machine learning library for Python. It features various classification, regression,
and clustering algorithms.

Basic Operations with Scikit-learn


• Loading Data

• Training a Model

• Making Predictions

Example Code:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Loading Dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Using only two features for simplicity
X = X[:, :2]

# Splitting Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Training a Model
model = LinearRegression()
model.fit(X_train, y_train)

# Making Predictions
predictions = model.predict(X_test)

# Evaluating the Model
mse = mean_squared_error(y_test, predictions)
print("Mean Squared Error:", mse)


Result
The functionalities of essential Python libraries for data science and machine learning were explored. NumPy was used for array operations and mathematical functions. Pandas was used for data manipulation with DataFrames. Matplotlib was employed to create various plots for data visualization. Finally, Scikit-learn was utilized for loading datasets, training a machine learning model, making predictions, and evaluating model performance.

Viva
NumPy
1. What is NumPy and why is it used in Python programming?
NumPy is a library for numerical computations in Python, providing support for arrays and
mathematical functions to perform operations efficiently.

2. Explain the difference between a NumPy array and a Python list.


A NumPy array is a fixed-size, multi-dimensional container of elements of the same type,
supporting element-wise operations, while a Python list can hold elements of different types
and does not support direct mathematical operations.

3. How can you create a NumPy array? Provide an example.


You can create a NumPy array using np.array(). Example: np.array([1, 2, 3]).

4. What are some common operations you can perform on NumPy arrays?
Common operations include arithmetic operations (addition, multiplication), reshaping, slicing, and applying mathematical functions like mean, sum, and standard deviation.

5. How does broadcasting work in NumPy? Provide an example.


Broadcasting allows NumPy to perform arithmetic operations on arrays of different shapes
by automatically expanding the smaller array to match the larger array’s shape. Example:
Adding a scalar to an array.
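A minimal sketch of both cases (scalar and row-vector broadcasting):

import numpy as np

a = np.array([1, 2, 3])
print(a + 10)    # scalar broadcast to every element: [11 12 13]

m = np.array([[1, 2, 3], [4, 5, 6]])
print(m + a)     # 'a' is stretched across each row: [[2 4 6] [5 7 9]]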

6. What are some mathematical functions provided by NumPy? Can you give examples?
Functions include np.mean(), np.sum(), np.sin(), and np.log(). Example: np.mean([1, 2, 3]) computes the average.

7. How do you compute the mean and standard deviation of a NumPy array?
Use np.mean(array) and np.std(array), respectively.

Pandas
1. What is Pandas and what are its primary data structures?
Pandas is a data manipulation library with two primary data structures: Series (one-dimensional)
and DataFrame (two-dimensional).

2. How do you create a DataFrame in Pandas?

You can create a DataFrame using pd.DataFrame(). Example: pd.DataFrame(data={'Column1': [1, 2], 'Column2': [3, 4]}).

3. What methods are available in Pandas for handling missing data?


Methods include dropna() to remove missing values and fillna() to fill missing values with
a specified value.

4 CEK
AML311 - PML Lab

4. How can you select specific rows or columns in a Pandas DataFrame?

Use df['column_name'] to select a column and df.loc[row_index] or df.iloc[row_index] to select rows.

5. Explain the difference between loc and iloc in Pandas.


loc selects data by label (index names), while iloc selects data by integer location.

6. How do you merge or join two DataFrames in Pandas?

Use pd.merge(df1, df2, on='key') to merge DataFrames on a common key.

7. What is the purpose of the groupby method in Pandas? Provide an example.

groupby is used to group data based on certain criteria. Example: df.groupby('column_name').mean() computes the mean for each group.
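A small illustration (the DataFrame and its column names here are hypothetical):

import pandas as pd

df = pd.DataFrame({'Team': ['A', 'A', 'B'], 'Score': [10, 20, 30]})
# Mean Score per group: Team A -> 15.0, Team B -> 30.0
print(df.groupby('Team')['Score'].mean())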

Matplotlib
1. What is Matplotlib and why is it used in data visualization?
Matplotlib is a plotting library used to create static, interactive, and animated visualizations
in Python.

2. How do you create a basic line plot using Matplotlib?


Use plt.plot(x, y) where x and y are the data points.

3. Explain the difference between a bar plot and a scatter plot.


A bar plot displays categorical data with rectangular bars, while a scatter plot shows relationships between two continuous variables with points.

4. How can you customize the appearance of a plot in Matplotlib (e.g., colors, labels, title)?
Use functions like plt.xlabel(), plt.ylabel(), plt.title(), and plt.plot(x, y, color='r') to set labels, titles, and colors.

5. What is the purpose of the plt.show() function in Matplotlib?


plt.show() displays the plot window and renders the figure.

6. How do you save a plot as an image file using Matplotlib?


Use plt.savefig('filename.png') to save the plot as an image file.

7. Can you create multiple plots in a single figure using Matplotlib? How?
Yes, use plt.subplot() to create multiple subplots in one figure.
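A minimal sketch placing two plots side by side in one figure:

import matplotlib.pyplot as plt

plt.subplot(1, 2, 1)              # 1 row, 2 columns, first panel
plt.plot([1, 2, 3], [1, 4, 9])
plt.title('Line')

plt.subplot(1, 2, 2)              # second panel
plt.bar(['A', 'B'], [3, 5])
plt.title('Bar')

plt.tight_layout()
plt.show()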

Scikit-learn
1. What is Scikit-learn and what types of tasks is it used for?
Scikit-learn is a machine learning library for Python, used for tasks such as classification,
regression, clustering, and dimensionality reduction.

2. How do you load a dataset using Scikit-learn? Provide an example with the Iris
dataset.
Use datasets.load_iris() to load the Iris dataset.

3. What is the purpose of the train_test_split function in Scikit-learn?


It splits the dataset into training and testing sets to evaluate the model’s performance.

4. How do you fit a machine learning model to training data using Scikit-learn?
Use the fit() method on the model instance. Example: model.fit(X_train, y_train).

5 CEK
AML311 - PML Lab

5. What metrics can be used to evaluate the performance of a regression model?


Metrics include Mean Squared Error (MSE), Mean Absolute Error (MAE), and R-squared.

6. Explain the concept of mean squared error and how it is calculated.


Mean Squared Error (MSE) measures the average squared difference between predicted and
actual values.

7. How can you make predictions using a trained Scikit-learn model?


Use the predict() method on the model instance. Example: model.predict(X_test).


Experiment 3: Union and intersection of two lists


Aim
To write a Python program that computes the union and intersection of two lists.

Algorithm
1. Accept user input for the elements of two lists, where the elements are separated by spaces.

2. Convert the input strings into lists of integers.

3. Find the union:

• Initialize an empty list called union.


• For each element in the first list, add it to union if it is not already present.
• For each element in the second list, add it to union if it is not already present.

4. Find the intersection:

• Initialize an empty list called intersection.


• For each element in the first list, check if it is present in the second list and not already
in intersection.
• Add such elements to intersection.

5. Display the original lists, the union of the lists, and the intersection of the lists.

Code
# Accept user input for the lists
list1 = input("Enter the elements of the first list, separated by spaces: ").split()
list2 = input("Enter the elements of the second list, separated by spaces: ").split()

# Convert input strings to integers
list1 = [int(element) for element in list1]
list2 = [int(element) for element in list2]

# Find the union
union = []
for element in list1:
    if element not in union:
        union.append(element)
for element in list2:
    if element not in union:
        union.append(element)

# Find the intersection
intersection = []
for element in list1:
    if element in list2 and element not in intersection:
        intersection.append(element)

# Display the results
print(f"List 1: {list1}")
print(f"List 2: {list2}")
print(f"Union: {union}")
print(f"Intersection: {intersection}")

Result
The experiment was conducted successfully, and the union and intersection of the lists were computed.


Experiment 4: Matrix multiplication


Aim
To write a Python program that multiplies two matrices.

Algorithm
1. Accept input for the number of rows and columns of the first matrix.

2. Accept input for the number of rows and columns of the second matrix.

3. Ensure that the number of columns in the first matrix is equal to the number of rows in the
second matrix to enable multiplication.

4. Input the elements for the first matrix.

5. Input the elements for the second matrix.

6. Initialize a result matrix with dimensions equal to the number of rows of the first matrix and
the number of columns of the second matrix.

7. Iterate through each row of the first matrix and each column of the second matrix.

8. Compute the dot product of the row from the first matrix and the column from the second
matrix.

9. Store the computed value in the corresponding position in the result matrix.

10. Display the first matrix, the second matrix, and the resulting matrix after multiplication.

Code
# Function to multiply two matrices
def multiply_matrices(matrix1, matrix2):
    # Get dimensions of matrices
    rows1 = len(matrix1)
    cols1 = len(matrix1[0])
    rows2 = len(matrix2)
    cols2 = len(matrix2[0])

    # Initialize result matrix
    result = [[0 for _ in range(cols2)] for _ in range(rows1)]

    # Perform matrix multiplication
    for i in range(rows1):
        for j in range(cols2):
            for k in range(cols1):
                result[i][j] += matrix1[i][k] * matrix2[k][j]

    return result

# Input dimensions for the first matrix
rows1 = int(input("Enter number of rows for the first matrix: "))
cols1 = int(input("Enter number of columns for the first matrix: "))

# Input dimensions for the second matrix
rows2 = int(input("Enter number of rows for the second matrix: "))
cols2 = int(input("Enter number of columns for the second matrix: "))

# Ensure that the number of columns in the first matrix equals
# the number of rows in the second matrix
if cols1 != rows2:
    print("Matrix multiplication not possible. Number of columns in the first "
          "matrix must equal number of rows in the second matrix.")
else:
    # Input first matrix
    print("Enter elements of the first matrix:")
    matrix1 = []
    for i in range(rows1):
        row = list(map(int, input().split()))
        matrix1.append(row)

    # Input second matrix
    print("Enter elements of the second matrix:")
    matrix2 = []
    for i in range(rows2):
        row = list(map(int, input().split()))
        matrix2.append(row)

    # Multiply matrices
    result = multiply_matrices(matrix1, matrix2)

    # Display matrices
    print("Matrix 1:")
    for row in matrix1:
        print(row)

    print("Matrix 2:")
    for row in matrix2:
        print(row)

    print("Resulting Matrix:")
    for row in result:
        print(row)
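For comparison, NumPy performs the same multiplication in a single call; a minimal sketch, assuming matrix1 and matrix2 are the nested lists read above:

import numpy as np

# np.dot on 2-D arrays computes the matrix product
np_result = np.dot(np.array(matrix1), np.array(matrix2))
print("Resulting Matrix (NumPy):")
print(np_result)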

Result
The experiment was conducted successfully, and the matrices were multiplied to produce the resultant matrix.


Experiment 5: Most Frequent Words in a Text File


Aim
To write a Python program that finds the most frequent words in a text file.

Algorithm
1. Open the text file test.txt in read mode.

2. Read the content of the file.

3. Split the content into individual words.

4. Create a dictionary to keep track of word frequencies.

5. Iterate over each word in the list of words:

(a) Convert the word to lowercase to ensure case insensitivity.


(b) If the word is already in the dictionary, increment its count.
(c) If the word is not in the dictionary, add it with a count of 1.

6. Find the word(s) with the highest frequency in the dictionary.

7. Display the most frequent word(s) and their frequency.

Code
def find_most_frequent_words(filename):
    with open(filename, 'r') as file:
        content = file.read()

    words = content.split()

    word_freq = {}

    for word in words:
        word = word.lower()
        word = ''.join(char for char in word if char.isalnum())
        if word in word_freq:
            word_freq[word] += 1
        else:
            word_freq[word] = 1

    max_freq = max(word_freq.values())
    most_frequent_words = [word for word, freq in word_freq.items() if freq == max_freq]

    print(f"Most frequent word(s): {', '.join(most_frequent_words)}")
    print(f"Frequency: {max_freq}")

filename = 'test.txt'

find_most_frequent_words(filename)
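For comparison, the counting step can be done with the standard library's collections.Counter; a minimal sketch under the same cleaning rules as the program above:

from collections import Counter

with open('test.txt', 'r') as file:
    words = file.read().lower().split()
words = [''.join(ch for ch in w if ch.isalnum()) for w in words]
word_freq = Counter(w for w in words if w)   # skip tokens that were pure punctuation
print(word_freq.most_common(3))              # three most frequent words with their counts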


Result
The experiment was conducted successfully, and the most frequent words in the text file were found
and displayed.

Experiment 6: Single, Multi variable and Polynomial Regression

Aim
Implement and demonstrate Single-variable, Multi-variable, and Polynomial Regression for a given set of
training data stored in a .CSV file and evaluate the accuracy.

Algorithm
1. Load the dataset from the CSV file into a pandas DataFrame.

2. Extract the features and target variable from the DataFrame.

3. Split the data into training and testing sets.

4. Perform Single-variable Regression:

• Use only one feature for training the model.


• Train the model and make predictions.
• Calculate and print the Mean Squared Error (MSE) of the model.

5. Perform Multi-variable Regression:

• Use multiple features for training the model.


• Train the model and make predictions.
• Calculate and print the Mean Squared Error (MSE) of the model.

6. Perform Polynomial Regression:

• Transform the features to polynomial features.


• Train the polynomial regression model and make predictions.
• Calculate and print the Mean Squared Error (MSE) of the model.

7. Plot the results for Single-variable and Polynomial Regression.

8. Save the plots as images in the folder.

Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

# Load dataset
data = pd.read_csv('data.csv')

# Prepare data
X = data[['SquareFeet']].values
y = data['Price'].values

# Single-variable Regression
X_train_single = X
y_train_single = y

model_single = LinearRegression()
model_single.fit(X_train_single, y_train_single)
y_pred_single = model_single.predict(X_train_single)
mse_single = mean_squared_error(y_train_single, y_pred_single)
print("Single-variable Regression")
print(f"Mean Squared Error: {mse_single}")

# Save the plot for Single-variable Regression
plt.figure(figsize=(10, 6))
plt.scatter(X, y, color='blue', label='Actual Data')
plt.plot(X, y_pred_single, color='red', label='Regression Line')
plt.xlabel('SquareFeet')
plt.ylabel('Price')
plt.title('Single-variable Regression: SquareFeet vs. Price')
plt.legend()
plt.savefig('single_variable_regression.png')
plt.clf()

# Multi-variable Regression (with an additional feature for demonstration)
X_multi = np.concatenate([X, X**2], axis=1)
model_multi = LinearRegression()
model_multi.fit(X_multi, y)
y_pred_multi = model_multi.predict(X_multi)
mse_multi = mean_squared_error(y, y_pred_multi)
print("\nMulti-variable Regression")
print(f"Mean Squared Error: {mse_multi}")

# Save the plot for Multi-variable Regression
plt.figure(figsize=(10, 6))
plt.scatter(X, y, color='blue', label='Actual Data')
plt.plot(X, y_pred_multi, color='red', label='Regression Line')
plt.xlabel('SquareFeet')
plt.ylabel('Price')
plt.title('Multi-variable Regression: SquareFeet vs. Price')
plt.legend()
plt.savefig('multi_variable_regression.png')
plt.clf()

# Polynomial Regression
poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(X)
model_poly = LinearRegression()
model_poly.fit(X_poly, y)
y_pred_poly = model_poly.predict(X_poly)
mse_poly = mean_squared_error(y, y_pred_poly)
print("\nPolynomial Regression")
print(f"Mean Squared Error: {mse_poly}")

# Save the plot for Polynomial Regression
plt.figure(figsize=(10, 6))
plt.scatter(X, y, color='blue', label='Actual Data')
plt.plot(X, y_pred_poly, color='red', label='Regression Curve')
plt.xlabel('SquareFeet')
plt.ylabel('Price')
plt.title('Polynomial Regression: SquareFeet vs. Price')
plt.legend()
plt.savefig('polynomial_regression.png')
plt.clf()

Result
The experiment was conducted for Single-variable, Multi-variable, and Polynomial Regression, and the
Mean Squared Errors for each were obtained.

Viva Questions
General Concepts
1. What is regression analysis, and why is it important?

• Regression analysis is a statistical technique used to model and analyze the relationships between a dependent variable and one or more independent variables. It helps in understanding the strength and nature of these relationships, making predictions, and inferring causal relationships.

2. What are the differences between linear regression and polynomial regression?

• Linear regression models the relationship between the independent and dependent variables
using a straight line. Polynomial regression extends this by fitting a polynomial function to the
data, allowing for more complex, non-linear relationships.

Single-variable Regression
1. What is single-variable regression?

• Single-variable regression, or simple linear regression, involves modeling the relationship between
a single independent variable and a dependent variable. It aims to find the best-fitting line that
minimizes the error between the predicted and actual values.


2. What are the assumptions of linear regression?

• The main assumptions are:
– Linearity: The relationship between the independent and dependent variables is linear.
– Independence: Observations are independent of each other.
– Homoscedasticity: The variance of the residuals is constant across all levels of the independent variable.
– Normality: The residuals (errors) are normally distributed.

3. What is the Mean Squared Error (MSE), and how is it used in evaluation?

• The Mean Squared Error (MSE) measures the average squared difference between the predicted values and the actual values. It is used to assess the accuracy of the regression model, with lower values indicating better model performance.
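For reference, for n samples with actual values \(y_i\) and predictions \(\hat{y}_i\), the MSE is:
\[
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
\]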

Multi-variable Regression
1. What is multi-variable regression, and how does it differ from single-variable regression?

• Multi-variable regression, or multiple linear regression, models the relationship between multiple independent variables and a dependent variable. Unlike single-variable regression, which uses one feature, multi-variable regression uses several features to predict the target variable.

2. How does adding polynomial features in multi-variable regression help in modeling?

• Adding polynomial features allows the model to capture non-linear relationships between the features and the target variable. This can improve model performance by fitting the data more accurately.

3. What is the purpose of feature scaling in regression analysis?

• Feature scaling ensures that all features contribute equally to the model's performance by standardizing their ranges. This is particularly important in algorithms sensitive to feature scales, though it is less critical for linear regression.

Polynomial Regression
1. What is polynomial regression, and when would you use it?

• Polynomial regression is used when the relationship between the independent and dependent variables is non-linear. By fitting a polynomial function, the model can capture more complex patterns in the data that linear regression cannot.

2. How do you choose the degree of the polynomial in polynomial regression?

• The degree of the polynomial is typically chosen based on model performance metrics and cross-validation. Higher-degree polynomials can fit the training data better but may also lead to overfitting. It's important to balance model complexity with generalization, as sketched below.
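As an illustration of degree selection, the sketch below compares polynomial fits of increasing degree by cross-validated MSE; it assumes X and y are the feature and target arrays from the program above.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Score polynomial models of degree 1..5 with 5-fold cross-validation
for degree in range(1, 6):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
    print(f"Degree {degree}: mean CV MSE = {-scores.mean():.2f}")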
3. What is overfitting, and how can it be prevented in polynomial regression?

• Overfitting occurs when the model learns the noise in the training data rather than the underlying pattern, resulting in poor performance on new data. To prevent overfitting, one can use techniques such as regularization, cross-validation, and selecting an appropriate degree for the polynomial.


Model Evaluation and Plotting


1. Why is it important to visualize regression results?

• Visualizing regression results helps in understanding the model’s fit, identifying patterns, and
detecting any issues such as overfitting or underfitting. It also aids in communicating results
and insights to stakeholders.

2. What information does a residual plot provide?

• A residual plot shows the residuals (errors) on the y-axis versus the predicted values or another variable on the x-axis. It helps in diagnosing issues with the regression model, such as non-linearity, heteroscedasticity, or outliers.

3. How does saving plots assist in the regression analysis process?

• Saving plots allows for a permanent record of the model’s performance and the relationships
between variables. This is useful for documentation, reporting, and further analysis.

Practical Considerations
1. What steps would you take if your regression model performs poorly?

• If a model performs poorly, consider the following steps:


– Reevaluate Features: Ensure relevant features are included and irrelevant ones are removed.
– Feature Engineering: Create new features or transform existing ones.
– Model Tuning: Adjust model parameters and perform cross-validation.
– Data Cleaning: Handle missing or inconsistent data properly.
– Try Different Models: Explore other regression techniques or algorithms.

2. How would you interpret the coefficients of a regression model?

• In linear regression, each coefficient represents the change in the dependent variable for a one-unit change in the corresponding feature, assuming other features remain constant. In polynomial regression, coefficients reflect the contribution of polynomial terms to the prediction.

Experiment 7: Logistic Regression

Aim
To implement and demonstrate logistic regression on a dataset and evaluate the accuracy.

Algorithm
1. Load Dataset: Read the dataset from a CSV file.

2. Prepare Data: Extract features and target variables from the dataset.

3. Split Data: Divide the dataset into training and testing sets.

4. Initialize Model: Create an instance of the Logistic Regression model.

5. Train Model: Fit the logistic regression model on the training data.

6. Predict: Use the trained model to predict the target values on the test set.

7. Evaluate Model: Calculate the accuracy and confusion matrix of the model.

8. Plot Decision Boundary: Visualize the decision boundary by plotting it along with the data
points.

Program
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

data = pd.read_csv('student_admission.csv')
X = data[['Entrance_Score', '12th_Class_Score']].values
y = data['Admitted'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}")
print("Confusion Matrix:")
print(conf_matrix)

x_min, x_max = X[:, 0].min() - 5, X[:, 0].max() + 5
y_min, y_max = X[:, 1].min() - 5, X[:, 1].max() + 5
xx, yy = np.meshgrid(np.arange(x_min, x_max, 1), np.arange(y_min, y_max, 1))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.figure(figsize=(10, 6))
plt.contourf(xx, yy, Z, alpha=0.8, cmap=plt.cm.coolwarm)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k', marker='o', cmap=plt.cm.coolwarm)
plt.xlabel('Entrance Score')
plt.ylabel('12th Class Score')
plt.title("Logistic Regression Decision Boundary")
plt.savefig('logistic_regression_boundary.png')
plt.clf()

Result
The logistic regression model was successfully implemented and evaluated on the given dataset.

Viva Questions
1. What is logistic regression, and how does it differ from linear regression?
Logistic regression is a statistical model used for binary classification tasks. Unlike linear regression,
which predicts continuous outcomes, logistic regression predicts the probability of a binary outcome
by applying a logistic function to the linear combination of input features.
2. What is the purpose of the sigmoid function in logistic regression?
The sigmoid function, or logistic function, maps any real-valued number into the (0, 1) interval. It
is used in logistic regression to convert the linear output of the model into a probability that can be
used to make binary predictions.
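For reference, the sigmoid function is
\[
\sigma(z) = \frac{1}{1 + e^{-z}}
\]
which maps any real z to a value strictly between 0 and 1.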
3. What is the decision boundary in logistic regression, and how is it visualized?
The decision boundary is a line or surface that separates different classes in the feature space. It
is determined by the logistic regression model and can be visualized by plotting the probability of
classification over the feature space and highlighting the regions corresponding to different classes.
4. What is the role of the training and testing sets in model evaluation?
The training set is used to fit the model and learn the parameters, while the testing set is used to
evaluate the model’s performance on unseen data. This separation helps in assessing how well the
model generalizes to new, unseen examples.
5. What is a confusion matrix, and how is it used in evaluating a logistic regression model?
A confusion matrix is a table used to evaluate the performance of a classification model. It displays
the number of true positive, true negative, false positive, and false negative predictions. By analyzing
these values, one can assess how well the model is distinguishing between classes.


6. What do the terms True Positive (TP), True Negative (TN), False Positive (FP), and
False Negative (FN) represent in a confusion matrix?

• True Positive (TP): The number of instances correctly predicted as positive.


• True Negative (TN): The number of instances correctly predicted as negative.
• False Positive (FP): The number of instances incorrectly predicted as positive (Type I error).
• False Negative (FN): The number of instances incorrectly predicted as negative (Type II
error).

7. How can you calculate accuracy from a confusion matrix?

Accuracy is calculated using the formula:
\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\]
It represents the proportion of correctly classified instances out of the total instances.

8. How can you use a confusion matrix to calculate precision and recall?

• Precision is calculated as:
\[
\text{Precision} = \frac{TP}{TP + FP}
\]
It measures the proportion of true positives among the predicted positives.

• Recall (or Sensitivity) is calculated as:
\[
\text{Recall} = \frac{TP}{TP + FN}
\]
It measures the proportion of true positives among the actual positives.

9. What is the F1-score, and how is it derived from the confusion matrix?
The F1-score is the harmonic mean of precision and recall, providing a single metric that balances both. It is calculated as:
\[
\text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
\]
This metric is useful when the class distribution is imbalanced and provides a more comprehensive evaluation of the model's performance.

10. What insights can be gained from analyzing the confusion matrix for a logistic regression
model?
Analyzing the confusion matrix helps in understanding where the model is making errors. For
example, high false positive rates may indicate that the model is overly aggressive in predicting the
positive class, while high false negative rates suggest it may be missing many positive instances. It
provides a detailed view of the model’s strengths and weaknesses in classification tasks.

Experiment 8: Naive Bayes Classifier

Aim
To implement a Python program that uses the Naive Bayes classifier to classify a dataset and calculate
the accuracy, precision, and recall for the dataset.

Algorithm:
1. Load the dataset from a CSV file.

2. Split the dataset into features (X) and target variable (y).

3. Split the data into training and testing sets.

4. Initialize the Naive Bayes classifier.

5. Train the model using the training data.

6. Predict the target values for the test data.

7. Evaluate the model by calculating accuracy, precision, recall, and confusion matrix.

8. Plot and save the decision boundary for visualization.

Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

data = pd.read_csv('student_performance.csv')

X = data[['Study_Hours', 'Previous_Exam_Score']].values
y = data['Passed'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = GaussianNB()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print("Confusion Matrix:")
print(conf_matrix)

x_min, x_max = X[:, 0].min() - 5, X[:, 0].max() + 5
y_min, y_max = X[:, 1].min() - 5, X[:, 1].max() + 5
xx, yy = np.meshgrid(np.arange(x_min, x_max, 1),
                     np.arange(y_min, y_max, 1))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.figure(figsize=(10, 6))
plt.contourf(xx, yy, Z, alpha=0.8, cmap=plt.cm.coolwarm)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k', marker='o', cmap=plt.cm.coolwarm)
plt.xlabel('Study Hours')
plt.ylabel('Previous Exam Score')
plt.title("Naive Bayes Decision Boundary")

plt.savefig('naive_bayes_boundary.png')
plt.clf()

Result:
The Naive Bayes classifier was successfully implemented and evaluated. The accuracy, precision, and recall
were calculated.

Viva Questions and Answers


1. What is the Naive Bayes classifier?
Naive Bayes is a probabilistic classifier based on Bayes' Theorem. It assumes that the features are independent of each other, which is why it is called "naive." Despite this assumption, it performs well in various classification tasks.

2. What is Bayes' Theorem?

Bayes' Theorem is a mathematical formula used to calculate the conditional probability of an event based on prior knowledge of related conditions. It is expressed as:
\[
P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}
\]
Where:

• P(A|B) is the posterior probability of event A occurring given B is true.
• P(B|A) is the likelihood of event B occurring given A is true.
• P(A) is the prior probability of event A.
• P(B) is the prior probability of event B.

3. What are the assumptions made by the Naive Bayes classifier?


The primary assumption is that the features (or attributes) used to predict the target class are
independent of each other. This assumption is rarely true in real-world data but the classifier still
often works well in practice.

4. How is the Naive Bayes classifier different from other classifiers like Logistic Regression?
Naive Bayes is a probabilistic model based on Bayes’ Theorem, while logistic regression is a linear
model that predicts the probability of class membership using a logistic function. Naive Bayes
assumes feature independence, whereas logistic regression does not assume any particular relationship
between the features.

5. What is a confusion matrix?


A confusion matrix is a table used to evaluate the performance of a classification algorithm. It displays the actual vs. predicted classifications, helping in calculating metrics such as accuracy, precision, recall, and F1-score. The matrix consists of four key values:

• True Positives (TP)


• False Positives (FP)
• True Negatives (TN)
• False Negatives (FN)

6. How is accuracy calculated using a confusion matrix?

Accuracy is the ratio of correctly predicted instances (both true positives and true negatives) to the total number of predictions:
\[
\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}
\]

7. What is precision and how is it calculated?

Precision is the ratio of true positive predictions to the total number of positive predictions (both true and false). It is calculated as:
\[
\text{Precision} = \frac{TP}{TP + FP}
\]
It answers the question: "Of the instances predicted as positive, how many were actually positive?"

8. What is recall and how is it calculated?

Recall (or Sensitivity) is the ratio of true positive predictions to the total number of actual positive instances. It is calculated as:
\[
\text{Recall} = \frac{TP}{TP + FN}
\]
It answers the question: "Of the instances that were actually positive, how many did the classifier correctly identify?"

9. What are the limitations of the Naive Bayes classifier?


One of the main limitations is the assumption of feature independence, which is not always true
in real-world datasets. Additionally, Naive Bayes may not perform well when there are complex
relationships between features.

Experiment 9: Decision Tree-based ID3 Algorithm

Aim
To write a Python program to demonstrate the working of the Decision Tree using the ID3 algorithm.

Algorithm
1. Load the dataset: Load a dataset that can be used for building the decision tree. This dataset should
have labeled data, where each sample consists of features and the corresponding target class.

2. Split the dataset: Divide the dataset into training and testing sets to evaluate the performance of
the decision tree.

3. Build the Decision Tree:

• Use the ID3 algorithm to construct the decision tree.


• The decision tree is built by recursively selecting the feature that provides the highest information gain (based on entropy) to split the data.

4. Train the Model: Use the training dataset to train the decision tree model.

5. Classify New Data: Input a new data sample and classify it using the trained decision tree.

6. Evaluate the Model: Use the test data to evaluate the accuracy of the model.

7. Plot and save the structure of the decision tree.

8. Display the accuracy, precision, and recall, along with an image of the decision tree.

Program
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix
from sklearn import tree
import matplotlib.pyplot as plt

data = pd.read_csv('student_performance.csv')

X = data[['Study_Hours', 'Assignments_Completed', 'Attendance']]
y = data['Performance']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier(criterion='entropy')
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
conf_matrix = confusion_matrix(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print("Confusion Matrix:")
print(conf_matrix)

plt.figure(figsize=(12, 8))
tree.plot_tree(clf, feature_names=['Study_Hours', 'Assignments_Completed', 'Attendance'],
               class_names=['Low', 'Medium', 'High'], filled=True)
plt.title("Decision Tree Visualization")
plt.savefig('decision_tree_visualization.png')
plt.clf()

Result
The Decision Tree classifier is successfully trained on the dataset and the model’s accuracy, precision, and
recall are calculated using the test set.

Viva Questions and Answers


1. What is a decision tree?
A decision tree is a supervised learning algorithm that splits data into subsets based on feature values
to make decisions and classify data. It forms a tree-like structure where internal nodes represent
features, branches represent decision rules, and leaves represent outcomes or classes.

2. What is the ID3 algorithm?


The ID3 (Iterative Dichotomiser 3) algorithm is a decision tree algorithm that builds trees by recursively selecting the feature that maximizes information gain (a measure of entropy reduction) for splitting the dataset. It selects the most discriminative feature at each node.

3. What are the criteria used for splitting nodes in the ID3 algorithm?
ID3 uses information gain, which is based on the concept of entropy. The feature with the highest
information gain is selected for splitting the node.

4. What is entropy in the context of decision trees?


Entropy is a measure of uncertainty or randomness in the data. It indicates how pure or impure a
dataset is. If all elements in a set belong to the same class, the entropy is 0 (pure); if the elements
are randomly distributed among classes, the entropy is high.
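For reference, for a dataset S with c classes, where p_i is the proportion of samples belonging to class i, the entropy is:
\[
H(S) = -\sum_{i=1}^{c} p_i \log_2 p_i
\]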


5. What is information gain?


Information gain is the reduction in entropy after splitting a dataset on a particular feature. It
measures how well a feature separates the classes. A feature with high information gain is chosen
for splitting at a node.
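For reference, the information gain from splitting S on a feature A is:
\[
\text{Gain}(S, A) = H(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} H(S_v)
\]
where S_v is the subset of S for which feature A takes the value v.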

6. What are the common metrics used to evaluate a decision tree model?
Common metrics include:

• Accuracy: The proportion of correctly predicted instances.


• Precision: The proportion of true positive predictions out of all positive predictions.
• Recall: The proportion of true positive predictions out of all actual positive instances.
• Confusion Matrix: A matrix showing the counts of true positives, false positives, true negatives, and false negatives for each class.

7. What is overfitting, and how can it affect decision trees?


Overfitting occurs when a model is too complex and captures noise in the training data, leading to
poor generalization on unseen data. In decision trees, overfitting can happen if the tree is grown too
deep, leading to high variance.

8. How can you prevent overfitting in decision trees?


Overfitting can be prevented by:

• Pruning: Reducing the size of the tree by removing nodes that provide little information gain.
• Setting a maximum depth: Limiting the depth of the tree to control its complexity.
• Minimum samples for split: Requiring a minimum number of samples for splitting a node.
• Cross-validation: Using validation data to tune parameters and avoid overfitting.

9. What is pruning in decision trees?


Pruning is the process of removing sections of the tree that provide little predictive power to reduce
overfitting.

10. What are some real-world applications of decision trees?

• Customer segmentation: Classifying customers based on purchasing behavior.


• Medical diagnosis: Classifying diseases based on patient symptoms and test results.
• Fraud detection: Identifying fraudulent transactions or activities.
• Loan approval: Classifying whether a loan applicant is likely to default.

Experiment 10: Support Vector Machine Classifier

Aim
Write a Python program to implement a Support Vector Machine (SVM) classifier to classify a dataset
and evaluate the accuracy.

Algorithm
1. Load the dataset from a CSV file.

2. Separate the dataset into features (input variables) and target (output variable).

3. Split the dataset into training and testing sets.

4. Initialize a Support Vector Machine (SVM) classifier.

5. Train the classifier on the training data.

6. Predict the target values for the testing set.

7. Evaluate the model using metrics such as accuracy, precision, and recall.

8. Plot the decision boundary and save it to a file.

Program
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

data = pd.read_csv('student_performance_svm.csv')

X = data[['Hours_Studied', 'Previous_Score']].values
y = data['Pass'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = SVC(kernel='linear')
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print("Confusion Matrix:")
print(conf_matrix)

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.figure(figsize=(10, 6))
plt.contourf(xx, yy, Z, alpha=0.8, cmap=plt.cm.coolwarm)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k', marker='o', cmap=plt.cm.coolwarm)
plt.xlabel('Hours Studied')
plt.ylabel('Previous Score')
plt.title("SVM Decision Boundary")
plt.savefig('svm_decision_boundary.png')
plt.clf()

Result
The SVM classifier was successfully implemented to classify the dataset. The accuracy, precision, and
recall were calculated.

Viva Questions and Answers


1. What is a Support Vector Machine (SVM)?
SVM is a supervised machine learning algorithm used for classification and regression tasks. It works
by finding a hyperplane that best separates different classes in the data.

2. How does an SVM classify data?


SVM finds the optimal hyperplane that maximizes the margin between different classes. The data
points that are closest to the hyperplane are called support vectors, and these help define the decision
boundary.

3. What is the purpose of the kernel in an SVM?


The kernel function allows SVM to perform classification in a higher-dimensional space, enabling it
to handle non-linearly separable data. Common kernels include linear, polynomial, and RBF (Radial
Basis Function).
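Switching kernels in the program above is a one-line change; a minimal sketch (the C and gamma values shown are scikit-learn's defaults, not tuned for this dataset):

from sklearn.svm import SVC

# RBF kernel for non-linearly separable data; C controls regularization
# strength and gamma controls the kernel width
model = SVC(kernel='rbf', C=1.0, gamma='scale')
model.fit(X_train, y_train)   # X_train, y_train from the program above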


4. What does the ’linear’ kernel do in SVM?


The linear kernel is used when the data is linearly separable, meaning it can be separated by a
straight line or hyperplane. It performs classification by constructing a linear decision boundary.

5. What is a hyperplane in the context of SVM?


A hyperplane is a decision boundary that separates different classes of data. In a two-dimensional
space, it is a line, and in a three-dimensional space, it is a plane.

6. What are support vectors?


Support vectors are the data points that lie closest to the decision boundary. They are critical in
determining the position and orientation of the hyperplane.

7. What is the difference between a linear and non-linear SVM?


A linear SVM uses a straight hyperplane to separate classes, while a non-linear SVM uses a kernel
function (such as RBF or polynomial) to map data into higher dimensions, allowing more complex
decision boundaries.

8. How is the decision boundary determined in SVM?


The decision boundary is determined by the support vectors, and it is placed to maximize the margin
between the classes.

9. What are the advantages of using SVM?


SVM is effective for high-dimensional spaces, works well with a clear margin of separation, is effective
even when the number of features exceeds the number of samples, and can use different kernel
functions to handle non-linear data.

10. What are the disadvantages of using SVM?


SVM can be computationally intensive, especially for large datasets, is sensitive to the choice of
kernel and its parameters, and does not perform well with overlapping classes or noisy data.

Experiment 11: K-Nearest Neighbor Algorithm

Aim
To implement the K-Nearest Neighbor (KNN) algorithm to classify a dataset and evaluate the classification
accuracy.

Algorithm
1. Read the dataset from a CSV file and load it into a pandas DataFrame.

2. Preprocess the data by extracting features and target labels.

3. Normalize the feature data to ensure all features are on the same scale.

4. Split the dataset into training and testing sets.

5. Import the K-Nearest Neighbor classifier from the sklearn library.

6. Instantiate the KNN classifier with a chosen number of neighbors (k).

7. Fit the classifier to the training data.

8. Predict the class labels for the test data.

9. Calculate the accuracy, precision, and recall of the classifier.

10. Save and display the confusion matrix.

Program
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv('student_performance_knn.csv')

X = data[['StudyHours', 'Attendance']].values
y = data['Performance'].map({'Low': 0, 'Medium': 1, 'High': 2}).values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
conf_matrix = confusion_matrix(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")

plt.figure(figsize=(6, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Low', 'Medium', 'High'], yticklabels=['Low', 'Medium', 'High'])
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.savefig('confusion_matrix_knn.png')
plt.close()

Result
The K-Nearest Neighbor (KNN) classifier was successfully implemented to classify the given dataset. The
accuracy, precision, and recall values were calculated.

Viva Questions and Answers


1. What is the K-Nearest Neighbor (KNN) algorithm?
KNN is a simple, non-parametric, instance-based learning algorithm used for classification and regression. It classifies a new sample by finding the 'K' nearest data points in the training data and assigning the majority class among those neighbors.

2. What is the role of 'K' in KNN?
The value of 'K' in KNN represents the number of nearest neighbors considered when classifying a new point. A larger value of 'K' reduces the effect of noise but may also smooth out the boundaries between classes.

3. How is the distance between neighbors calculated in KNN?
Common distance metrics include Euclidean distance, Manhattan distance, and Minkowski distance. Euclidean distance is the most commonly used; it is the straight-line distance between two points in a multi-dimensional space.
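For reference, the Euclidean distance between points x = (x_1, ..., x_n) and y = (y_1, ..., y_n) is:
\[
d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
\]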
4. What happens if K is too small or too large?
If K is too small (e.g., K=1), the model becomes sensitive to noise and may overfit the training data. If K is too large, the model may oversimplify the decision boundary, leading to underfitting and reduced accuracy.

5. Is KNN a parametric or non-parametric algorithm?


KNN is a non-parametric algorithm because it does not assume any underlying probability distribution for the data. Instead, it makes decisions based on the proximity of neighboring data points.

6. What is meant by feature scaling in KNN? Why is it important?


Feature scaling ensures that all features contribute equally to the distance calculation. Since KNN
relies on distance measurements, features with larger scales may dominate the results if scaling is
not applied.

7. What are the advantages of the KNN algorithm?


Advantages of KNN include simplicity, easy implementation, and effectiveness in low-dimensional
spaces. It also has no assumptions about the underlying data distribution.

8. What are the disadvantages of the KNN algorithm?


Disadvantages include high computation time during classification, sensitivity to the choice of 'K', and performance degradation in high-dimensional data due to the curse of dimensionality.

Experiment 12: K-Means Clustering

Aim
To implement the K-Means Clustering algorithm using a given dataset and evaluate the clustering results.

Algorithm
1. Load the dataset containing the features to be clustered.

2. Prepare the data by cleaning or normalizing if necessary.

3. Choose the number of clusters, k.

4. Initialize k cluster centroids randomly from the data points.

5. For each data point, compute the distance from the centroids and assign it to the nearest cluster.

6. After all points are assigned, recompute the centroid of each cluster by averaging the points in that
cluster.

7. Repeat the assign-recompute process until the centroids no longer change or the change is minimal.

8. Plot the data points and the resulting clusters.

Program
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import numpy as np

data = pd.read_csv('student_performance_kmeans.csv')

X = data[['Math_Score', 'Science_Score']].values

kmeans = KMeans(n_clusters=3, random_state=42)


kmeans.fit(X)

labels = kmeans.labels_
centroids = kmeans.cluster_centers_

plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='rainbow', edgecolor='k', marker='o')
plt.scatter(centroids[:, 0], centroids[:, 1], c='black', s=200, alpha=0.75)


plt.xlabel('Math Score')
plt.ylabel('Science Score')
plt.title('K-Means Clustering of Student Performance')
plt.savefig('kmeans_clusters.png')
plt.clf()

Result
The K-Means Clustering algorithm was successfully implemented and the clustering results were visualized.

Viva Questions and Answers


1. What is K-Means Clustering?
K-Means Clustering is an unsupervised learning algorithm used to partition a dataset into k clusters,
where each data point belongs to the cluster with the nearest mean (centroid).

2. How does K-Means Clustering work?


The algorithm works by first initializing k centroids randomly from the dataset. Each data point is
assigned to the nearest centroid to form clusters. The centroids are then recomputed as the mean
of the data points in each cluster. This process is repeated until the centroids stop changing or a
stopping condition is met.
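
The assign-recompute loop can also be sketched directly in NumPy (a simplified version of the steps in the Algorithm section; it assumes no cluster becomes empty):

import numpy as np

def kmeans_from_scratch(X, k=3, max_iter=100, seed=42):
    rng = np.random.default_rng(seed)
    # Initialize k centroids by picking k distinct data points at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids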

3. What are centroids in K-Means Clustering?


Centroids are the central points of each cluster. They represent the average position of all the data
points within that cluster.

4. What are the limitations of K-Means Clustering?


Some limitations include: it requires the number of clusters to be specified in advance, it may
converge to local minima, and it is sensitive to the initial placement of centroids.
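
Because k must be fixed in advance, a common heuristic is the elbow method: fit K-Means for several values of k, plot the inertia (within-cluster sum of squared distances), and look for the bend in the curve. A sketch reusing X, KMeans, and plt from the program above:

inertias = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)   # within-cluster sum of squares

plt.plot(range(1, 8), inertias, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia')
plt.savefig('kmeans_elbow.png')
plt.clf()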

5. What is meant by ‘unsupervised learning’?


In unsupervised learning, the algorithm is provided with data that is not labeled. The objective is
to identify patterns or groupings in the data without any supervision or pre-labeled classes.

6. What is the difference between K-Means and hierarchical clustering?


K-Means is a partitional clustering algorithm that creates k clusters at once, while hierarchical
clustering creates a hierarchy of clusters either through agglomeration (bottom-up approach) or
division (top-down approach).
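
For comparison, a hierarchical (agglomerative) clustering of the same data can be sketched with scikit-learn, assuming the same two-column score matrix X from the program above:

from sklearn.cluster import AgglomerativeClustering

# Bottom-up clustering: each point starts as its own cluster and pairs are merged
agg = AgglomerativeClustering(n_clusters=3)
agg_labels = agg.fit_predict(X)
print(agg_labels[:10])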

7. What are the stopping criteria in K-Means Clustering?


K-Means Clustering can stop when the centroids no longer change, a maximum number of iterations
is reached, or the assignment of points to clusters stabilizes.
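
In scikit-learn these criteria appear as constructor parameters; the values below are the library defaults, shown only as a sketch:

kmeans = KMeans(n_clusters=3,
                max_iter=300,   # upper bound on assign-recompute iterations
                tol=1e-4,       # stop when centroid movement falls below this
                random_state=42)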

Experiment 13: Artificial Neural Network using Backpropagation

Aim
To implement an Artificial Neural Network (ANN) using the Backpropagation algorithm and test it on a
given dataset.

Algorithm
1. Load the dataset and preprocess it as required.

2. Split the dataset into training and testing sets.

3. Initialize the Artificial Neural Network. Set the number of input neurons, hidden neurons, and
output neurons, and initialize weights and biases randomly for each layer.

4. For each layer in the network, compute the weighted sum of inputs, apply the activation function
(sigmoid), and pass the outputs to the next layer.

5. Calculate the error at the output layer using the loss function and propagate the error back through
the network.

6. Train the ANN using the backpropagation algorithm by feeding the training data.

7. Update the weights of the network during training to minimize the error using gradient descent.

8. Once the network is trained, predict the target values for the test set.

9. Evaluate the model using accuracy and confusion matrix.

Program
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    return x * (1 - x)

class NeuralNetwork:

    def __init__(self, input_size, hidden_size, output_size):
        self.weights_input_hidden = np.random.uniform(size=(input_size, hidden_size))
        self.weights_hidden_output = np.random.uniform(size=(hidden_size, output_size))
        self.bias_hidden = np.random.uniform(size=(1, hidden_size))
        self.bias_output = np.random.uniform(size=(1, output_size))

    def feedforward(self, X):
        self.hidden_layer_input = np.dot(X, self.weights_input_hidden) + self.bias_hidden
        self.hidden_layer_output = sigmoid(self.hidden_layer_input)
        self.output_layer_input = np.dot(self.hidden_layer_output,
                                         self.weights_hidden_output) + self.bias_output
        self.output_layer_output = sigmoid(self.output_layer_input)
        return self.output_layer_output

    def backpropagation(self, X, y, learning_rate):
        output_error = y - self.output_layer_output
        output_delta = output_error * sigmoid_derivative(self.output_layer_output)
        hidden_error = output_delta.dot(self.weights_hidden_output.T)
        hidden_delta = hidden_error * sigmoid_derivative(self.hidden_layer_output)
        self.weights_hidden_output += self.hidden_layer_output.T.dot(output_delta) * learning_rate
        self.bias_output += np.sum(output_delta, axis=0, keepdims=True) * learning_rate
        self.weights_input_hidden += X.T.dot(hidden_delta) * learning_rate
        self.bias_hidden += np.sum(hidden_delta, axis=0, keepdims=True) * learning_rate

    def train(self, X, y, learning_rate, epochs):
        for _ in range(epochs):
            self.feedforward(X)
            self.backpropagation(X, y, learning_rate)

    def predict(self, X):
        return self.feedforward(X)

data = pd.read_csv('student_performance_ann.csv')

le = LabelEncoder()
data['Result'] = le.fit_transform(data['Result'])

X = data[['Feature1', 'Feature2', 'Feature3']].values

y = pd.get_dummies(data['Result']).values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

input_size = X_train.shape[1]
hidden_size = 5
output_size = y_train.shape[1]


nn = NeuralNetwork(input_size, hidden_size, output_size)

learning_rate = 0.1
epochs = 1000
nn.train(X_train, y_train, learning_rate, epochs)

y_pred = nn.predict(X_test)
y_pred = np.argmax(y_pred, axis=1)
y_test = np.argmax(y_test, axis=1)

accuracy = accuracy_score(y_test, y_pred)


cm = confusion_matrix(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print(f"Confusion Matrix:\n{cm}")
plt.figure(figsize=(5,5))
plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion Matrix')
plt.colorbar()
plt.savefig('confusion_matrix_ann.png')

Result
The Artificial Neural Network model was successfully implemented using the Backpropagation algorithm.
The accuracy of the model was evaluated, and the confusion matrix was generated.

Viva Questions and Answers


1. What is an Artificial Neural Network (ANN)?
An Artificial Neural Network is a computational model inspired by the way biological neural networks
in the brain process information. It consists of layers of interconnected neurons, where each
connection has a weight. ANNs are used for tasks like classification, regression, and pattern
recognition.

2. What is the Backpropagation algorithm?


The Backpropagation algorithm is a supervised learning technique used for training artificial neural
networks. It computes the gradient of the loss function with respect to each weight by applying
the chain rule, propagating the error backward through the network, and adjusting the weights to
minimize the error.
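
For the sigmoid network implemented above, with squared error $E = \tfrac{1}{2}(y - o)^2$, the updates performed in backpropagation() can be written as

\[
\delta_{out} = (y - o)\, o(1 - o), \qquad
\delta_{hid} = \left(\delta_{out} W_{ho}^{T}\right) h(1 - h),
\]
\[
W_{ho} \leftarrow W_{ho} + \eta\, h^{T}\delta_{out}, \qquad
W_{ih} \leftarrow W_{ih} + \eta\, X^{T}\delta_{hid},
\]

where $h$ is the hidden-layer output, $o$ the network output, and $\eta$ the learning rate; the factors $o(1 - o)$ and $h(1 - h)$ are the sigmoid derivatives computed by sigmoid_derivative().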

3. How does Backpropagation work in ANN?


Backpropagation works by comparing the network’s output with the actual target, calculating the
error, and adjusting the weights by propagating the error backward through the network layers. This
process iteratively reduces the error by updating the weights in the direction that minimizes the loss
function.

4. What are the components of an Artificial Neural Network?


The main components of an ANN are:

• Input Layer: Receives the input data.


• Hidden Layers: Intermediate layers where computations are performed using activation functions.


• Output Layer: Produces the final output or prediction.


• Weights: Parameters that adjust the influence of inputs.
• Bias: Additional parameter to help shift the activation function.
• Activation Function: Non-linear function applied to neurons’ outputs to introduce non-linearity.

5. What is an activation function? Why is it important?


An activation function is a mathematical function applied to the output of each neuron. It introduces
non-linearity into the model, enabling the network to learn complex patterns. Common activation
functions include Sigmoid, Tanh, and ReLU (Rectified Linear Unit).
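
Minimal NumPy sketches of the three functions mentioned:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))   # squashes inputs to (0, 1)

def tanh(x):
    return np.tanh(x)             # squashes inputs to (-1, 1)

def relu(x):
    return np.maximum(0, x)       # zero for negatives, identity otherwise

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x), tanh(x), relu(x))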

6. What is the purpose of the hidden layers in an ANN?


Hidden layers allow the network to capture and learn complex features from the input data by
combining the outputs of neurons in each layer. The more hidden layers, the deeper the network,
and typically, the more complex the patterns it can learn.

7. What is the role of learning rate in Backpropagation?


The learning rate controls how much the weights are adjusted during training. A small learning rate
may slow down the training process, while a large learning rate can cause the model to overshoot
the optimal weights and not converge properly.
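
As a quick illustrative experiment (assuming the NeuralNetwork class and the prepared data from the program above, where y_test has already been converted to class labels), the same network can be trained with several learning rates and compared; exact numbers will vary with the dataset:

for lr in [0.01, 0.1, 1.0]:
    net = NeuralNetwork(input_size, hidden_size, output_size)
    net.train(X_train, y_train, learning_rate=lr, epochs=1000)
    preds = np.argmax(net.predict(X_test), axis=1)
    print(f"learning rate {lr}: accuracy = {accuracy_score(y_test, preds):.2f}")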

8. What are the advantages of using an ANN over traditional algorithms?


ANNs can model complex, non-linear relationships and can automatically extract features from
raw data without the need for manual feature engineering. They are highly versatile and can be
applied to a wide variety of problems, including image recognition, speech processing, and time-series
prediction.

9. What is overfitting in the context of neural networks?


Overfitting occurs when a model learns not only the underlying patterns in the training data but
also the noise and outliers. As a result, it performs well on the training data but poorly on new,
unseen data.

10. What are the common techniques to prevent overfitting in ANN?


Common techniques include:

• Using a larger dataset.


• Implementing regularization techniques like L1 or L2 regularization.
• Using dropout, which randomly drops neurons during training to prevent over-reliance on certain paths.
• Early stopping, which stops training once the model’s performance on the validation set stops
improving.

11. What is the role of the MLPClassifier in scikit-learn?


The MLPClassifier (Multi-layer Perceptron) in scikit-learn is used for building and training
feedforward artificial neural networks. It supports multiple hidden layers and uses the
backpropagation algorithm for training.
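
As a sketch, the experiment above could be rerun with MLPClassifier, which also exposes L2 regularization (alpha) and early stopping, two of the overfitting remedies listed earlier; scikit-learn expects class labels rather than one-hot targets:

from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(5,),
                    alpha=1e-4,           # L2 regularization strength
                    early_stopping=True,  # hold out part of the training data
                    max_iter=1000,
                    random_state=42)
# y_train is one-hot above, so convert back to class labels
mlp.fit(X_train, np.argmax(y_train, axis=1))
print("MLP accuracy:", mlp.score(X_test, y_test))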

