KJD ML File

Experiment-1

Objective:
To become familiar with the basics of Python programming by
completing the following tasks:

1. Create a program that adds two numbers to demonstrate the print function.
2. Write a program that checks whether a number is odd or even,
illustrating conditional statements.
3. Implement a program to explore the use of functions.
4. Write a program to retrieve values from a dictionary.
5. Perform various operations on strings using Python.

Related Theory:

1. Python is an advanced, object-oriented programming language known for its simplicity and ease of use, making it an ideal choice for beginners.
2. The first task involves adding two numbers, serving as an example
of the print statement's usage. The second task showcases
conditional statements, which are crucial in programming for
controlling the flow based on different conditions.
3. The third task introduces Python functions, which are reusable
blocks of code designed to carry out specific operations. The
fourth task involves accessing dictionary values, where each
key-value pair is separated by a colon and enclosed within curly
braces.
4. Lastly, various string operations are demonstrated, showing how
Python handles single and double quotes equally.

Code & Outputs:

Adding Two Numbers:

# Program to add two numbers

num1 = 5

num2 = 10

sum = num1 + num2

print(f"The sum of {num1} and {num2} is {sum}")

Output:
The sum of 5 and 10 is 15


Check Odd or Even:


# Program to check if the number is odd or even

number = 7

if number % 2 == 0:

print(f"{number} is even.")

else:

print(f"{number} is odd.")

Output:

7 is odd.


Using Functions:
# Program to illustrate the use of functions

def greet(name):

return f"Hello, {name}!"

print(greet("Alice"))

Output:

Hello, Alice!


Accessing Dictionary Values:

# Program to access values in a dictionary

student = {
    "name": "John",
    "age": 20,
    "course": "Computer Science"
}

print(f"Name: {student['name']}, Age: {student['age']}, Course: {student['course']}")

Output:

Name: John, Age: 20, Course: Computer Science


String Operations:

# Program to demonstrate string operations

message = "Hello, Python!"

print(message.upper()) # Convert to uppercase

print(message.lower()) # Convert to lowercase

print(message.replace("Python", "World"))  # Replace a substring

Output:

HELLO, PYTHON!

hello, python!

Hello, World!

Conclusion:
Through these exercises, a range of fundamental programming concepts
was explored, including arithmetic operations, conditional logic,
functions, data structures, and string manipulation. These basic
concepts are crucial for progressing to more advanced Python
programming.

Experiment-2

Objective:
To understand and apply various NumPy functions.

Related Theory:

1. NumPy is a comprehensive array-processing package that offers high-performance operations on multidimensional arrays and includes a variety of tools for working with these arrays. It is a cornerstone library for scientific computing in Python and is freely available as open-source software.
2. Key Features of NumPy:
○ A powerful N-dimensional array object.
○ Advanced broadcasting capabilities for operations.
○ Tools for integrating with C/C++ and Fortran code.
○ Essential functions for linear algebra, Fourier transforms, and
random number generation.

Code & Outputs:

Creating N-dimensional Arrays:

import numpy as np

# Creating a 2D array

array = np.array([[1, 2, 3], [4, 5, 6]])

print("2D Array:")

print(array)

Output:

2D Array:

[[1 2 3]

[4 5 6]]


Attributes & Properties:

# Displaying array attributes

print(f"Shape: {array.shape}")

print(f"Size: {array.size}")

print(f"Data type: {array.dtype}")

Output:
Shape: (2, 3)

Size: 6

Data type: int64


Indexing, Slicing & Iteration:

# Slicing the array

sliced_array = array[0, 1:]

print("Sliced Array:")

print(sliced_array)

# Iterating over the array

print("Iterating over Array:")

for row in array:

print(row)

Output:
Sliced Array:

[2 3]

Iterating over Array:

[1 2 3]

[4 5 6]


NumPy Operations:
# Performing operations on the array

sum_array = np.sum(array)

mean_array = np.mean(array)

print(f"Sum of array: {sum_array}")

print(f"Mean of array: {mean_array}")

Output:
Sum of array: 21

Mean of array: 3.5
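
Broadcasting (illustrative sketch):

The theory above lists broadcasting as a key NumPy feature, but the code does not show it. Below is a minimal sketch; the array and scalar values are arbitrary and not part of the original experiment.

# Broadcasting: operations between arrays of different shapes
import numpy as np

array = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
row = np.array([10, 20, 30])               # shape (3,)

# The 1-D row is stretched across both rows of the 2-D array
print(array + row)   # [[11 22 33] [14 25 36]]
print(array * 2)     # scalar broadcast: [[ 2  4  6] [ 8 10 12]]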

Conclusion:
This experiment provided a detailed exploration of NumPy's capabilities,
demonstrating its efficiency in numerical computation and data
manipulation. By implementing a range of functions, essential concepts
such as array creation, indexing, slicing, broadcasting, and mathematical
operations were covered, highlighting NumPy’s critical role in scientific
computing.

Experiment-03

OBJECTIVE: To demonstrate graphical visualizations in Python using the matplotlib and seaborn libraries.

RELATED THEORY:

(i) Matplotlib is a Python 2D plotting library which provides both a very quick way to visualize data from Python and publication-quality figures in many formats. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.

(ii) Matplotlib tries to make easy things easy and hard things possible. You can generate plots, histograms, power spectra, bar charts, error charts, scatterplots, etc. with just a few lines of code.

(iii) Seaborn is a visualization library for statistical graphics plotting in Python. It provides attractive default styles and color palettes to make statistical plots more appealing. It is built on top of the matplotlib library and is closely integrated with the data structures from pandas. Seaborn aims to make visualization a central part of exploring and understanding data, and it provides dataset-oriented APIs so that we can switch between different visual representations of the same variables.

#Code:
# Import necessary libraries

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

# Create a sample dataset using pandas

data = {
    'Year': [2015, 2016, 2017, 2018, 2019, 2020, 2021],
    'Sales': [250, 300, 350, 400, 450, 500, 600],
    'Profit': [20, 25, 28, 35, 40, 50, 65],
    'Expenses': [200, 210, 220, 230, 250, 270, 290]
}

df = pd.DataFrame(data)

# Matplotlib - Bar plot for Sales over Years

plt.figure(figsize=(8, 6))

plt.bar(df['Year'], df['Sales'], color='skyblue')

plt.title('Bar Plot: Sales Over the Years')

plt.xlabel('Year')

plt.ylabel('Sales')

plt.grid(True)

plt.show()

# Matplotlib - Multiple Line Plot (Sales, Profit, Expenses over Years)

plt.figure(figsize=(10, 6))

plt.plot(df['Year'], df['Sales'], marker='o', label='Sales', color='blue')

plt.plot(df['Year'], df['Profit'], marker='o', label='Profit', color='green')

plt.plot(df['Year'], df['Expenses'], marker='o', label='Expenses', color='red')

plt.title('Multiple Line Plot: Sales, Profit, and Expenses Over the Years')

plt.xlabel('Year')

plt.ylabel('Amount')

plt.legend()

plt.grid(True)

plt.show()

# Seaborn - Bar plot (with comparison of multiple variables)

plt.figure(figsize=(10, 6))

df_melted = df.melt('Year', var_name='Metric', value_name='Amount')

sns.barplot(x='Year', y='Amount', hue='Metric', data=df_melted)

plt.title('Sales, Profit, and Expenses Over the Years (Bar Plot)')

plt.grid(True)

plt.show()

# Seaborn - Multiple Line Plot using seaborn

plt.figure(figsize=(10, 6))

sns.lineplot(x='Year', y='value', hue='variable', data=pd.melt(df, ['Year']))

plt.title('Sales, Profit, and Expenses Over the Years (Line Plot)')

plt.grid(True)

plt.show()

# Seaborn - Heatmap for correlation between Sales, Profit, and Expenses

plt.figure(figsize=(8, 6))

sns.heatmap(df[['Sales', 'Profit', 'Expenses']].corr(), annot=True,
cmap='coolwarm', linewidths=0.5)

plt.title('Correlation Heatmap: Sales, Profit, and Expenses')

plt.show()

Output:

CONCLUSION :
In the above program we used matplotlib and seaborn libraries to plot various
visualizations. These visualizations include plotting line plot, bar chart (both
horizontal and vertical) and heatmaps. Several customization options have been
displayed and discussed in the demonstrations above.
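
The conclusion mentions horizontal bar charts, which the code above does not produce. A minimal sketch using the same sample values is shown below; it is an illustrative addition, not part of the original code.

import matplotlib.pyplot as plt

years = [2015, 2016, 2017, 2018, 2019, 2020, 2021]
sales = [250, 300, 350, 400, 450, 500, 600]

# Horizontal bar chart: categories on the y-axis, values on the x-axis
plt.figure(figsize=(8, 6))
plt.barh(years, sales, color='skyblue')
plt.title('Horizontal Bar Plot: Sales Over the Years')
plt.xlabel('Sales')
plt.ylabel('Year')
plt.show()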

Experiment-04

Objective: To read, display, and save an image using OpenCV.

Related Theory:
OpenCV: OpenCV (Open Source Computer Vision Library) is a library of programming
functions primarily aimed at real-time computer vision. It provides various
functionalities to manipulate and process images and videos.
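
The code below reads the image from a URL and displays it with matplotlib. For reference, a minimal local-file sketch using only OpenCV calls is shown here; the file names 'input.jpg' and 'output.jpg' are assumed placeholders.

import cv2

# Read an image from disk (the path is a placeholder; replace it with a real file)
img = cv2.imread('input.jpg')

if img is None:
    print("Error: could not read 'input.jpg'")
else:
    # Display the image in a native OpenCV window until a key is pressed
    cv2.imshow('Image', img)
    cv2.waitKey(0)
    cv2.destroyAllWindows()

    # Save a copy of the image to a new file
    cv2.imwrite('output.jpg', img)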

#Code:

# Import necessary libraries


import cv2
import numpy as np
import urllib.request
import matplotlib.pyplot as plt

# Function to load an image from a URL


def load_image_from_url(url):
# Create a request object with a custom User-Agent header
req = urllib.request.Request(url, headers={'User-Agent':
'Mozilla/5.0'}) # Mimic a browser request

try:
# Open the URL and read the image as a byte array
resp = urllib.request.urlopen(req) # Use the request object
image_array = np.asarray(bytearray(resp.read()), dtype="uint8")
# Decode the byte array to an image
img = cv2.imdecode(image_array, cv2.IMREAD_COLOR)
return img
except urllib.error.HTTPError as e:
print(f"Error: Unable to load image from URL: {e}")
return None

# Provide the URL of the image


image_url = 'https://i.postimg.cc/mZmnpJVZ/image.png'  # Replace with the actual image URL

# Step 1: Load the image from the URL


img_original = load_image_from_url(image_url)

# Check if the image was successfully loaded
if img_original is None:
print("Error: Unable to load the image from URL.")
else:
    # Step 2: Process the image (convert it to grayscale for demonstration)
img_gray = cv2.cvtColor(img_original, cv2.COLOR_BGR2GRAY)

    # Step 3: Display the original and processed images side by side using matplotlib
plt.figure(figsize=(12, 6))

# Display original image


plt.subplot(1, 2, 1)
    plt.imshow(cv2.cvtColor(img_original, cv2.COLOR_BGR2RGB))  # Convert BGR to RGB for display
plt.title('Original Image')
plt.axis('off')
# Display processed image (Grayscale)
plt.subplot(1, 2, 2)
plt.imshow(img_gray, cmap='gray')
plt.title('Processed Image (Grayscale)')
plt.axis('off')
# Show the images
plt.show()
# Step 4: Optionally save the processed image
save_path = 'processed_image.jpg'
cv2.imwrite(save_path, img_gray)
print(f"Processed image saved as '{save_path}'")

Output:

Conclusion: We demonstrated how to use OpenCV in Python to read an image, display it, and save the processed result to a new file. This basic functionality serves as the foundation for many image processing tasks that can be accomplished using OpenCV.

Experiment-05:
OBJECTIVE: To Implement Linear Regression.

Related Theory: In linear regression, we assume a linear relationship between the input variable (x) and the single output variable (y). In a linear model, we try to fit a line (y = mx + c) to the given data in such a manner that it has minimum error. The general line is written as: y_hat = w0 + w1 * x.
The cost function (mean squared error) is given by: J(w0, w1) = (1 / (2m)) * sum over i of (y_hat_i - y_i)^2.
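
A minimal NumPy sketch of the model and cost function described above, on illustrative data (not the dataset used in the experiment):

import numpy as np

# Illustrative data points (x, y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 8.1, 9.9])

# Fit y_hat = w0 + w1 * x by ordinary least squares
w1, w0 = np.polyfit(x, y, deg=1)

def cost(w0, w1, x, y):
    # J(w0, w1) = (1 / 2m) * sum((y_hat - y)^2)
    m = len(y)
    y_hat = w0 + w1 * x
    return np.sum((y_hat - y) ** 2) / (2 * m)

print(f"w0 = {w0:.3f}, w1 = {w1:.3f}, cost = {cost(w0, w1, x, y):.4f}")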

#Code:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score

# Create a larger dataset


data = {
'Year': [2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018,
2019, 2020, 2021, 2022],
'Sales': [150, 200, 220, 250, 280, 300, 350, 400, 450, 500, 550,
600, 650],
'Profit': [10, 15, 18, 20, 22, 25, 28, 35, 40, 45, 50, 65, 70],
'Expenses': [100, 120, 150, 170, 180, 210, 220, 230, 250, 270, 280,
290, 300]
}

df = pd.DataFrame(data)

# Split data into training and testing sets


X_sales = df[['Sales']] # Simple linear regression (using only Sales)
X_all = df[['Sales', 'Expenses']]  # Multiple linear regression (using Sales and Expenses)
y = df['Profit']

# Train-test split for both models

X_train_sales, X_test_sales, y_train, y_test = train_test_split(X_sales, y, test_size=0.2, random_state=42)
X_train_all, X_test_all, _, _ = train_test_split(X_all, y, test_size=0.2, random_state=42)

### 1. Simple Linear Regression (Sales -> Profit)

# Create and train linear regression model


simple_lr = LinearRegression()
simple_lr.fit(X_train_sales, y_train)

# Predict using the model


y_pred_simple = simple_lr.predict(X_test_sales)

# Visualize the results


plt.figure(figsize=(8, 6))
plt.scatter(X_test_sales, y_test, color='blue', label='Actual')
plt.plot(X_test_sales, y_pred_simple, color='red', label='Predicted')
plt.title('Simple Linear Regression: Sales vs Profit')
plt.xlabel('Sales')
plt.ylabel('Profit')
plt.legend()
plt.grid(True)
plt.show()

# Print metrics for simple linear regression


print(f"Simple Linear Regression - MSE: {mean_squared_error(y_test,
y_pred_simple):.2f}, R2: {r2_score(y_test, y_pred_simple):.2f}")

### 2. Multiple Linear Regression (Sales, Expenses -> Profit)

# Create and train multiple linear regression model


multiple_lr = LinearRegression()
multiple_lr.fit(X_train_all, y_train)

# Predict using the model


y_pred_multiple = multiple_lr.predict(X_test_all)

# Print metrics for multiple linear regression


print(f"Multiple Linear Regression - MSE: {mean_squared_error(y_test,
y_pred_multiple):.2f}, R2: {r2_score(y_test, y_pred_multiple):.2f}")

# Visualize actual vs predicted values for multiple regression

plt.figure(figsize=(8, 6))
plt.scatter(df['Sales'], y, color='blue', label='Actual')
plt.scatter(df['Sales'], multiple_lr.predict(df[['Sales',
'Expenses']]), color='green', label='Predicted')
plt.title('Multiple Linear Regression: Sales, Expenses vs Profit')
plt.xlabel('Sales')
plt.ylabel('Profit')
plt.legend()
plt.grid(True)
plt.show()

### 3. Polynomial Regression (Sales -> Profit with polynomial degree)

# Create polynomial features for Sales (degree 2)


poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_sales)

# Train-test split for polynomial regression


X_train_poly, X_test_poly, y_train_poly, y_test_poly = train_test_split(X_poly, y, test_size=0.2, random_state=42)

# Train the model on polynomial features


poly_lr = LinearRegression()
poly_lr.fit(X_train_poly, y_train_poly)

# Predict using polynomial regression model


y_pred_poly = poly_lr.predict(X_test_poly)

# Visualize the polynomial regression fit


plt.figure(figsize=(8, 6))
plt.scatter(X_sales, y, color='blue', label='Actual')
plt.plot(X_sales, poly_lr.predict(X_poly), color='red',
label='Polynomial Fit')
plt.title('Polynomial Regression (Degree 2): Sales vs Profit')
plt.xlabel('Sales')
plt.ylabel('Profit')
plt.legend()
plt.grid(True)
plt.show()

# Print metrics for polynomial regression

print(f"Polynomial Regression (Degree 2) - MSE: {mean_squared_error(y_test_poly, y_pred_poly):.2f}, R2: {r2_score(y_test_poly, y_pred_poly):.2f}")

Output:

Conclusion: From the above program, we conclude that regression is a statistical process for estimating the relationship between a dependent variable and one or more independent variables.

Experiment-06
Objective: To Implement Logistic Regression.

Related Theory: Logistic regression is a predictive analysis technique. Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more independent variables. The standard logistic function, called the sigmoid function, is given by sigmoid(value) = 1 / (1 + e^(-value)), where e is the base of the natural logarithms (Euler's number) and value is the actual numerical value that you want to transform.
The cost function (log-loss) is given by: J = -(1/m) * sum over i of [y_i * log(y_hat_i) + (1 - y_i) * log(1 - y_hat_i)].
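
A minimal NumPy sketch of the sigmoid and log-loss described above, with illustrative values:

import numpy as np

def sigmoid(z):
    # sigmoid(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y_true, y_prob):
    # J = -(1/m) * sum(y*log(p) + (1-y)*log(1-p))
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

print(sigmoid(0.0))                          # 0.5
print(log_loss([1, 0, 1], [0.9, 0.2, 0.8]))  # small loss for good predictions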

#Code:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Step 1: Generate synthetic dataset


np.random.seed(42) # For reproducibility

# Generate features
X1 = np.random.normal(0, 1, 500) # Feature 1
X2 = np.random.normal(0, 1, 500) # Feature 2

# Generate labels based on a linear combination of the features


Y = (X1 + X2 > 0).astype(int) # Labels: 1 if X1 + X2 > 0, else 0

# Create a DataFrame
df = pd.DataFrame({'Feature1': X1, 'Feature2': X2, 'Label': Y})

# Step 2: Prepare the data (features and labels)


X = df[['Feature1', 'Feature2']] # Features
y = df['Label'] # Target variable

# Step 3: Split the dataset into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Create and train the logistic regression model


logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# Step 5: Make predictions on the test set


y_pred = logreg.predict(X_test)

# Step 6: Evaluate the model


accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

# Confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print(f'Confusion Matrix:\n{conf_matrix}')

# Detailed classification report


print(f'Classification Report:\n{classification_report(y_test, y_pred)}')

# Step 7: Visualize the results and the decision boundary


plt.figure(figsize=(10, 6))

# Scatter plot of the data points


sns.scatterplot(data=df, x='Feature1', y='Feature2', hue='Label',
palette='Set1', alpha=0.6)

# Create a grid of points to plot the decision boundary


xlim = plt.xlim() # Get the current x-axis limits
ylim = plt.ylim() # Get the current y-axis limits
xx, yy = np.meshgrid(np.linspace(xlim[0], xlim[1], 100),
np.linspace(ylim[0], ylim[1], 100))
Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])  # Predict the label for each point in the grid
Z = Z.reshape(xx.shape) # Reshape back to grid

# Plot the decision boundary


plt.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')
plt.title('Logistic Regression Decision Boundary')

plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend(title='Label')
plt.show()

Output:

Accuracy: 0.99
Confusion Matrix:
[[42 1]
[ 0 57]]
Classification Report:
precision recall f1-score support

0 1.00 0.98 0.99 43


1 0.98 1.00 0.99 57

accuracy 0.99 100


macro avg 0.99 0.99 0.99 100
weighted avg 0.99 0.99 0.99 100

CONCLUSION: From the above program, we see that logistic regression is used when the response variable is categorical in nature (for instance, yes/no, true/false, red/green/blue), whereas linear regression is used when the response variable is continuous (for instance, weight, height, number of hours). Linear regression uses the ordinary least squares method to minimize the error and arrive at the best possible fit, while logistic regression uses the maximum likelihood method to arrive at the solution. Hence, logistic regression is better suited than linear regression for categorical response variables.

Experiment-07
Objective: To process data using Pandas.

RELATED THEORY: pandas is a popular Python library used for data science and analysis. Used in conjunction with other data science toolsets such as NumPy and Matplotlib, a modeler can create end-to-end analytic workflows to solve business problems. Many datasets have missing, malformed, or erroneous data; this is often unavoidable, since anything from incomplete reporting to technical glitches can produce "dirty" data. Pandas provides a robust library of functions to help you clean up, sort through, and make sense of your datasets, no matter what state they're in. In this experiment, a small synthetic health dataset (age, gender, cholesterol, and a disease label) is used for pre-processing.

#Code:
# Import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Step 1: Load Data


# For demonstration, let's create a synthetic dataset
data = {
'Age': [25, 30, 35, np.nan, 40, 45, 50, 55, 60, 65, 70, 75, 80,
np.nan, 90],
'Gender': ['Male', 'Female', 'Female', 'Male', 'Male', 'Female',
'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female',
'Male'],
'Cholesterol': [200, 240, 230, 210, 220, 250, np.nan, 240, 230,
200, 210, np.nan, 250, 260, 270],
'Has_Disease': [0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Step 2: Explore Data


print("Data Summary:")
print(df.describe()) # Summary statistics

print("\nData Types:")
print(df.dtypes) # Data types
print("\nInitial Data:")
print(df)

# Step 3: Handling Missing Values


# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())

# Fill missing values for 'Age' with the mean age and 'Cholesterol' with the median
df['Age'].fillna(df['Age'].mean(), inplace=True)
df['Cholesterol'].fillna(df['Cholesterol'].median(), inplace=True)

# Verify that missing values have been filled


print("\nMissing Values after filling:")
print(df.isnull().sum())

# Step 4: Encoding Categorical Variables


# Convert 'Gender' to numerical values (0 for Male, 1 for Female)
df['Gender'] = df['Gender'].map({'Male': 0, 'Female': 1})

# Step 5: Feature Scaling


# Feature scaling (standardization)
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[['Age', 'Cholesterol']] = scaler.fit_transform(df[['Age',
'Cholesterol']])

# Step 6: Splitting the Data into Features and Labels


X = df[['Age', 'Gender', 'Cholesterol']] # Features
y = df['Has_Disease'] # Target variable

# Step 7: Removing Duplicates


df.drop_duplicates(inplace=True)

# Step 8: Outlier Detection and Treatment


# For simplicity, let's consider values more than 3 standard deviations from the mean as outliers
z_scores = np.abs((df[['Age', 'Cholesterol']] - df[['Age',
'Cholesterol']].mean()) / df[['Age', 'Cholesterol']].std())

df = df[(z_scores < 3).all(axis=1)]

# Final dataset after preprocessing


print("\nFinal Preprocessed Data:")
print(df)

# Split the dataset into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=42)

# Check the shapes of the resulting datasets


print("\nShapes of Train and Test Sets:")
print(f"X_train: {X_train.shape}, y_train: {y_train.shape}")
print(f"X_test: {X_test.shape}, y_test: {y_test.shape}")

Output:
Data Summary:
Age Cholesterol Has_Disease
count 13.000000 13.000000 15.000000
mean 55.384615 231.538462 0.600000
std 20.151669 22.673830 0.507093
min 25.000000 200.000000 0.000000
25% 40.000000 210.000000 0.000000
50% 55.000000 230.000000 1.000000
75% 70.000000 250.000000 1.000000
max 90.000000 270.000000 1.000000

Data Types:
Age float64
Gender object
Cholesterol float64
Has_Disease int64
dtype: object

Initial Data:
Age Gender Cholesterol Has_Disease
0 25.0 Male 200.0 0
1 30.0 Female 240.0 1
2 35.0 Female 230.0 0
3 NaN Male 210.0 1
4 40.0 Male 220.0 0
5 45.0 Female 250.0 1
6 50.0 Male NaN 1
7 55.0 Female 240.0 0
8 60.0 Male 230.0 1

9 65.0 Female 200.0 1
10 70.0 Male 210.0 0
11 75.0 Female NaN 0
12 80.0 Male 250.0 1
13 NaN Female 260.0 1
14 90.0 Male 270.0 1

Missing Values:
Age 2
Gender 0
Cholesterol 2
Has_Disease 0
dtype: int64

Missing Values after filling:


Age 0
Gender 0
Cholesterol 0
Has_Disease 0
dtype: int64

Final Preprocessed Data:


Age Gender Cholesterol Has_Disease
0 -1.685768e+00 0 -1.544516 0
1 -1.408363e+00 1 0.427207 1
2 -1.130958e+00 1 -0.065724 0
3 3.942160e-16 0 -1.051585 1
4 -8.535533e-01 0 -0.558655 0
5 -5.761485e-01 1 0.920137 1
6 -2.987437e-01 0 -0.065724 1
7 -2.133883e-02 1 0.427207 0
8 2.560660e-01 0 -0.065724 1
9 5.334708e-01 1 -1.544516 1
10 8.108756e-01 0 -1.051585 0
11 1.088280e+00 1 -0.065724 0
12 1.365685e+00 0 0.920137 1
13 3.942160e-16 1 1.413068 1
14 1.920495e+00 0 1.905998 1

Shapes of Train and Test Sets:


X_train: (12, 3), y_train: (12,)
X_test: (3, 3), y_test: (3,)

Conclusion:
We conclude that pandas has several selection methods which you can use to slice and dice the dataset based on your queries. It mainly helps in the following data-processing tasks:
• Deal with missing data
• Add default values
• Remove incomplete rows
• Deal with erroneous columns
• Normalize data types
• Change case
• Rename columns
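
The conclusion lists a few operations that the code above does not demonstrate (removing incomplete rows, normalizing data types, changing case, renaming columns); a minimal sketch on a hypothetical frame:

import pandas as pd

df = pd.DataFrame({
    'age': ['25', '30', None],               # stored as strings, one missing
    'City Name': ['delhi', 'MUMBAI', 'Pune'],
})

df = df.dropna()                                   # remove incomplete rows
df['age'] = df['age'].astype(int)                  # normalize data types
df['City Name'] = df['City Name'].str.title()      # change case
df = df.rename(columns={'City Name': 'city'})      # rename columns
print(df)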

Experiment-08
Objective: To implement KNN Algorithm.
RELATED THEORY:
The KNN algorithm assumes that similar things exist in close proximity; in other words, similar things are near to each other. The KNN algorithm:
1. Load the data.
2. Initialize K to your chosen number of neighbors.
3. For each example in the data:
   a. Calculate the distance between the query example and the current example from the data.
   b. Add the distance and the index of the example to an ordered collection.
4. Sort the ordered collection of distances and indices from smallest to largest (in ascending order) by the distances.
5. Pick the first K entries from the sorted collection.
6. Get the labels of the selected K entries.
7. If regression, return the mean of the K labels.
8. If classification, return the mode of the K labels.

Code:
import numpy as np
import pandas as pd
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score

# Step 1: Generate a dataset with 1000 samples, 20 features, and 3 classes
X, y = make_classification(
n_samples=1000, # Number of samples
n_features=20, # Number of features
n_classes=3, # Number of classes
n_informative=15, # Number of informative features
random_state=42 # Ensures reproducibility
)

# Step 2: Create a DataFrame with the features and labels


# Naming features as Feature1, Feature2, ..., Feature20
feature_columns = [f'Feature{i+1}' for i in range(X.shape[1])]
df = pd.DataFrame(X, columns=feature_columns)
df['Target'] = y

# Display the first few rows of the DataFrame


print("First 5 rows of the dataset:")
print(df.head())

# Step 3: Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=42)

# Step 4: Define the Euclidean distance function


def euclidean_distance(x1, x2):
return np.sqrt(np.sum((x1 - x2) ** 2))

# Step 5: Define the KNN prediction function


def knn_predict(X_train, y_train, x_test, k):
distances = [euclidean_distance(x_test, x_train) for x_train in
X_train]
k_indices = np.argsort(distances)[:k]
k_nearest_labels = [y_train[i] for i in k_indices]
most_common = Counter(k_nearest_labels).most_common(1)
return most_common[0][0]

# Step 6: Run KNN on all test samples


def knn_predict_all(X_train, y_train, X_test, k):
return np.array([knn_predict(X_train, y_train, x_test, k) for
x_test in X_test])

# Step 7: Set the number of neighbors and run predictions


k = 5
y_pred = knn_predict_all(X_train, y_train, X_test, k)

# Step 8: Evaluate the accuracy


accuracy = accuracy_score(y_test, y_pred)
print(f"\nKNN Classification Accuracy on 1000 samples: {accuracy *
100:.2f}%")

Output:

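For a cross-check of the from-scratch implementation, scikit-learn's KNeighborsClassifier can be run on the same kind of synthetic data; this is an illustrative sketch, not part of the original experiment.

# Cross-check the from-scratch KNN against scikit-learn's implementation
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, n_classes=3,
                           n_informative=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

clf = KNeighborsClassifier(n_neighbors=5)   # same k and Euclidean metric as above
clf.fit(X_train, y_train)
print(f"sklearn KNN accuracy: {accuracy_score(y_test, clf.predict(X_test)) * 100:.2f}%")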
Experiment-09
Objective: To classify data using SVM.
RELATED THEORY:
The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (N being the number of features) that distinctly classifies the data points.

Code:
# Import necessary libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
import pandas as pd

# Step 1: Load the Iris dataset


iris = datasets.load_iris()
X, y = iris.data, iris.target

# Step 2: Create a DataFrame for better visualization (optional)


df = pd.DataFrame(X, columns=iris.feature_names)
df['Target'] = y

# Display the first few rows of the DataFrame


print("First 5 rows of the Iris dataset:")
print(df.head())

# Step 3: Split the data into training and test sets


X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=42)

# Step 4: Initialize and train the SVM model

svm_model = SVC(kernel='linear')  # You can choose other kernels like 'rbf', 'poly', etc.
svm_model.fit(X_train, y_train)

# Step 5: Make predictions on the test set


y_pred = svm_model.predict(X_test)

# Step 6: Evaluate the model's accuracy


accuracy = accuracy_score(y_test, y_pred)
print(f"\nSVM Classification Accuracy on Iris dataset: {accuracy *
100:.2f}%")

Output:

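The code comment notes that other kernels such as 'rbf' or 'poly' can also be used; the following sketch (not part of the original experiment) compares kernels on the same split.

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Train one SVM per kernel and report test accuracy
for kernel in ['linear', 'rbf', 'poly']:
    model = SVC(kernel=kernel)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{kernel:>6} kernel accuracy: {acc * 100:.2f}%")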
Experiment-10
Objective: To implement neural network.
RELATED THEORY:
Neural networks are a biologically inspired programming paradigm that enables a computer to learn from observational data. An artificial neural network (ANN) is based on a collection of connected units or nodes called artificial neurons. ANNs have been used on a variety of tasks, including computer vision, speech recognition, machine translation, social network filtering, playing board and video games, and medical diagnosis.

Code:
import numpy as np
import matplotlib.pyplot as plt

# Helper functions
def sigmoid(x):
return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
return x * (1 - x)

def relu(x):
return np.maximum(0, x)

def relu_derivative(x):
return np.where(x <= 0, 0, 1)

def softmax(x):
exps = np.exp(x - np.max(x, axis=1, keepdims=True))
return exps / np.sum(exps, axis=1, keepdims=True)

# Neural Network Class


class NeuralNetwork:
def __init__(self, input_size, hidden_size, output_size):
# Initialize weights and biases
self.W1 = np.random.randn(input_size, hidden_size)
self.b1 = np.zeros((1, hidden_size))
self.W2 = np.random.randn(hidden_size, output_size)
self.b2 = np.zeros((1, output_size))

def forward(self, X):


# Forward pass
self.Z1 = np.dot(X, self.W1) + self.b1
self.A1 = relu(self.Z1) # Hidden layer activation
self.Z2 = np.dot(self.A1, self.W2) + self.b2
self.A2 = softmax(self.Z2) # Output layer activation
return self.A2

def backward(self, X, y, output, learning_rate=0.01):


# Compute gradients using backpropagation
m = y.shape[0]
dZ2 = output - y
dW2 = np.dot(self.A1.T, dZ2) / m
db2 = np.sum(dZ2, axis=0, keepdims=True) / m

dA1 = np.dot(dZ2, self.W2.T)


dZ1 = dA1 * relu_derivative(self.A1)
dW1 = np.dot(X.T, dZ1) / m
db1 = np.sum(dZ1, axis=0, keepdims=True) / m

# Update weights and biases


self.W1 -= learning_rate * dW1
self.b1 -= learning_rate * db1
self.W2 -= learning_rate * dW2
self.b2 -= learning_rate * db2

def train(self, X, y, epochs=1000, learning_rate=0.01):
for epoch in range(epochs):
output = self.forward(X)
self.backward(X, y, output, learning_rate)

if epoch % 100 == 0:
loss = -np.mean(y * np.log(output))
print(f"Epoch {epoch}, Loss: {loss}")

def predict(self, X):


output = self.forward(X)
return np.argmax(output, axis=1)

# Generating complex data (using circles or moons from sklearn.datasets)
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)

# Convert y to one-hot encoded


y_one_hot = np.zeros((y.size, 2))
y_one_hot[np.arange(y.size), y] = 1

# Visualize the data


plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis')
plt.title("Generated Data")
plt.show()

# Initialize neural network


input_size = 2 # Two input features
hidden_size = 5 # Hidden layer size
output_size = 2 # Two output classes

nn = NeuralNetwork(input_size, hidden_size, output_size)

# Train the neural network


nn.train(X, y_one_hot, epochs=1000, learning_rate=0.01)

# Predict on the training data


predictions = nn.predict(X)

# Accuracy
accuracy = np.mean(predictions == y)
print(f"Training Accuracy: {accuracy * 100:.2f}%")

# Visualizing the decision boundary
def plot_decision_boundary(X, y, model):
x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01),
np.arange(y_min, y_max, 0.01))
grid = np.c_[xx.ravel(), yy.ravel()]

Z = model.predict(grid)
Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.8, cmap='viridis')


plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', edgecolors='k')
plt.title("Decision Boundary")
plt.show()

# Plot the decision boundary


plot_decision_boundary(X, y, nn)

Output:

Experiment-11
Objective: To implement the Decision Tree algorithm on breast cancer data to predict whether a person has cancer or not.

RELATED THEORY:
Decision Tree is the most powerful and popular tool for
classification and prediction. A Decision tree is a
flowchart-like tree structure, where each internal node
denotes a test on an attribute, each branch represents an
outcome of the test, and each leaf node (terminal node) holds
a class label. The strengths of decision tree methods are:
1. Decision trees are able to generate understandable rules.
2. Decision trees perform classification without requiring
much computation.
3. Decision trees are able to handle both continuous and
categorical variables.
4. Decision trees provide a clear indication of which fields
are most important for prediction or classification.
The weaknesses of decision tree methods are:
1. Decision trees are less appropriate for estimation tasks
where the goal is to predict the value of a continuous
attribute.
2. Decision trees are prone to errors in classification
problems with many classes and a relatively small number of
training examples.
3. Decision trees can be computationally expensive to train.
The process of growing a decision tree is computationally
expensive. At each node, each candidate splitting field must
be sorted before its best split can be found. In some
algorithms, combinations of fields are used and a search must
be made for optimal combining weights. Pruning algorithms can
also be expensive since many candidate sub-trees must be
formed and compared.

Code:
import numpy as np
import pandas as pd

# Sample dataset (replace this with your data)


data = {
'Feature1': [2.771244718, 1.728571309, 3.678319846, 3.961043357,
2.999208922, 7.497545867, 9.00220326, 7.444542326, 10.12493903,
6.642287351],

'Feature2': [1.784783929, 1.169761413, 2.81281357, 2.61995032,
2.209014212, 3.162953546, 3.339047188, 0.476683375, 3.234550982,
3.319983761],
'Label': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
}
df = pd.DataFrame(data)

# Function to split dataset


def split_data(index, value, dataset):
left, right = list(), list()
for row in dataset:
if row[index] < value:
left.append(row)
else:
right.append(row)
return left, right

# Calculate Gini Index for a split


def gini_index(groups, classes):
n_instances = float(sum([len(group) for group in groups]))
gini = 0.0
for group in groups:
size = float(len(group))
if size == 0:
continue
score = 0.0
class_counts = [row[-1] for row in group]
for class_val in classes:
p = class_counts.count(class_val) / size
score += p * p
gini += (1.0 - score) * (size / n_instances)
return gini

# Selecting the best split point for a dataset


def get_best_split(dataset):
class_values = list(set(row[-1] for row in dataset))
    best_index, best_value, best_score, best_groups = 999, 999, 999, None
for index in range(len(dataset[0]) - 1):
for row in dataset:
groups = split_data(index, row[index], dataset)
gini = gini_index(groups, class_values)
if gini < best_score:

                best_index, best_value, best_score, best_groups = index, row[index], gini, groups
return {'index': best_index, 'value': best_value, 'groups':
best_groups}

# Creating terminal node


def to_terminal(group):
outcomes = [row[-1] for row in group]
return max(set(outcomes), key=outcomes.count)

# Split the node or make it a terminal node


def split(node, max_depth, min_size, depth):
left, right = node['groups']
del(node['groups'])
if not left or not right:
node['left'] = node['right'] = to_terminal(left + right)
return
if depth >= max_depth:
        node['left'], node['right'] = to_terminal(left), to_terminal(right)
return
if len(left) <= min_size:
node['left'] = to_terminal(left)
else:
node['left'] = get_best_split(left)
split(node['left'], max_depth, min_size, depth+1)
if len(right) <= min_size:
node['right'] = to_terminal(right)
else:
node['right'] = get_best_split(right)
split(node['right'], max_depth, min_size, depth+1)

# Building the decision tree


def build_tree(train, max_depth, min_size):
root = get_best_split(train)
split(root, max_depth, min_size, 1)
return root

# Making predictions with the decision tree


def predict(node, row):
if row[node['index']] < node['value']:
if isinstance(node['left'], dict):
return predict(node['left'], row)

else:
return node['left']
else:
if isinstance(node['right'], dict):
return predict(node['right'], row)
else:
return node['right']

# Prepare the data


dataset = df.values.tolist()

# Parameters for tree


max_depth = 3
min_size = 1

# Build and train the decision tree


tree = build_tree(dataset, max_depth, min_size)

# Make predictions on the dataset


for row in dataset:
prediction = predict(tree, row)
print(f'Expected={row[-1]}, Predicted={prediction}')

Output:
Expected=0.0, Predicted=0.0
Expected=0.0, Predicted=0.0
Expected=0.0, Predicted=0.0
Expected=0.0, Predicted=0.0
Expected=0.0, Predicted=0.0
Expected=1.0, Predicted=1.0
Expected=1.0, Predicted=1.0
Expected=1.0, Predicted=1.0
Expected=1.0, Predicted=1.0
Expected=1.0, Predicted=1.0
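
The objective mentions breast cancer data, while the code above runs on a small synthetic sample. For reference, a minimal sketch on scikit-learn's built-in breast cancer dataset is shown below; it is an assumed substitute for the intended data, not the original experiment's code.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Limit depth to keep the tree interpretable and reduce overfitting
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)
print(f"Decision tree accuracy: {accuracy_score(y_test, clf.predict(X_test)) * 100:.2f}%")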

Experiment-12
OBJECTIVE: Implementation of Random Forest in Python
RELATED THEORY: Random forest, like its name implies, consists of a
large number of individual decision trees that operate as an ensemble.
Each individual tree in the random forest spits out a class prediction
and the class with the most votes becomes our model’s prediction.

Code:
import numpy as np
import pandas as pd
import random

# Sample dataset (replace this with your data)


data = {
'Feature1': [2.771244718, 1.728571309, 3.678319846, 3.961043357,
2.999208922, 7.497545867, 9.00220326, 7.444542326, 10.12493903,
6.642287351],
'Feature2': [1.784783929, 1.169761413, 2.81281357, 2.61995032,
2.209014212, 3.162953546, 3.339047188, 0.476683375, 3.234550982,
3.319983761],
'Label': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
}
df = pd.DataFrame(data)

# Function to split dataset


def split_data(index, value, dataset):
left, right = list(), list()
for row in dataset:
if row[index] < value:
left.append(row)
else:
right.append(row)
return left, right

# Calculate Gini Index for a split


def gini_index(groups, classes):
n_instances = float(sum([len(group) for group in groups]))
gini = 0.0
for group in groups:
size = float(len(group))
if size == 0:
continue
score = 0.0
class_counts = [row[-1] for row in group]

for class_val in classes:
p = class_counts.count(class_val) / size
score += p * p
gini += (1.0 - score) * (size / n_instances)
return gini

# Selecting the best split point for a dataset


def get_best_split(dataset, n_features):
class_values = list(set(row[-1] for row in dataset))
features = random.sample(range(len(dataset[0]) - 1), n_features)
    best_index, best_value, best_score, best_groups = 999, 999, 999, None
for index in features:
for row in dataset:
groups = split_data(index, row[index], dataset)
gini = gini_index(groups, class_values)
if gini < best_score:
                best_index, best_value, best_score, best_groups = index, row[index], gini, groups
return {'index': best_index, 'value': best_value, 'groups':
best_groups}

# Creating terminal node


def to_terminal(group):
outcomes = [row[-1] for row in group]
return max(set(outcomes), key=outcomes.count)

# Split the node or make it a terminal node


def split(node, max_depth, min_size, depth, n_features):
left, right = node['groups']
del(node['groups'])
if not left or not right:
node['left'] = node['right'] = to_terminal(left + right)
return
if depth >= max_depth:
        node['left'], node['right'] = to_terminal(left), to_terminal(right)
return
if len(left) <= min_size:
node['left'] = to_terminal(left)
else:
node['left'] = get_best_split(left, n_features)
split(node['left'], max_depth, min_size, depth+1, n_features)

if len(right) <= min_size:
node['right'] = to_terminal(right)
else:
node['right'] = get_best_split(right, n_features)
split(node['right'], max_depth, min_size, depth+1, n_features)

# Building the decision tree


def build_tree(train, max_depth, min_size, n_features):
root = get_best_split(train, n_features)
split(root, max_depth, min_size, 1, n_features)
return root

# Making predictions with the decision tree


def predict(node, row):
if row[node['index']] < node['value']:
if isinstance(node['left'], dict):
return predict(node['left'], row)
else:
return node['left']
else:
if isinstance(node['right'], dict):
return predict(node['right'], row)
else:
return node['right']

# Create a random subsample from the dataset with replacement


def subsample(dataset, ratio=1.0):
sample = list()
n_sample = round(len(dataset) * ratio)
while len(sample) < n_sample:
index = random.randrange(len(dataset))
sample.append(dataset[index])
return sample

# Build a random forest model


def random_forest(train, max_depth, min_size, sample_size, n_trees,
n_features):
forest = list()
for _ in range(n_trees):
sample = subsample(train, sample_size)
tree = build_tree(sample, max_depth, min_size, n_features)
forest.append(tree)
return forest

# Make a prediction with a random forest
def bagging_predict(forest, row):
predictions = [predict(tree, row) for tree in forest]
return max(set(predictions), key=predictions.count)

# Prepare the data


dataset = df.values.tolist()

# Parameters for random forest


n_trees = 5
max_depth = 3
min_size = 1
sample_size = 0.8
n_features = int(np.sqrt(len(dataset[0]) - 1))

# Build the random forest


forest = random_forest(dataset, max_depth, min_size, sample_size,
n_trees, n_features)

# Make predictions on the dataset


for row in dataset:
prediction = bagging_predict(forest, row)
print(f'Expected={row[-1]}, Predicted={prediction}')

Output:
Expected=0.0, Predicted=0.0
Expected=0.0, Predicted=0.0
Expected=0.0, Predicted=0.0
Expected=0.0, Predicted=0.0
Expected=0.0, Predicted=0.0
Expected=1.0, Predicted=1.0
Expected=1.0, Predicted=1.0
Expected=1.0, Predicted=1.0
Expected=1.0, Predicted=1.0
Expected=1.0, Predicted=1.0
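
For reference, the same ensemble idea can be reproduced with scikit-learn's RandomForestClassifier; the sketch below uses synthetic data standing in for the tiny ten-row sample above and is not part of the original experiment.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic two-class data (illustrative only)
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Each tree votes; the forest predicts the majority class
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
print(f"Random forest accuracy: {accuracy_score(y_test, forest.predict(X_test)) * 100:.2f}%")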

Experiment 13
OBJECTIVE: Implementation of K Means Clustering algorithm
RELATED THEORY: K-means is an unsupervised learning method for
clustering data points. The algorithm iteratively divides data points
into K clusters by minimizing the variance in each cluster. Here, we
will show you how to estimate the best value for K using the elbow
method, then use K-means clustering to group the data points into
clusters. First, each data point is randomly assigned to one of the K
clusters. Then, we compute the centroid (functionally the center) of
each cluster, and reassign each data point to the cluster with the
closest centroid. We repeat this process until the cluster assignments
for each data point are no longer changing. K-means clustering requires
us to select K, the number of clusters we want to group the data into.
The elbow method lets us graph the inertia (a distance-based metric)
and visualize the point at which it starts decreasing linearly. This
point is referred to as the "elbow" and is a good estimate for the best
value for K based on our data.

Code:
import numpy as np
import matplotlib.pyplot as plt

# Sample data
data = np.array([
[5.9, 3.2],
[4.6, 2.9],
[6.2, 2.8],
[4.7, 3.2],
[5.5, 4.2],
[5.0, 3.0],
[4.9, 3.1],
[6.7, 3.1],
[5.1, 3.8],
[6.0, 3.0]
])

# Function to initialize centroids


def initialize_centroids(data, k):
indices = np.random.choice(data.shape[0], k, replace=False)
centroids = data[indices]
return centroids

# Function to assign clusters


def assign_clusters(data, centroids):
clusters = []
for point in data:

distances = [np.linalg.norm(point - centroid) for centroid in
centroids]
cluster = np.argmin(distances)
clusters.append(cluster)
return np.array(clusters)

# Function to update centroids


def update_centroids(data, clusters, k):
new_centroids = []
for i in range(k):
cluster_points = data[clusters == i]
if len(cluster_points) > 0:
new_centroid = cluster_points.mean(axis=0)
else:
new_centroid = data[np.random.choice(data.shape[0])]
new_centroids.append(new_centroid)
return np.array(new_centroids)

# K-means algorithm
def k_means(data, k, max_iterations=100, tolerance=1e-4):
centroids = initialize_centroids(data, k)
for i in range(max_iterations):
clusters = assign_clusters(data, centroids)
new_centroids = update_centroids(data, clusters, k)

# Check for convergence


diff = np.linalg.norm(new_centroids - centroids)
if diff < tolerance:
print(f'Converged after {i} iterations.')
break
centroids = new_centroids
return centroids, clusters

# Parameters
k = 3 # Number of clusters

# Run K-means
centroids, clusters = k_means(data, k)

# Plotting the results


def plot_clusters(data, centroids, clusters):
colors = ['r', 'g', 'b']
for i in range(k):

cluster_points = data[clusters == i]
plt.scatter(cluster_points[:, 0], cluster_points[:, 1], s=30,
color=colors[i], label=f'Cluster {i+1}')
plt.scatter(centroids[:, 0], centroids[:, 1], s=200, color='black',
marker='X', label='Centroids')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()

plot_clusters(data, centroids, clusters)
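
The theory above mentions the elbow method for choosing K, but the code does not implement it. The following sketch reuses the data, k_means, np, and plt objects defined above to compute the inertia for several values of K; it is an illustrative addition, not part of the original code.

# Elbow method: plot inertia (sum of squared distances to the nearest centroid)
# against K and look for the "elbow" where the curve starts to flatten
def compute_inertia(data, centroids, clusters):
    return sum(np.linalg.norm(data[i] - centroids[clusters[i]]) ** 2
               for i in range(len(data)))

ks = range(1, 6)
inertias = [compute_inertia(data, *k_means(data, k)) for k in ks]

plt.plot(list(ks), inertias, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()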
