0% found this document useful (0 votes)

38 views27 pages

ML W8 Merged

Uploaded by

chaitu4064

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

38 views27 pages

ML W8 Merged

Uploaded by

chaitu4064

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 27

Exp No: Page No:

Date:

Aim:
Develop a program for Bias, Variance, Remove duplicates , Cross Validation.

Description:
Bias:
• Bias refers to errors introduced by approximating a real-world problem, which may be
too complex, with a simplified model. High bias leads to underfitting, where the
model is too simple to capture the underlying patterns in the data.
2. Variance:
• Variance is the error caused by the model’s sensitivity to small fluctuations in the
training data. High variance results in overfitting, where the model captures noise
instead of the underlying pattern, making it perform poorly on new, unseen data.
3. Remove Duplicates:
• Removing duplicates involves identifying and eliminating duplicate records from the
dataset. This helps in reducing redundancy, improving the quality of data, and
ensuring that the model is not biased by repeated data points.
4. Cross-Validation:
• Cross-validation is a technique to evaluate the performance of a model by dividing the
dataset into multiple subsets, training the model on some subsets, and testing it on the
remaining ones. It helps in assessing how well the model generalizes to unseen data,
reducing the risk of overfitting.

Program:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score,precision_score, recall_score
import matplotlib.pyplot as plt
import seaborn as sns

# Load the iris dataset

ADITYA ENGINEERING COLLEGE(A) Roll No: 22A91A61

Exp No: Page No:
Date:
from sklearn.datasets import load_iris
iris = load_iris()
# Convert the dataset into a DataFrame
df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
columns=iris['feature_names']+['target'])
df['target'] = df['target'].astype(int)
df

df.info()
# Remove duplicates
df.drop_duplicates(inplace=True)
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df[iris['feature_names']],df ['target'],
test_size= 0.4, random_state=0)
# Fit a Random Forest Classifier model
model = RandomForestClassifier()
model.fit(X_train, y_train)

# Calculate Bias and Variance

train_error = 1 - model.score(X_train, y_train)
test_error = 1 - model.score(X_test, y_test)
bias = np.mean([train_error, test_error])
variance = model.score(X_test, y_test) - train_error

print("Bias: {:.2f}".format(bias))
print("Variance: {:.2f}".format(variance))
# Cross-validation
cv_scores = cross_val_score(model, df[iris['feature_names']], df['target'], cv=5)
print("\nCross-validation scores: ", cv_scores)
print("Mean CV Accuracy: {:.2f}".format(np.mean(cv_scores)))

ADITYA ENGINEERING COLLEGE(A) Roll No: 22A91A61

Exp No: Page No:
Date:
# Confusion matrix
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 3))
sns.heatmap(cm, annot=True, cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
print("\nAccuracy: {:.2f}".format(accuracy))
print("Precision: {:.2f}".format(precision))
print("Recall: {:.2f}".format(recall))

Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepal length (cm) 150 non-null float64
1 sepal width (cm) 150 non-null float64
2 petal length (cm) 150 non-null float64
3 petal width (cm) 150 non-null float64
4 target 150 non-null int64
dtypes: float64(4), int64(1)
memory usage: 6.0 KB
Bias: 0.02
Variance: 0.97

ADITYA ENGINEERING COLLEGE(A) Roll No: 22A91A61

Exp No: Page No:
Date:
Cross-validation scores: [0.96666667 0.96666667 0.93333333 0.96666667 1. ]
Mean CV Accuracy: 0.97

Accuracy: 0.97
Precision: 0.97
Recall: 0.97

Viva Questions:
1.What is bias in machine learning?
Ans : Bias is the error introduced by approximating a real-world problem with a simplified
model.
2.What happens when a model has high bias?
Ans : It underfits the data, leading to poor performance on both training and test sets.
3.What is variance in machine learning?
Ans : Variance is the model’s sensitivity to small fluctuations in the training data.
4.How does variance affect model generalization?
Ans : High variance leads to poor generalization to unseen data.
5.How can you remove duplicates in a dataset?
Ans : By using functions like drop_duplicates() in pandas for Python.
6.What is k-fold cross-validation?
Ans : A method where data is split into k subsets, and the model is trained k times, each time
using a different subset as validation.

ADITYA ENGINEERING COLLEGE(A) Roll No: 22A91A61

Exp No: Page No:
Date:

Aim:
Consider Patient Dataset. Apply linear classification technique(SVM) to identify the rate of
social networks ads.

Description:
A Support Vector Machine (SVM) is a powerful machine learning algorithm widely used for
both linear and nonlinear classification, as well as regression and outlier detection tasks.
SVMs are highly adaptable, making them suitable for various applications such as text
classification, image classification, spam detection, handwriting identification, gene
expression analysis, face detection, and anomaly detection.

The dimension of the hyperplane depends on the number of features. For instance, if there are
two input features, the hyperplane is simply a line, and if there are three input features, the
hyperplane becomes a 2-D plane. As the number of features increases beyond three, the
complexity of visualizing the hyperplane also increases.

Types of Support Vector Machine:

Based on the nature of the decision boundary, Support Vector Machines (SVM) can be
divided into two main parts:
• Linear SVM: Linear SVMs use a linear decision boundary to separate the data points
of different classes. When the data can be precisely linearly separated, linear SVMs
are very suitable.
• Non-Linear SVM: Non-Linear SVM can be used to classify data when it cannot be
separated into two classes by a straight line (in the case of 2D). By using kernel
functions, nonlinear SVMs can handle nonlinearly separable data.A linear SVM is
used to locate a nonlinear decision boundary in this modified space.

SVM Algorithm in Python

Predict if cancer is Benign or malignant. Using historical data about patients diagnosed with
cancer enables doctors to differentiate malignant cases and benign ones are given
independent attributes.
Steps
• Load the breast cancer dataset from sklearn.datasets
• Separate input features and target variables.
• Build and train the SVM classifiers using RBF kernel.
• Plot the scatter plot of the input features.
• Plot the decision boundary.

ADITYA ENGINEERING COLLEGE(A) Roll No: 22A91A61

Exp No: Page No:
Date:
• Plot the decision boundary

Dataset:

Program:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("Social_Network_Ads.csv")
df.head()
df.shape
df.info()
X = df.iloc[:,[2,3]]
X
Y = df.iloc[:,4]
Y
from sklearn.model_selection import train_test_split
X_Train,X_Test,Y_Train,Y_Test = train_test_split(X,Y,test_size = 0.25,random_state =0)
X_Train.shape
X_Test.shape
X_Train.describe()
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform (X_Train)

ADITYA ENGINEERING COLLEGE(A) Roll No: 22A91A61

Exp No: Page No:
Date:
X_test = sc_X.transform(X_Test)

from sklearn import svm

from sklearn.svm import SVC
cls = svm.SVC(kernel = "linear")
import pylab as p1
np.random.seed(0)
X = np.r_[np.random.randn(50, 2) - [2, 2], np.random.randn(50, 2) + [2, 2]]
Y = [0] * 50 + [1] * 50
cls.fit(X_train,Y_Train)
pred = cls.predict(X_test)
pred
plt.scatter(X_Train.iloc[:, 0], X_Train.iloc[:, 1],c=Y_Train)
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.title('training Data')
plt.show()
plt.scatter(X_Test.iloc[:, 0], X_Test.iloc[:, 1],c=Y_Test)
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.title('training Data')
plt.show()
w = cls.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(-2,2)
yy = a * xx - (cls.intercept_[0] / w[1])
b = cls.support_vectors_[0]
yy_down = a * xx + (b[1] - a * b[0])
b = cls.support_vectors_[-1]
yy_up = a * xx + (b[1] - a * b[0])

ADITYA ENGINEERING COLLEGE(A) Roll No: 22A91A61

Exp No: Page No:
Date:
import pylab as p1
p1.set_cmap(p1.cm.Paired)
p1.plot(xx, yy, 'k-')
p1.plot(xx, yy_down, 'k--')
p1.plot(xx, yy_up, 'k--')

p1.scatter(cls.support_vectors_[:, 0], cls.support_vectors_[:, 1],

s=80, facecolors='none')
p1.scatter(X[:, 0], X[:, 1], c=Y)

p1.axis('tight')
p1.show()

from sklearn import metrics

print("accuracy:", metrics.accuracy_score(Y_Test,y_pred = pred))

print("precision:", metrics.precision_score(Y_Test,y_pred = pred))

print("recall:", metrics.recall_score(Y_Test,y_pred = pred))

cls = svm.SVC(kernel = "rbf")
cls.fit(X_train,Y_Train)
pred = cls.predict(X_test)
pred
from sklearn import metrics
print("accuracy:", metrics.accuracy_score(Y_Test,y_pred = pred))

print("precision:", metrics.precision_score(Y_Test,y_pred = pred))

print("recall:", metrics.recall_score(Y_Test,y_pred = pred))

print(metrics.classification_report(Y_Test,y_pred = pred))

ADITYA ENGINEERING COLLEGE(A) Roll No: 22A91A61

Exp No: Page No:
Date:
cls = svm.SVC(kernel = "rbf",gamma = 15, C=7,random_state =0)
cls.fit(X_train,Y_Train)
pred = cls.predict(X_test)
pred
from sklearn import metrics
print("accuracy:", metrics.accuracy_score(Y_Test,y_pred = pred))
from sklearn import metrics
print("precision:", metrics.precision_score(Y_Test,y_pred = pred))
from sklearn import metrics
print("recall:", metrics.recall_score(Y_Test,y_pred = pred))
cls = svm.SVC(kernel = "poly",degree = 4)
cls.fit(X_train,Y_Train)
pred = cls.predict(X_test)
pred
from sklearn import metrics
print("accuracy:", metrics.accuracy_score(Y_Test,y_pred = pred))
from sklearn import metrics
print("precision:", metrics.precision_score(Y_Test,y_pred = pred))
from sklearn import metrics
print("recall:", metrics.recall_score(Y_Test,y_pred = pred))

ADITYA ENGINEERING COLLEGE(A) Roll No: 22A91A61

Exp No: Page No:
Date:

Output:

ADITYA ENGINEERING COLLEGE(A) Roll No: 22A91A61

Exp No: Page No:
Date:

ADITYA ENGINEERING COLLEGE(A) Roll No: 22A91A61

Exp No: Page No:
Date:

Viva Questions:
1.What is the main objective of SVM?
Ans : To find the optimal hyperplane that maximizes the margin between different classes.
2. What is a hyperplane in SVM?
Ans : A decision boundary that separates classes in the feature space.
3.What is the margin in SVM?
Ans : The distance between the closest data points of each class and the hyperplane.
4.What is a support vector?
Ans : Data points closest to the hyperplane, which influence its position and orientation.
5. What is a kernel in SVM?
Ans : A function that transforms data into a higher-dimensional space to make it linearly
separable.

ADITYA ENGINEERING COLLEGE(A) Roll No: 22A91A61

Exp No: Page No:
Date:

Aim: Write a program to implement k-Nearest Neighbor algorithm to classify the iris data
set. Print both correct and wrong predictions
Description:
The K-nearest neighbors (KNN) algorithm is a supervised machine learning algorithm that
classifies data points based on the class of their closest neighbors:

How it works:\

The KNN algorithm classifies new data points by looking at the labels of the closest
neighbors in the training dataset's feature space. The algorithm is based on the principle of
"information gain," which means it finds the most suitable way to predict an unknown value.

When to use it:

KNN is useful when labeled data is expensive or hard to get, and it can be used for a wide
variety of prediction problems. It's also used in many areas, including image recognition,
handwriting detection, and video recognition.

Dataset:
{'data': array([[5.1, 3.5, 1.4, 0.2],

[4.9, 3. , 1.4, 0.2],

[4.7, 3.2, 1.3, 0.2],

[4.6, 3.1, 1.5, 0.2],

[5. , 3.6, 1.4, 0.2],

[5.4, 3.9, 1.7, 0.4],

………………………………………………..

[6.3, 2.5, 5. , 1.9],

[6.5, 3. , 5.2, 2. ],

[6.2, 3.4, 5.4, 2.3],

[5.9, 3. , 5.1, 1.8]]),

'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,

…………………………………………………….,

2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

ADITYA ENGINEERING COLLEGE(A) Roll No: 22A91A61

Exp No: Page No:
Date:

Program:
from sklearn.datasets import load_iris

from sklearn.model_selection import

train_test_split from sklearn.neighbors

import KNeighborsClassifier from

sklearn.metrics import accuracy_score iris

= load_iris() iris

X =

iris.data y

iris.target

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size =

0.4,random_state = 25) knn = KNeighborsClassifier(n_neighbors= 3)

knn.fit(X_train,y_train) y_pred = knn.predict(X_test) accuracy =

accuracy_score(y_test,y_pred) print("Accuracy",accuracy)

Output:
KNeighborsClassifier

KNeighborsClassifier(n_neighbors=3)

Accuracy 0.9666666666666667

Viva Questions:
1. What is K-Nearest Neighbors (KNN)?
Ans: KNN is a supervised learning algorithm used for both classification and regression. It
works by finding the 'k' closest data points (neighbors) to a query point and predicting the
class or value based on the majority label or average of these neighbors.

2. How does KNN work?

Ans: KNN identifies the 'k' nearest data points to a given input based on a distance metric
(like Euclidean distance).

ADITYA ENGINEERING COLLEGE(A) Roll No: 22A91A61

Exp No: Page No:
Date:

3. What is the role of 'k' in KNN?

Ans: 'k' is the number of nearest neighbors to consider. A smaller 'k' can lead to overfitting,
while a larger 'k' can result in underfitting.

4. What are the advantages of KNN?

Ans: It’s simple and easy to implement.

No training phase; it's a lazy learner.

Effective in low-dimensional spaces.

5. What are the disadvantages of KNN?

Ans: Computationally expensive during prediction as it needs to calculate distances for all
data points.

It is sensitive to the scale of the data, requiring normalization.

It struggles with large datasets and high-dimensional spaces.

ADITYA ENGINEERING COLLEGE(A) Roll No: 22A91A61

Exp No: Page No:
Date:

EXPERIMENT-5
Aim: To solve the real-world problems using Logistic Regression
Description:
Logistic regression is a machine learning algorithm that uses a statistical method to
predict the probability of a binary outcome based on independent variables:
Purpose
Logistic regression is used to classify data into categories and understand the
relationship between variables.
How it works
Logistic regression uses a sigmoid function to model the relationship between
variables and output a probability between 0 and 1.
When to use it
Logistic regression is used when the outcome variable is binary or categorical, such as
yes or no.
Applications
Logistic regression is used in many fields, including medical research and insurance.
For example, researchers can use logistic regression to calculate the risk of cancer by
considering patient habits and genetic predispositions.

Dataset:
Unnamed: 0 Age Sex ChestPain RestBP Chol Fbs RestECG
MaxHR ExAng Oldpeak Slope Ca Thal AHD
0 1 63 1 typical 145 233 1 2 150 0 2.3 3
0.0 fixed No
1 2 67 1 asymptomatic 160 286 0 2 108 1
1.5 2 3.0 normal Yes
2 3 67 1 asymptomatic 120 229 0 2 129 1
2.6 2 2.0 reversable Yes
3 4 37 1 nonanginal 130 250 0 0 187 0 3.5
3 0.0 normal No
4 5 41 0 nontypical 130 204 0 2 172 0 1.4
1 0.0 normal No

ADITYA ENGINEERING COLLEGE(A) Roll No: 22A91A61

Exp No: Page No:
Date:

... ... ... ... ... ... ... ... ... ... ... ... ...
... ... ...
298 299 45 1 typical 110 264 0 0 132 0 1.2 2
0.0 reversable Yes
299 300 68 1 asymptomatic 144 193 1 0 141 0
3.4 2 2.0 reversable Yes
300 301 57 1 asymptomatic 130 131 0 0 115 1
1.2 2 1.0 reversable Yes
301 302 57 0 nontypical 130 236 0 2 174 0 0.0
2 1.0 normal Yes
302 303 38 1 nonanginal 138 175 0 0 173 0 0.0
1 NaN normal No
303 rows × 15 columns
Program:
import pandas as pd
df = pd.read_csv("Heart.csv")
df.info()
df = df.drop(columns = "Unnamed: 0")
df
df['ChestPain'] = df['ChestPain'].astype('category')
df['ChestPain'] = df['ChestPain'].cat.codes
df
df['Thal'] = df['Thal'].astype('category')
df['Thal'] = df['Thal'].cat.codes
df['AHD'] = df['AHD'].astype('category')
df['AHD'] = df['AHD'].cat.codes
df
df.isnull().sum()
df = df.dropna()
df

ADITYA ENGINEERING COLLEGE(A) Roll No: 22A91A61

Exp No: Page No:
Date:

X = df.drop(columns = 'AHD')
X
y = df['AHD']
y
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size =0.3, random_state =
23)
X_train
X_test

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
X_train_scaled
X_test_scaled

from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(random_state = 0).fit(X_train_scaled,y_train)
log_reg.predict(X_train_scaled)
log_reg.score(X_train_scaled,y_train)
log_reg.score(X_test_scaled,y_test)
from sklearn.linear_model import Lasso
Lasso_reg = Lasso(alpha = 50 , max_iter = 100,tol = 0.1)
Lasso_reg.fit(X_train_scaled,y_train)
Lasso_reg.score(X_test_scaled,y_test)
Lasso_reg.score(X_test_scaled,y_test)

ADITYA ENGINEERING COLLEGE(A) Roll No: 22A91A61

Exp No: Page No:
Date:

X_train_scaled,y_train
Output:
array([1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0,
…………………………………………………………
0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0,
0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1], dtype=int8)
0.8755980861244019
0.8111111111111111
Lasso
Lasso(alpha=50, max_iter=100, tol=0.1)
-0.0002953787464659019
-0.0002953787464659019

(array([[ 0.77904095, 0.73264228, 0.17796069, ..., 0.57260251,

2.42972109, 1.13611108],
[-0.08933003, 0.73264228, 0.17796069, ..., -1.02304981,
.......................................................................................,
[ 1.43031918, 0.73264228, 0.17796069, ..., 0.57260251,
-0.69848245, 1.13611108]]),
92 0
67 0
..
..
83 1
Name: AHD, Length: 209, dtype: int8)

ADITYA ENGINEERING COLLEGE(A) Roll No: 22A91A61

Exp No: Page No:
Date:

Viva Questions:
1. What is Logistic Regression?
Ans: Logistic regression is a machine learning algorithm that uses a statistical
method to predict the probability of a binary outcome based on independent variables.
2. How does Logistic Regression work?
Ans: Logistic regression uses a sigmoid function to model the relationship between
variables and output a probability between 0 and 1.
3. Why can't we use Linear Regression for classification problems?
Ans: Because its predictions are not restricted to the range [0, 1] and can produce
values that are greater than 1 or less than 0, which don't make sense for probabilities.
4. What is the cost function used in Logistic Regression?
Ans:In Logistic Regression, we use the Log Loss (or Binary Cross-Entropy) as the
cost function, which is defined as:
𝑚
J(θ) = 1/m ∑𝑖=0 [yilog(hθ(xi)) + (1 − yi)log(1 − hθ(xi))]
where yiy_iyi is the actual label, hθ(xi)h_\theta(x_i)hθ(xi) is the predicted probability,
and mmm is the number of training.
5. What is Multinomial Logistic Regression?
Ans: Multinomial Logistic Regression is an extension of binary Logistic Regression
that is used for multi-class classification problems (where the target variable has more
than two categories). It models the probability of each class as a function of the input
variables using a softmax function instead of the sigmoid function.

ADITYA ENGINEERING COLLEGE(A) Roll No: 22A91A61

Exp No: Page No:
Date:

DECISION TREE
Aim: Write a program to demonstrate the working of the decision tree based ID3 algorithm.
Use an appropriate data set for building the decision tree and apply this knowledge to classify a
new sample.
Description: The ID3 (Iterative Dichotomiser 3) algorithm is a supervised learning
algorithm used to create decision trees for classification tasks. It works by recursively
partitioning the dataset based on the feature that provides the highest information gain, which is
a measure of how well a feature separates the data into distinct classes.
• Entropy is a measure of disorder or uncertainty in a dataset. It helps determine the
impurity in a set of examples.
• Information Gain (IG) is a measure of the effectiveness of a feature in classifying the
training data. It quantifies the reduction in entropy after the dataset is split based on a
specific feature.
Dataset:

Advantages of ID3:
• Simple and easy to understand.
• Requires little training data.
• Can work well with data with discrete and continuous attributes.
Disadvantages of ID3:
• Can lead to overfitting.
• May not be effective with data with many attributes.
Applications of ID3:
1.Fraud detection
2.Medical diagnosis
3.Customer segmentation
4.Risk assessment
5.Recommendation systems
Formulas:-
1.Entropy:
A measure of disorder or uncertainty in a set of data is called Entropy.

ADITYA UNIVERSITY Roll No: 22A91A61

Exp No: Page No:
Date:

1. Entropy= -P/P+N log2(P/P+N)-N/P+N log2(N/P+N)

Information Gain:
A measure of how well a certain quality reduces uncertainty is called Information Gain. ID3
splits the data at each stage, choosing the property that maximizes Information Gain.

Gain(S,A)=Entropy(S)- |Sv| /|S| Entropy(Sv)

rDA
Program:
import pandas as pd
import math
#calculate entropy of the whole dataset
def entropy(data):
total_samples = len(data)
if total_samples == 0:
return 0
positive_samples = sum(data['PlayTennis'] == 'Yes')
negative_samples = sum(data['PlayTennis'] == 'No')
p_positive = positive_samples / total_samples
p_negative = negative_samples / total_samples
if p_positive == 0 or p_negative == 0:
return 0
return -p_positive * math.log2(p_positive) - p_negative * math.log2(p_negative)
def information_gain(data, attribute):
total_entropy = entropy(data)
values = data[attribute].unique()
weighted_entropy = 0
for value in values:
subset = data[data[attribute] == value]
subset_entropy = entropy(subset)
subset_weight = len(subset) / len(data)
weighted_entropy += subset_weight * subset_entropy
return total_entropy - weighted_entropy
def id3(data, target_attribute, attributes):
if len(data[target_attribute].unique()) == 1:
return data[target_attribute].iloc[0]
if len(attributes) == 0:
return data[target_attribute].mode()[0]
best_attribute = max(attributes, key=lambda a: information_gain(data, a))
tree = {best_attribute: {}}
remaining_attributes = [a for a in attributes if a != best_attribute]
for value in data[best_attribute].unique():
subset = data[data[best_attribute] == value]
subtree = id3(subset, target_attribute, remaining_attributes)

ADITYA UNIVERSITY Roll No: 22A91A61

Exp No: Page No:
Date:
tree[best_attribute][value] = subtree
return tree
def classify(tree, sample):
if not isinstance(tree, dict):
return tree
attribute = next(iter(tree))
subtree = tree[attribute]
sample_value = sample[attribute]
if sample_value not in subtree:
return None
return classify(subtree[sample_value], sample)
if __name__ == "__main__":
# Load the dataset
data = pd.DataFrame({
'Outlook': ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Rainy', 'Overcast', 'Sunny', 'Sunny',
'Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy'],
'Temperature': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool', 'Mild', 'Cool', 'Mild', 'Mild',
'Mild', 'Hot', 'Mild'],
'Humidity': ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal',
'Normal', 'Normal', 'High', 'Normal', 'High'],
'Windy': [False, True, False, False, False, True, True, False, False, False, True, True, False,
True],
'PlayTennis': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']
})
target_attribute = 'PlayTennis'
attributes = ['Outlook', 'Temperature', 'Humidity', 'Windy']
# Build the decision tree
decision_tree = id3(data, target_attribute, attributes)
print("Decision Tree:")
print(decision_tree)
Output:

ADITYA UNIVERSITY Roll No: 22A91A61

Exp No: Page No:
Date:

1. What is the ID3 algorithm used for?

• The ID3 algorithm is used to create a decision tree for classification tasks by finding the
feature that provides the highest information gain at each step.
2. How does the ID3 algorithm calculate the best feature to split on?
• The ID3 algorithm calculates the best feature by computing the information gain of each
feature and selecting the one with the highest information gain.
3. What is entropy in the context of the ID3 algorithm?
• Entropy is a measure of the randomness or impurity in a dataset. The ID3 algorithm uses
entropy to quantify the uncertainty of the target variable.
4. What are the stopping conditions for the ID3 algorithm?
• The stopping conditions for the ID3 algorithm are when all instances in a subset belong
to the same class or when there are no more features to split on.
5. Can the ID3 algorithm handle continuous data?
• The ID3 algorithm is designed for categorical data. However, continuous data can be
handled by discretizing it into intervals before applying the algorithm.

ADITYA UNIVERSITY Roll No: 22A91A61

B24 ML Exp-3
No ratings yet
B24 ML Exp-3
10 pages
Assignment II Machine Learning
No ratings yet
Assignment II Machine Learning
8 pages
T1 ML QB Soln
No ratings yet
T1 ML QB Soln
23 pages
Unit 3 Aam
No ratings yet
Unit 3 Aam
30 pages
UNIT-II-Support Vector Machine Algorithm
No ratings yet
UNIT-II-Support Vector Machine Algorithm
13 pages
Classifying Data Using Support Vector Machines (SVMS) in Python
No ratings yet
Classifying Data Using Support Vector Machines (SVMS) in Python
5 pages
Data Mining Practicals
No ratings yet
Data Mining Practicals
22 pages
SVM Implementation
No ratings yet
SVM Implementation
8 pages
Title: Implement Support Vector Machine Classifier: Department of Computer Science and Engineering
No ratings yet
Title: Implement Support Vector Machine Classifier: Department of Computer Science and Engineering
5 pages
ML5&6&7&8&9&10
No ratings yet
ML5&6&7&8&9&10
35 pages
Lab Program (SVM From Scratch)
No ratings yet
Lab Program (SVM From Scratch)
2 pages
End SEM V IMP DSE 2
No ratings yet
End SEM V IMP DSE 2
9 pages
Machine Learning
No ratings yet
Machine Learning
8 pages
ML - Module 5
No ratings yet
ML - Module 5
80 pages
Lecture 15 - Recap and Midterm Review
No ratings yet
Lecture 15 - Recap and Midterm Review
37 pages
Classification Review
No ratings yet
Classification Review
8 pages
Python
No ratings yet
Python
14 pages
Understanding Support Vector Machine Algorithm From Examples
No ratings yet
Understanding Support Vector Machine Algorithm From Examples
10 pages
Prathamesh KRAI
No ratings yet
Prathamesh KRAI
38 pages
Support Vector Machine For Classification: Name: Saurav Doke Roll No: A-41 PRN: 2264191242040
No ratings yet
Support Vector Machine For Classification: Name: Saurav Doke Roll No: A-41 PRN: 2264191242040
3 pages
Machine Learning General: Definiton
No ratings yet
Machine Learning General: Definiton
14 pages
Unit2 ML Programs
No ratings yet
Unit2 ML Programs
7 pages
Model Generalization
No ratings yet
Model Generalization
117 pages
ML11 Generalization
No ratings yet
ML11 Generalization
40 pages
Maxbox - Starter67 Machine Learning
No ratings yet
Maxbox - Starter67 Machine Learning
7 pages
ML Lab6
No ratings yet
ML Lab6
4 pages
Udacity Machine Learning Analysis Supervised Learning
100% (1)
Udacity Machine Learning Analysis Supervised Learning
504 pages
1
No ratings yet
1
13 pages
Deep Learning Unit 3
No ratings yet
Deep Learning Unit 3
19 pages
SVM K NN MLP With Sklearn Jupyter NoteBo
No ratings yet
SVM K NN MLP With Sklearn Jupyter NoteBo
22 pages
PML Lab Exp 10
No ratings yet
PML Lab Exp 10
3 pages
3 SVM - Jupyter Notebook
No ratings yet
3 SVM - Jupyter Notebook
4 pages
SVM, Neural Network and Random Forest in R
No ratings yet
SVM, Neural Network and Random Forest in R
45 pages
Svmdoc
No ratings yet
Svmdoc
7 pages
Xiiaiuniticapstone Projectpartii
No ratings yet
Xiiaiuniticapstone Projectpartii
11 pages
08 Classification
No ratings yet
08 Classification
46 pages
ML Unit 4 Trupesh Patel
No ratings yet
ML Unit 4 Trupesh Patel
56 pages
ML Fundamentals
No ratings yet
ML Fundamentals
15 pages
DSM MOd 5
No ratings yet
DSM MOd 5
34 pages
Aiml Ex 4-7
No ratings yet
Aiml Ex 4-7
8 pages
AI ML - Cycle 2 Programs
No ratings yet
AI ML - Cycle 2 Programs
15 pages
SML
No ratings yet
SML
8 pages
ML Prac1-10
No ratings yet
ML Prac1-10
32 pages
SVM7
No ratings yet
SVM7
53 pages
Section 1: Cross-Validation and Model Performance
No ratings yet
Section 1: Cross-Validation and Model Performance
33 pages
ML External Xerox
No ratings yet
ML External Xerox
1 page
ML Lab Record - 250625 - 105014
No ratings yet
ML Lab Record - 250625 - 105014
29 pages
Evaluating Machine Learning Algorithms and Model Selection
No ratings yet
Evaluating Machine Learning Algorithms and Model Selection
10 pages
5 DL
No ratings yet
5 DL
33 pages
Machine Learning: Lecture 13: Model Validation Techniques, Overfitting, Underfitting
100% (2)
Machine Learning: Lecture 13: Model Validation Techniques, Overfitting, Underfitting
26 pages
15-The Bias - Variance - Trade-Off-08-04-2024
No ratings yet
15-The Bias - Variance - Trade-Off-08-04-2024
23 pages
Week 7 Laboratory Activity
No ratings yet
Week 7 Laboratory Activity
12 pages
SVM Presentation
No ratings yet
SVM Presentation
27 pages
Prediction On Iris
No ratings yet
Prediction On Iris
14 pages
17 Ensemble Techniques Problem Statement
No ratings yet
17 Ensemble Techniques Problem Statement
28 pages
Aiml 5-8
No ratings yet
Aiml 5-8
19 pages
1 Analytical Part (3 Percent Grade) : + + + 1 N I: y +1 I 1 N I: y 1 I
No ratings yet
1 Analytical Part (3 Percent Grade) : + + + 1 N I: y +1 I 1 N I: y 1 I
5 pages
Session 19 - SVM
No ratings yet
Session 19 - SVM
21 pages
30 Days of Interview Preparation
100% (1)
30 Days of Interview Preparation
415 pages
Lecture - 3 Classification (Decision Tree)
No ratings yet
Lecture - 3 Classification (Decision Tree)
44 pages
Decision Tree
0% (1)
Decision Tree
24 pages
Sales Analysis of E-Commerce Websites Using Data M
No ratings yet
Sales Analysis of E-Commerce Websites Using Data M
6 pages
ID3 Algorithm
100% (1)
ID3 Algorithm
3 pages
Short Answer Type Questions: Question Bank
No ratings yet
Short Answer Type Questions: Question Bank
26 pages
Mastering Machine Learning With Scikit-Learn: Chapter No. 5 "Nonlinear Classification and Regression With Decision Trees"
No ratings yet
Mastering Machine Learning With Scikit-Learn: Chapter No. 5 "Nonlinear Classification and Regression With Decision Trees"
23 pages
Dmbi Unit-3
No ratings yet
Dmbi Unit-3
21 pages
2024 Decision Trees
No ratings yet
2024 Decision Trees
28 pages
UNIT - 5 - ID3 Algotithm (Good Slide)
No ratings yet
UNIT - 5 - ID3 Algotithm (Good Slide)
28 pages
Prediction of Diseases in Smart Health Care System Using Machine Learning
No ratings yet
Prediction of Diseases in Smart Health Care System Using Machine Learning
5 pages
LINFO2262: Decision Trees + Random Forests: Pierre Dupont
No ratings yet
LINFO2262: Decision Trees + Random Forests: Pierre Dupont
43 pages
INT354 Question Bank
No ratings yet
INT354 Question Bank
11 pages
Building Multi-Way Decision Trees With Numerical Attributes
No ratings yet
Building Multi-Way Decision Trees With Numerical Attributes
18 pages
Machine Learning Revision Notes
No ratings yet
Machine Learning Revision Notes
6 pages
Handling Missing Value in Decision Tree Algorithm PDF
No ratings yet
Handling Missing Value in Decision Tree Algorithm PDF
6 pages
Unit 5 - DA - Classification & Clustering
No ratings yet
Unit 5 - DA - Classification & Clustering
105 pages
Decision Tree ID3 CART
No ratings yet
Decision Tree ID3 CART
28 pages
Learning Set of Rules
No ratings yet
Learning Set of Rules
11 pages
Classification - Decision Trees
No ratings yet
Classification - Decision Trees
43 pages
Ecs 403 ML Module I
No ratings yet
Ecs 403 ML Module I
33 pages
Decision Trees - 2022
No ratings yet
Decision Trees - 2022
49 pages
ML-3-Decision Tree
No ratings yet
ML-3-Decision Tree
17 pages
ML Unit2
No ratings yet
ML Unit2
22 pages
U4 ML Updated
No ratings yet
U4 ML Updated
32 pages
Data Mining Unit 3
No ratings yet
Data Mining Unit 3
50 pages
Data Science Interview Questions
100% (1)
Data Science Interview Questions
300 pages
Decision Tree
No ratings yet
Decision Tree
74 pages
DWDM Notes Unit-4
No ratings yet
DWDM Notes Unit-4
89 pages
Appendices
No ratings yet
Appendices
6 pages