ML Lab

This document provides an overview of various machine learning algorithms including FIND-S, Candidate Elimination, ID3 Decision Tree, Artificial Neural Networks, Naïve Bayes, Bayesian Networks, K-Means, and KNN. Each algorithm is explained with sample datasets and Python code implementations, demonstrating their applications in classification, regression, and clustering tasks. Additionally, it includes performance evaluation metrics and comparisons between different algorithms.


FIND-S Algorithm Demonstration

This document explains and demonstrates the FIND-S algorithm, which is used to find the
most specific hypothesis that fits all the positive examples in a given dataset.

Step 1: Sample CSV Format


The following is a sample of training data stored in a CSV file named 'training_data.csv':

Sky     AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
Sunny   Warm     Normal    Strong  Warm   Same      Yes
Sunny   Warm     High      Strong  Warm   Same      Yes
Rainy   Cold     High      Strong  Warm   Change    No
Sunny   Warm     High      Strong  Cool   Change    Yes

Step 2: Python Code Implementation

import pandas as pd

def find_s_algorithm(df):
    # Keep only the positive examples (last column == "Yes")
    positive_examples = df[df.iloc[:, -1] == "Yes"]

    # Initialize the hypothesis with the first positive example
    hypothesis = positive_examples.iloc[0, :-1].tolist()

    # Generalize the hypothesis over every further positive example
    for _, row in positive_examples.iterrows():
        for i in range(len(hypothesis)):
            if hypothesis[i] != row.iloc[i]:
                hypothesis[i] = "?"

    return hypothesis

# Load data from CSV file
file_path = "training_data.csv"  # Update with your actual path
data = pd.read_csv(file_path)

# Apply FIND-S
final_hypothesis = find_s_algorithm(data)
print("Final hypothesis:", final_hypothesis)

Step 3: Output Example


Final hypothesis: ['Sunny', 'Warm', '?', 'Strong', '?', '?']

This indicates the most specific hypothesis that fits all positive examples.

Candidate Elimination Algorithm - Detailed Explanation

Overview:

The Candidate-Elimination algorithm is a supervised learning algorithm used in concept learning. It aims to find all hypotheses consistent with a given training dataset by maintaining a version space, bounded by the most specific hypothesis (S) and the set of most general hypotheses (G).

Sample Training Data:


The following dataset is used for demonstration:

Sky     AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
Sunny   Warm     Normal    Strong  Warm   Same      Yes
Sunny   Warm     High      Strong  Warm   Same      Yes
Rainy   Cold     High      Strong  Warm   Change    No
Sunny   Warm     High      Strong  Cool   Change    Yes

Algorithm Steps:

1. Initialize S to the first positive example.
2. Initialize G to the most general hypothesis: all fields are '?'.
3. For each training example:
   - If the example is positive, generalize S minimally to include it, and remove from G any hypothesis inconsistent with S.
   - If the example is negative, specialize G minimally to exclude it, keeping only specializations consistent with S.

Final Output:
After processing all the examples, we obtain the following:

Specific Hypothesis (S): ['Sunny', 'Warm', '?', 'Strong', '?', '?']

General Hypotheses (G):

[['Sunny', '?', '?', '?', '?', '?'], ['?', 'Warm', '?', '?', '?', '?']]

Explanation:

- The Specific Hypothesis (S) is the most specific hypothesis that covers all positive
examples.
- The General Hypotheses (G) are the most general boundaries that are consistent with all
positive and negative examples.
- Together, S and G represent the version space of all consistent hypotheses.
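
For reference, the version-space updates above can be sketched in Python (an illustrative implementation, not part of the original lab sheet; it assumes the same training_data.csv file used in the FIND-S example and applies the simplified boundary updates listed in the steps):

import pandas as pd

def candidate_elimination(df):
    attrs = df.iloc[:, :-1].values
    target = df.iloc[:, -1].values
    n = attrs.shape[1]

    # S starts as the first positive example; G as the single most general hypothesis
    S = list(attrs[target == "Yes"][0])
    G = [["?"] * n]

    for x, label in zip(attrs, target):
        if label == "Yes":
            # Generalize S minimally so it covers x
            S = [s if s == v else "?" for s, v in zip(S, x)]
            # Drop members of G that are no longer more general than S
            G = [g for g in G if all(gi == "?" or gi == si for gi, si in zip(g, S))]
        else:
            # Specialize each member of G just enough to exclude the negative example x
            new_G = []
            for g in G:
                for i in range(n):
                    if g[i] == "?" and S[i] != "?" and S[i] != x[i]:
                        h = g.copy()
                        h[i] = S[i]
                        new_G.append(h)
            G = new_G
    return S, G

data = pd.read_csv("training_data.csv")
S, G = candidate_elimination(data)
print("Specific Hypothesis (S):", S)
print("General Hypotheses (G):", G)

On the EnjoySport data above, this sketch reproduces the S and G boundaries listed in the Final Output section.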

ID3 Decision Tree Algorithm - Implementation & Demonstration

Overview:

The ID3 (Iterative Dichotomiser 3) algorithm is a decision tree learning algorithm used for
classification.
It builds the tree by choosing the attribute that yields the highest information gain at each
step.
It is typically used with categorical data.
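
Before turning to scikit-learn, the information-gain criterion itself can be written in a few lines of Python (an added illustration, not part of the original lab code):

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, attribute_index, labels):
    # Entropy reduction obtained by splitting the rows on one attribute
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute_index], []).append(label)
    weighted = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

ID3 evaluates information_gain for every remaining attribute, splits on the attribute with the highest value, and recurses until a node is pure or no attributes remain.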

Dataset:

The following dataset is used to demonstrate the ID3 algorithm (Play Tennis dataset):

Outlook   Temperature  Humidity  Wind    PlayTennis
Sunny     Hot          High      Weak    No
Sunny     Hot          High      Strong  No
Overcast  Hot          High      Weak    Yes
Rain      Mild         High      Weak    Yes
Rain      Cool         Normal    Weak    Yes
Rain      Cool         Normal    Strong  No
Overcast  Cool         Normal    Strong  Yes
Sunny     Mild         High      Weak    No
Sunny     Cool         Normal    Weak    Yes
Rain      Mild         Normal    Weak    Yes
Sunny     Mild         Normal    Strong  Yes
Overcast  Mild         High      Strong  Yes
Overcast  Hot          Normal    Weak    Yes
Rain      Mild         High      Strong  No

Python Code:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt

# Load dataset
data = pd.read_csv("play_tennis.csv")

# Encode categorical values, keeping one encoder per column
encoders = {}
for column in data.columns:
    encoders[column] = LabelEncoder()
    data[column] = encoders[column].fit_transform(data[column])

# Train model
X = data.drop("PlayTennis", axis=1)
y = data["PlayTennis"]
clf = DecisionTreeClassifier(criterion="entropy")
clf.fit(X, y)

# Visualize tree
plot_tree(clf, feature_names=X.columns, class_names=["No", "Yes"], filled=True)
plt.show()

# Predict a new sample, encoding it with the encoders fitted above
sample = pd.DataFrame([["Sunny", "Cool", "High", "Strong"]],
                      columns=["Outlook", "Temperature", "Humidity", "Wind"])
sample_encoded = sample.apply(lambda col: encoders[col.name].transform(col))
print("Prediction:", clf.predict(sample_encoded))

Explanation:

The above code approximates ID3 using sklearn's DecisionTreeClassifier with entropy as the splitting criterion (sklearn builds binary CART-style trees, but entropy captures the same information-gain idea as ID3).
It trains a model on the Play Tennis dataset, visualizes the decision tree, and predicts the output for a new sample encoded with the same label encoders used during training.

Artificial Neural Network using Backpropagation - Implementation Guide

Overview:

An Artificial Neural Network (ANN) is a computational model inspired by the human brain.
The backpropagation algorithm is used to train the ANN by minimizing the error between the
predicted and actual output through gradient descent.

Dataset:

We use a simple dataset such as the XOR function to demonstrate the working of ANN with
backpropagation.

Input1  Input2  Output
0       0       0
0       1       1
1       0       1
1       1       0
Python Code (using Keras):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# XOR dataset
X = np.array([[0,0],[0,1],[1,0],[1,1]])
y = np.array([[0],[1],[1],[0]])

# Build ANN model
model = Sequential()
model.add(Dense(4, input_dim=2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train model
model.fit(X, y, epochs=1000, verbose=0)

# Test predictions
predictions = model.predict(X)
print("Predictions:\n", predictions)

Explanation:

The model consists of an input layer, one hidden layer with 4 neurons, and an output layer.
The activation function 'relu' is used in the hidden layer and 'sigmoid' in the output layer.
Backpropagation is used internally by Keras to adjust weights and minimize error using the
'adam' optimizer.
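
To make the weight updates explicit, the same XOR network can also be trained with hand-written backpropagation in NumPy (an illustrative sketch with sigmoid units and plain gradient descent on a squared-error loss; it may need more epochs or a different random seed to converge):

import numpy as np

# XOR data, as above
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 4)); b1 = np.zeros((1, 4))   # hidden layer: 4 neurons
W2 = rng.normal(size=(4, 1)); b2 = np.zeros((1, 1))   # output layer: 1 neuron

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for epoch in range(10000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass: error signal times sigmoid derivative, propagated layer by layer
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)

    # Gradient descent updates
    W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * (X.T @ d_h);   b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(np.round(out, 3))   # values close to [0, 1, 1, 0] indicate the network learned XOR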

Naïve Bayes Classifier - Implementation & Accuracy Evaluation

Overview:

The Naïve Bayes classifier is a probabilistic classifier based on Bayes’ Theorem with strong
independence assumptions between features.
It is simple, fast, and effective for many classification problems.
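
Concretely, for a sample x = (x1, ..., xn) the classifier predicts the class c that maximizes P(c) * P(x1 | c) * ... * P(xn | c), where the prior P(c) and the per-feature likelihoods P(xi | c) are estimated from frequency counts in the training data.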

Dataset:

We assume a CSV file containing labeled data for training and testing. For demonstration,
let’s consider a Play Tennis dataset.

Outlook   Temperature  Humidity  Wind    PlayTennis
Sunny     Hot          High      Weak    No
Sunny     Hot          High      Strong  No
Overcast  Hot          High      Weak    Yes
Rain      Mild         High      Weak    Yes
Rain      Cool         Normal    Weak    Yes
Rain      Cool         Normal    Strong  No
Overcast  Cool         Normal    Strong  Yes
Sunny     Mild         High      Weak    No
Sunny     Cool         Normal    Weak    Yes
Rain      Mild         Normal    Weak    Yes

Python Code:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score

# Load dataset
data = pd.read_csv("play_tennis.csv")

# Encode categorical data, keeping one encoder per column
label_encoders = {}
for col in data.columns:
    le = LabelEncoder()
    data[col] = le.fit_transform(data[col])
    label_encoders[col] = le

# Split into train and test sets
X = data.drop("PlayTennis", axis=1)
y = data["PlayTennis"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Naive Bayes model
model = CategoricalNB()
model.fit(X_train, y_train)

# Predict and evaluate
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)

Explanation:

The code uses sklearn's `CategoricalNB` to train the Naïve Bayes classifier on categorical
data.
The dataset is encoded, split into training and test sets, and the model is trained and evaluated
for accuracy.

Document Classification using Naïve Bayes in Java

Overview:

This document outlines the implementation of a Naïve Bayes Classifier in Java using the
Weka API to classify a set of documents.
The model computes Accuracy, Precision, and Recall to evaluate its performance.

Assumptions:

- The dataset is in ARFF format (Weka-compatible) and contains labeled document data.
- The Java program uses Weka’s NaiveBayes class.
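
For orientation, a documents.arff file compatible with this setup might look as follows (the relation, attribute names, and values here are purely illustrative assumptions, since the actual dataset is not included in this document):

@relation documents
@attribute word_ball numeric
@attribute word_vote numeric
@attribute class {sports, politics}
@data
3, 0, sports
0, 2, politics
1, 0, sports
0, 3, politics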

Java Code:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.Evaluation;
import java.util.Random;

public class DocumentClassifier {

    public static void main(String[] args) throws Exception {
        // Load dataset
        DataSource source = new DataSource("documents.arff");
        Instances data = source.getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Train Naive Bayes model
        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(data);

        // Evaluate with 10-fold cross-validation
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(nb, data, 10, new Random(1));

        // Output metrics
        System.out.println("Accuracy: " + eval.pctCorrect());
        System.out.println("Precision: " + eval.precision(1));
        System.out.println("Recall: " + eval.recall(1));
    }
}

Explanation:

This Java program loads a document dataset, builds a Naïve Bayes classifier using Weka,
and evaluates its performance with 10-fold cross-validation. It prints Accuracy, Precision, and
Recall for class '1'.

Bayesian Network for Heart Disease Diagnosis

Overview:

This program demonstrates how to construct and use a Bayesian Network to diagnose heart
disease using a medical dataset.
The model is implemented in Python using the `pgmpy` library.

Dataset:

We use a simplified version of the UCI Heart Disease dataset with attributes like Age,
Gender, ChestPainType, Cholesterol, and HeartDisease.

Python Code (using pgmpy):

import pandas as pd
from pgmpy.models import BayesianNetwork
from pgmpy.estimators import MaximumLikelihoodEstimator
from pgmpy.inference import VariableElimination

# Sample medical dataset
data = pd.DataFrame([
    [63, 1, 'typical', 233, 1],
    [37, 1, 'non-anginal', 250, 1],
    [41, 0, 'atypical', 204, 0],
    [56, 1, 'asymptomatic', 236, 1],
    [57, 0, 'typical', 354, 0],
], columns=["Age", "Sex", "ChestPainType", "Cholesterol", "HeartDisease"])

# Convert categorical values to integer codes
data["ChestPainType"] = data["ChestPainType"].astype('category').cat.codes

# Define Bayesian Network structure
model = BayesianNetwork([
    ("Age", "HeartDisease"),
    ("Sex", "HeartDisease"),
    ("ChestPainType", "HeartDisease"),
    ("Cholesterol", "HeartDisease")
])

# Learn CPDs using Maximum Likelihood Estimator
model.fit(data, estimator=MaximumLikelihoodEstimator)

# Perform inference
inference = VariableElimination(model)
result = inference.query(variables=["HeartDisease"],
                         evidence={"Age": 56, "Sex": 1, "ChestPainType": 3, "Cholesterol": 236})
print(result)

Explanation:

This code builds a Bayesian Network with connections from predictors to the target variable
(HeartDisease).
The `MaximumLikelihoodEstimator` learns the CPDs from the data.
The `VariableElimination` module performs inference to diagnose if a patient with given
attributes has heart disease.
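
If desired, the conditional probability tables learned by the estimator can be printed for inspection (a small optional addition to the script above):

for cpd in model.get_cpds():
    print(cpd)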

Clustering with EM and K-Means Algorithms - Implementation & Comparison

Overview:

This program applies the Expectation-Maximization (EM) algorithm and the K-Means
algorithm to cluster a dataset stored in a CSV file.
It uses Python's scikit-learn library for implementation and compares the results using
silhouette score for clustering quality.

Python Code:

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Load dataset
data = pd.read_csv("data.csv")
X = StandardScaler().fit_transform(data)

# K-Means Clustering
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans_labels = kmeans.fit_predict(X)
kmeans_score = silhouette_score(X, kmeans_labels)
print("K-Means Silhouette Score:", kmeans_score)

# EM Clustering using Gaussian Mixture Model
gmm = GaussianMixture(n_components=3, random_state=42)
gmm_labels = gmm.fit_predict(X)
gmm_score = silhouette_score(X, gmm_labels)
print("EM Silhouette Score:", gmm_score)

Explanation:

Both K-Means and EM algorithms aim to group similar data points together:
- K-Means partitions the dataset into clusters by minimizing the sum of squared distances.
- EM uses a probabilistic model (Gaussian Mixture) to estimate the likelihood of data points
belonging to clusters.

The silhouette score measures how well data points fit into their assigned clusters. Higher
scores indicate better-defined clusters.
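
For a single point i, with a(i) the mean distance to the other points in its own cluster and b(i) the mean distance to the points of the nearest other cluster, the silhouette coefficient is s(i) = (b(i) - a(i)) / max(a(i), b(i)); silhouette_score reports the mean of s(i) over all points, ranging from -1 to 1.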

Comparison and Conclusion:

- If the silhouette score for EM is higher than K-Means, EM provides better clustering,
especially when data clusters are elliptical.
- If K-Means performs similarly or better, it suggests that the clusters are spherical and well-
separated.
- Depending on the data shape, EM may handle overlapping clusters better than K-Means.

K-Nearest Neighbour (KNN) Algorithm - Iris Dataset Classification

Overview:

The K-Nearest Neighbour (KNN) algorithm is a simple, non-parametric method used for
classification and regression.
This implementation uses Python's scikit-learn library to classify the Iris dataset.
It prints both correct and incorrect predictions.

Python Code (using sklearn):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train KNN classifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Predict and evaluate
predictions = knn.predict(X_test)
for i in range(len(y_test)):
    actual = target_names[y_test[i]]
    predicted = target_names[predictions[i]]
    status = "Correct" if y_test[i] == predictions[i] else "Incorrect"
    print(f"Sample {i+1}: Actual = {actual}, Predicted = {predicted} --> {status}")

Explanation:

- The KNN algorithm classifies test samples based on the majority vote of the k-nearest
training examples.
- The Iris dataset contains three classes of 50 instances each, where each class refers to a type
of iris plant.
- The script prints the prediction results for each test sample and indicates whether each
prediction was correct or incorrect.
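
If an overall accuracy figure is wanted in addition to the per-sample listing, one line can be appended to the script above (a small optional extension):

from sklearn.metrics import accuracy_score
print("Overall accuracy:", accuracy_score(y_test, predictions))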

Locally Weighted Regression (LWR) - Implementation & Visualization

Overview:

Locally Weighted Regression (LWR) is a non-parametric algorithm used to fit a curve through
data points.
It computes weights for each training point depending on its distance to the query point and
fits a regression using weighted least squares.
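
Concretely, for a query point x, each training point x_i receives a weight w_i = exp(-(x_i - x)^2 / (2 * tau^2)), and the local parameters are the weighted least-squares solution theta = (X^T W X)^(-1) X^T W y, where W is the diagonal matrix of the weights w_i.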

Python Code Summary:

- `kernel`: Defines the weight for each training point based on Gaussian kernel.
- `predict`: Computes the regression prediction using weighted least squares.
- Data: 100 points generated from `sin(x)` with added noise.
- Bandwidth (`tau`) determines how 'local' the fit is.
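
A minimal sketch consistent with this summary is shown below (an illustrative implementation; the variable names and the bandwidth value tau = 0.5 are assumptions, since the original code is not reproduced in this document):

import numpy as np
import matplotlib.pyplot as plt

def kernel(x_query, X, tau):
    # Gaussian weights: training points near the query dominate the local fit
    return np.exp(-((X - x_query) ** 2) / (2 * tau ** 2))

def predict(x_query, X, y, tau):
    # Weighted least squares on the design matrix [1, x]
    w = kernel(x_query, X, tau)
    A = np.c_[np.ones_like(X), X]
    W = np.diag(w)
    theta = np.linalg.pinv(A.T @ W @ A) @ (A.T @ W @ y)
    return np.array([1.0, x_query]) @ theta

# 100 noisy samples of sin(x)
rng = np.random.default_rng(42)
X = np.linspace(0, 2 * np.pi, 100)
y = np.sin(X) + rng.normal(scale=0.2, size=X.shape)

tau = 0.5
x_plot = np.linspace(0, 2 * np.pi, 200)
y_plot = [predict(xq, X, y, tau) for xq in x_plot]

plt.scatter(X, y, color="red", s=10, label="noisy data")
plt.plot(x_plot, y_plot, color="blue", label="LWR fit")
plt.legend()
plt.show()
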
Graphical Output:

The resulting plot (not reproduced in this document) shows the noisy data in red and the smooth curve fitted by LWR in blue.
