ML Lab
FIND-S Algorithm - Implementation & Demonstration
Overview:
This section explains and demonstrates the FIND-S algorithm, which finds the most specific hypothesis that fits all the positive examples in a given dataset.
Python Code:
import pandas as pd

def find_s_algorithm(df):
    # Keep only the positive examples
    positive_examples = df[df.iloc[:, -1] == "Yes"]
    # Initialize with the first positive example, then generalize any attribute that differs
    hypothesis = positive_examples.iloc[0, :-1].tolist()
    for _, row in positive_examples.iloc[1:, :-1].iterrows():
        hypothesis = [h if h == v else "?" for h, v in zip(hypothesis, row)]
    return hypothesis

# Apply FIND-S (training_data.csv is a placeholder; the last column must hold the Yes/No label)
data = pd.read_csv("training_data.csv")
final_hypothesis = find_s_algorithm(data)
print("Final hypothesis:", final_hypothesis)
Explanation:
The printed hypothesis is the most specific one that fits all the positive examples in the dataset.
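For example, if the positive examples are ('Sunny', 'Warm', 'Normal', 'Strong') and ('Sunny', 'Warm', 'High', 'Strong'), FIND-S keeps the attributes on which they agree and generalizes the one that differs, yielding ('Sunny', 'Warm', '?', 'Strong').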
Candidate Elimination Algorithm - Detailed Explanation
Overview:
The Candidate Elimination algorithm computes the version space: the set of all hypotheses consistent with the training examples. It does so by maintaining a specific boundary S and a general boundary G.
Algorithm Steps:
1. Initialize S to the first positive example and G to the most general hypothesis ('?' for every attribute).
2. For each positive example, minimally generalize S so that it covers the example, and remove from G any hypothesis inconsistent with the example.
3. For each negative example, minimally specialize G so that it excludes the example, and remove from S any hypothesis that covers the example.
4. Repeat until all examples are processed; S and G then bound the version space.
Final Output:
After processing all the examples, we obtain the following general boundary G:
[['Sunny', '?', '?', '?', '?', '?'], ['?', 'Warm', '?', '?', '?', '?']]
Explanation:
- The Specific Hypothesis (S) is the most specific hypothesis that covers all positive
examples.
- The General Hypotheses (G) are the most general boundaries that are consistent with all
positive and negative examples.
- Together, S and G represent the version space of all consistent hypotheses.
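No code listing survives in this section, so the following is a minimal Python sketch of the algorithm described above. It assumes a pandas DataFrame whose last column holds the Yes/No label (the file name training_data.csv is a placeholder) and uses the common simplified specialization step in which each G hypothesis constrains a single attribute.
import pandas as pd

def candidate_elimination(df):
    X = df.iloc[:, :-1].values
    y = df.iloc[:, -1].values
    # S starts as the first positive example; G holds one candidate row per attribute
    S = list(X[y == "Yes"][0])
    G = [["?"] * len(S) for _ in S]
    for row, label in zip(X, y):
        if label == "Yes":
            # Generalize S; drop G constraints that conflict with the positive example
            for i, value in enumerate(row):
                if S[i] != value:
                    S[i] = "?"
                    G[i][i] = "?"
        else:
            # Specialize G against the negative example, guided by S
            for i, value in enumerate(row):
                G[i][i] = S[i] if S[i] != value else "?"
    # Keep only the specialized (non-trivial) general hypotheses
    G = [g for g in G if g != ["?"] * len(S)]
    return S, G

S, G = candidate_elimination(pd.read_csv("training_data.csv"))
print("S =", S)
print("G =", G)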
ID3 Decision Tree Algorithm - Implementation & Demonstration
Overview:
The ID3 (Iterative Dichotomiser 3) algorithm is a decision tree learning algorithm used for
classification.
It builds the tree by choosing the attribute that yields the highest information gain at each
step.
It is typically used with categorical data.
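To make the splitting criterion concrete, here is a small sketch that computes entropy and information gain directly from a DataFrame; it assumes the Play Tennis data from play_tennis.csv with a PlayTennis target column.
import numpy as np
import pandas as pd

def entropy(labels):
    # H(S) = -sum(p * log2(p)) over the class proportions
    probs = labels.value_counts(normalize=True)
    return float(-np.sum(probs * np.log2(probs)))

def information_gain(df, attribute, target="PlayTennis"):
    # Gain(S, A) = H(S) - sum over values v of (|S_v| / |S|) * H(S_v)
    remainder = sum(
        len(subset) / len(df) * entropy(subset[target])
        for _, subset in df.groupby(attribute)
    )
    return entropy(df[target]) - remainder

# On the standard Play Tennis data, Outlook has the highest gain and becomes the root
data = pd.read_csv("play_tennis.csv")
print({col: round(information_gain(data, col), 3)
       for col in data.columns.drop("PlayTennis")})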
Dataset:
The standard 14-example Play Tennis dataset (attributes Outlook, Temperature, Humidity, and Wind; target PlayTennis) is used to demonstrate the ID3 algorithm; the code below loads it from play_tennis.csv.
Python Code:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt

# Load dataset
data = pd.read_csv("play_tennis.csv")

# Encode each categorical column as integers (the classifier needs numeric input)
encoders = {col: LabelEncoder().fit(data[col]) for col in data.columns}
encoded = data.apply(lambda col: encoders[col.name].transform(col))

# Train model
X = encoded.drop("PlayTennis", axis=1)
y = encoded["PlayTennis"]
clf = DecisionTreeClassifier(criterion="entropy")
clf.fit(X, y)

# Visualize tree
plot_tree(clf, feature_names=list(X.columns),
          class_names=list(encoders["PlayTennis"].classes_), filled=True)
plt.show()

# Predict the output for a new sample (assumes column order Outlook, Temperature, Humidity, Wind)
sample = pd.DataFrame([["Sunny", "Cool", "High", "Strong"]], columns=X.columns)
sample = sample.apply(lambda col: encoders[col.name].transform(col))
print("Prediction:", encoders["PlayTennis"].inverse_transform(clf.predict(sample)))
Explanation:
The code approximates ID3 with sklearn's DecisionTreeClassifier using entropy as the criterion; scikit-learn implements CART, so it chooses splits by information gain but builds binary splits over the label-encoded attributes rather than the multiway splits of classical ID3.
It trains the model on the Play Tennis dataset, visualizes the resulting decision tree, and predicts the output for a new sample.
Artificial Neural Network using Backpropagation - Implementation Guide
Overview:
An Artificial Neural Network (ANN) is a computational model inspired by the human brain.
The backpropagation algorithm is used to train the ANN by minimizing the error between the
predicted and actual output through gradient descent.
Dataset:
We use a simple dataset such as the XOR function to demonstrate the working of ANN with
backpropagation.
Input1  Input2  Output
0       0       0
0       1       1
1       0       1
1       1       0
Python Code (using Keras):
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
# XOR dataset
X = np.array([[0,0],[0,1],[1,0],[1,1]])
y = np.array([[0],[1],[1],[0]])
# Build model: one hidden layer with 4 neurons (relu) and a sigmoid output
model = Sequential()
model.add(Dense(4, input_shape=(2,), activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Train model
model.fit(X, y, epochs=1000, verbose=0)
# Test predictions
predictions = model.predict(X)
print("Predictions:\n", predictions)
Explanation:
The model consists of an input layer, one hidden layer with 4 neurons, and an output layer.
The activation function 'relu' is used in the hidden layer and 'sigmoid' in the output layer.
Backpropagation is used internally by Keras to adjust weights and minimize error using the
'adam' optimizer.
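Keras hides the training loop, so as a companion, here is a minimal from-scratch sketch of backpropagation for the same 2-4-1 architecture on XOR. The sigmoid activations, squared-error loss, learning rate of 0.5, and iteration count are simplifying choices made for this sketch, not taken from the Keras code above.
import numpy as np

# XOR data
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# 2-4-1 network with random initial weights
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5

for _ in range(10000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: chain rule from the squared error back through each layer
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Gradient-descent updates
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(out.round(3))  # predictions should approach [0, 1, 1, 0]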
Naïve Bayes Classifier - Implementation & Accuracy Evaluation
Overview:
The Naïve Bayes classifier is a probabilistic classifier based on Bayes’ Theorem with strong
independence assumptions between features.
It is simple, fast, and effective for many classification problems.
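As a concrete illustration of the decision rule (choose the class c that maximizes P(c) multiplied by the product of P(x_i | c)), the snippet below scores a single feature, Outlook = Sunny, using class counts from the standard Play Tennis dataset (9 Yes and 5 No examples; Sunny appears in 2 of the Yes rows and 3 of the No rows):
# Priors and one conditional probability, estimated by counting in the Play Tennis data
p_yes, p_no = 9 / 14, 5 / 14              # P(Yes), P(No)
p_sunny_yes, p_sunny_no = 2 / 9, 3 / 5    # P(Outlook=Sunny | class)

# Unnormalized posteriors for the single feature Outlook=Sunny
score_yes = p_yes * p_sunny_yes   # = 2/14, about 0.143
score_no = p_no * p_sunny_no      # = 3/14, about 0.214
print("Predicted:", "Yes" if score_yes > score_no else "No")  # -> No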
Dataset:
We assume a CSV file containing labeled data for training and testing. For demonstration,
let’s consider a Play Tennis dataset.
Python Code:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score

# Load the dataset and encode each categorical column as integers
data = pd.read_csv("play_tennis.csv")
encoded = data.apply(lambda col: LabelEncoder().fit_transform(col))
X, y = encoded.drop("PlayTennis", axis=1), encoded["PlayTennis"]

# Split into training and test sets, train, and evaluate accuracy
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = CategoricalNB()
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
Explanation:
The code uses sklearn's `CategoricalNB` to train the Naïve Bayes classifier on categorical
data.
The dataset is encoded, split into training and test sets, and the model is trained and evaluated
for accuracy.
Document Classification using Naïve Bayes in Java
Overview:
This document outlines the implementation of a Naïve Bayes Classifier in Java using the
Weka API to classify a set of documents.
The model computes Accuracy, Precision, and Recall to evaluate its performance.
Assumptions:
- The dataset is in ARFF format (Weka-compatible) and contains labeled document data.
- The Java program uses Weka’s NaiveBayes class.
Java Code:
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.Evaluation;
import java.util.Random;

public class DocumentClassifier {
    public static void main(String[] args) throws Exception {
        // Load the ARFF dataset (file name assumed); the last attribute is the class label
        Instances data = DataSource.read("documents.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Build a Naive Bayes classifier and evaluate it with 10-fold cross-validation
        NaiveBayes nb = new NaiveBayes();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(nb, data, 10, new Random(1));

        // Output metrics
        System.out.println("Accuracy: " + eval.pctCorrect());
        System.out.println("Precision: " + eval.precision(1));
        System.out.println("Recall: " + eval.recall(1));
    }
}
Explanation:
This Java program loads a document dataset, builds a Naïve Bayes classifier using Weka,
and evaluates its performance with 10-fold cross-validation. It prints Accuracy, Precision, and
Recall for class '1'.
Bayesian Network for Heart Disease Diagnosis
Overview:
This program demonstrates how to construct and use a Bayesian Network to diagnose heart
disease using a medical dataset.
The model is implemented in Python using the `pgmpy` library.
Dataset:
We use a simplified version of the UCI Heart Disease dataset with attributes such as Age, Sex, ChestPainType, Cholesterol, and HeartDisease.
Python Code:
import pandas as pd
from pgmpy.models import BayesianNetwork
from pgmpy.estimators import MaximumLikelihoodEstimator
from pgmpy.inference import VariableElimination

# Load the dataset (file name assumed)
data = pd.read_csv("heart.csv")

# Structure: each predictor is a parent of the target variable
model = BayesianNetwork([("Age", "HeartDisease"), ("Sex", "HeartDisease"),
                         ("ChestPainType", "HeartDisease"), ("Cholesterol", "HeartDisease")])

# Learn the conditional probability distributions from the data
model.fit(data, estimator=MaximumLikelihoodEstimator)

# Perform inference
inference = VariableElimination(model)
result = inference.query(variables=["HeartDisease"],
                         evidence={"Age": 56, "Sex": 1, "ChestPainType": 3, "Cholesterol": 236})
print(result)
Explanation:
This code builds a Bayesian Network with connections from predictors to the target variable
(HeartDisease).
The `MaximumLikelihoodEstimator` learns the CPDs from the data.
The `VariableElimination` module performs inference to diagnose if a patient with given
attributes has heart disease.
Clustering with EM and K-Means Algorithms - Implementation & Comparison
Overview:
This program applies the Expectation-Maximization (EM) algorithm and the K-Means
algorithm to cluster a dataset stored in a CSV file.
It uses Python's scikit-learn library for implementation and compares the results using
silhouette score for clustering quality.
Python Code:
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
# Load dataset
data = pd.read_csv("data.csv")
X = StandardScaler().fit_transform(data)
# K-Means Clustering
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans_labels = kmeans.fit_predict(X)
kmeans_score = silhouette_score(X, kmeans_labels)
print("K-Means Silhouette Score:", kmeans_score)

# EM Clustering (Gaussian Mixture)
gmm = GaussianMixture(n_components=3, random_state=42)
em_labels = gmm.fit_predict(X)
em_score = silhouette_score(X, em_labels)
print("EM (GMM) Silhouette Score:", em_score)
Explanation:
Both K-Means and EM algorithms aim to group similar data points together:
- K-Means partitions the dataset into clusters by minimizing the sum of squared distances.
- EM uses a probabilistic model (Gaussian Mixture) to estimate the likelihood of data points
belonging to clusters.
The silhouette score measures how well data points fit into their assigned clusters. Higher
scores indicate better-defined clusters.
- If the silhouette score for EM is higher than K-Means, EM provides better clustering,
especially when data clusters are elliptical.
- If K-Means performs similarly or better, it suggests that the clusters are spherical and well-
separated.
- Depending on the data shape, EM may handle overlapping clusters better than K-Means.
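For reference, the silhouette value of a single point is s = (b - a) / max(a, b), where a is the mean distance to points in its own cluster and b is the mean distance to points in the nearest other cluster; the reported score is the average of s over all points and ranges from -1 to 1.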
K-Nearest Neighbour (KNN) Algorithm - Iris Dataset Classification
Overview:
The K-Nearest Neighbour (KNN) algorithm is a simple, non-parametric method used for
classification and regression.
This implementation uses Python's scikit-learn library to classify the Iris dataset.
It prints both correct and incorrect predictions.
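Python Code:
The listing itself is not present in this document, so the following is a minimal sketch consistent with the overview; k = 3 and the 70/30 train/test split are assumptions made here.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the Iris dataset and split it into training and test sets
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# Train KNN with k = 3 neighbours
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Print each prediction, flagging correct and incorrect ones
for actual, predicted in zip(y_test, knn.predict(X_test)):
    status = "Correct" if actual == predicted else "Incorrect"
    print(f"{status}: predicted={iris.target_names[predicted]}, "
          f"actual={iris.target_names[actual]}")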
Explanation:
- The KNN algorithm classifies test samples based on the majority vote of the k-nearest
training examples.
- The Iris dataset contains three classes of 50 instances each, where each class refers to a type
of iris plant.
- The script prints the prediction results for each test sample and indicates whether each
prediction was correct or incorrect.
Locally Weighted Regression (LWR) - Implementation & Visualization
Overview:
Locally Weighted Regression (LWR) is a non-parametric algorithm used to fit a curve through
data points.
It computes weights for each training point depending on its distance to the query point and
fits a regression using weighted least squares.
Key components of the implementation (a sketch is given below):
- `kernel`: computes a Gaussian weight for each training point based on its distance to the query point.
- `predict`: computes the prediction at a query point via weighted least squares.
- Data: 100 points generated from `sin(x)` with added noise.
- Bandwidth (`tau`): determines how 'local' the fit is; smaller values track the data more closely.
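Python Code:
The listing is missing from this document, so the sketch below follows the components listed above; the bandwidth tau = 0.5, the noise level, and the x-range are assumptions made here.
import numpy as np
import matplotlib.pyplot as plt

def kernel(x_query, X, tau):
    # Gaussian weights: points near the query get weights close to 1
    return np.exp(-(X - x_query) ** 2 / (2 * tau ** 2))

def predict(x_query, X, y, tau):
    # Weighted least squares: theta = (A^T W A)^-1 A^T W y, with a local linear model
    W = np.diag(kernel(x_query, X, tau))
    A = np.column_stack([np.ones_like(X), X])
    theta = np.linalg.pinv(A.T @ W @ A) @ A.T @ W @ y
    return theta[0] + theta[1] * x_query

# 100 points from sin(x) with added noise
rng = np.random.default_rng(42)
X = np.linspace(0, 2 * np.pi, 100)
y = np.sin(X) + rng.normal(scale=0.2, size=100)

tau = 0.5  # bandwidth: smaller tau gives a more local fit
y_pred = np.array([predict(x, X, y, tau) for x in X])

plt.scatter(X, y, color="red", label="Noisy data")
plt.plot(X, y_pred, color="blue", label="LWR fit")
plt.legend()
plt.show()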
Graphical Output:
The plot produced by this code shows the noisy data (red) and the smooth curve fitted using LWR (blue).