Unit 5
1. Classification
Classification is a supervised learning technique where the objective is to categorize data into
predefined classes or labels based on certain input features.
Examples of Classification:
● Email Spam Detection: Classifying emails as "spam" or "not spam".
● Medical Diagnosis: Classifying a patient as having or not having a particular disease.
● Image Recognition: Classifying images into categories such as "cat" or "dog".
Process of Classification:
1. Training: A classifier is trained on a labeled dataset, meaning the data includes the
correct classification labels.
2. Model Building: The classifier learns patterns and rules in the data to differentiate
between classes.
3. Testing: The classifier's performance is tested on new data (often a test dataset) to
check how accurately it can classify without knowing the labels.
Common Classification Algorithms:
● Decision Trees: Simple tree-based models that split data based on feature values.
● Logistic Regression: A statistical method for binary or multi-class classification.
● Support Vector Machines (SVM): Maximizes the margin between classes by creating a
decision boundary.
● k-Nearest Neighbors (k-NN): Classifies based on the nearest neighbors in the feature
space.
● Naïve Bayes: Based on Bayes' theorem, using probabilities to classify.
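The short sketch below illustrates the train/test workflow described above using scikit-learn (assumed available); the iris dataset and the choice of a decision tree are purely illustrative.

```python
# A minimal sketch of the classification workflow: train, build, test.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                        # features and class labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier(max_depth=3)                 # training / model building
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)                              # testing on unseen data
print("Test accuracy:", accuracy_score(y_test, y_pred))
```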
2. Prediction
Prediction is a supervised learning technique where the objective is to estimate a continuous (numeric) output value for new data, based on patterns learned from historical data.
Examples of Prediction:
● Stock Price Forecasting: Predicting the future stock price of a company.
● Weather Forecasting: Predicting the temperature, rainfall, or wind speed.
● Sales Prediction: Estimating future sales for a product based on historical data.
Process of Prediction:
1. Training: A predictive model is trained on a dataset that includes both input features and
the output values.
2. Model Building: The model learns relationships between input features and the target
variable.
3. Testing and Validation: The model's performance is evaluated on unseen data to see
how accurately it can predict new outcomes.
Common Prediction Algorithms:
● Linear Regression: Predicts a continuous target by fitting a line through the data points.
● Polynomial Regression: Extends linear regression by fitting a polynomial curve for
more complex relationships.
● Decision Trees and Random Forests: Used for both classification and regression
tasks, creating models based on tree structures.
● Neural Networks: Complex models that capture non-linear relationships for accurate
predictions in large datasets.
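As a sketch of the prediction workflow, the following fits a linear regression on synthetic data; the feature (square footage) and the generated values are invented for illustration.

```python
# A minimal sketch of training and evaluating a regression (prediction) model.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(500, 3000, size=(200, 1))                 # e.g., square footage
y = 50 * X.ravel() + 20000 + rng.normal(0, 10000, 200)    # e.g., price with noise

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)          # learn the feature-target relationship
y_pred = model.predict(X_test)                            # predict on unseen data
print("RMSE:", mean_squared_error(y_test, y_pred) ** 0.5)
```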
Decision Tree Induction
Decision Tree Induction is a popular supervised learning technique used for both classification
and prediction tasks. A decision tree is a tree-like model where each internal node represents a
decision based on the value of a specific attribute, each branch represents an outcome of that
decision, and each leaf node represents a class label or predicted outcome.
Structure of a Decision Tree:
● Root Node: The starting point of the tree, representing the entire dataset.
● Internal Nodes: Decision points based on certain attributes (features) of the data.
● Branches: Paths that represent the outcome of decisions.
● Leaf Nodes: Terminal nodes that give the final classification label (in classification) or
predicted value (in regression).
Steps in Decision Tree Induction:
1. Attribute Selection: At each step, the algorithm selects the attribute that best splits the data into separate classes or values. This selection is based on criteria such as information gain, gain ratio, or the Gini index (a small sketch of this computation follows the list).
2. Recursive Partitioning: The dataset is divided recursively based on the selected
attributes, creating branches at each step.
3. Stopping Criteria: The recursion stops when either all instances in a node belong to the
same class, the tree reaches a maximum depth, or a minimum number of instances in a
node is reached.
4. Pruning (optional): After the initial tree is built, pruning can be done to remove branches
that have low predictive power, reducing overfitting.
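A minimal sketch of the attribute-selection criterion mentioned in step 1, computing entropy and information gain for a hypothetical split; the labels below are invented for illustration.

```python
# Entropy and information gain for a candidate split (toy example).
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(parent_labels, child_label_groups):
    total = len(parent_labels)
    weighted_child_entropy = sum(len(g) / total * entropy(g) for g in child_label_groups)
    return entropy(parent_labels) - weighted_child_entropy

parent = ["yes", "yes", "yes", "no", "no"]
# Splitting on a hypothetical attribute partitions the labels into two pure groups:
children = [["yes", "yes", "yes"], ["no", "no"]]
print(information_gain(parent, children))   # 0.971 - 0.0 = 0.971 bits
```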
Key Concepts in Decision Tree Induction
1. Classification Trees: Used when the target variable is categorical, classifying data into
discrete classes (e.g., “spam” or “not spam”).
2. Regression Trees: Used when the target variable is continuous, predicting a real-valued
output (e.g., predicting housing prices).
Advantages and Disadvantages of Decision Trees
Advantages:
● Easy to Interpret: The tree structure is intuitive and easy to visualize and explain.
● Little Data Preparation: No need for feature scaling or normalization.
● Handles Mixed Data: Works with both numerical and categorical attributes.
Disadvantages:
● Prone to Overfitting: Large decision trees can overfit the training data, capturing noise
instead of patterns.
● Bias Toward Features with More Levels: Attributes with more levels (values) may be
favored in splitting unless properly adjusted.
● Sensitive to Data Variations: Small changes in the dataset can lead to a completely
different tree.
Let's walk through a simple example to understand decision tree induction for classification.
Suppose we have a dataset of students with attributes Studied and Slept Well and the target Passed Exam. A classification tree would first split on the attribute that best separates passing from failing students (for example, Studied), then on the remaining attribute, with each leaf predicting "Passed" or "Failed".
Similarly, consider a dataset of house prices based on features like Number of Bedrooms and Square Footage. A regression tree would repeatedly split the data on these features (for example, Square Footage > 1,500) and predict the average price of the training houses that reach each leaf, as sketched below.
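A minimal sketch of such a regression tree with scikit-learn; the handful of (bedrooms, square footage, price) records below are hypothetical stand-ins for the original table.

```python
# A regression tree for house prices on a tiny, invented dataset.
from sklearn.tree import DecisionTreeRegressor

# [bedrooms, square footage] -> price
X = [[2, 900], [3, 1500], [3, 1800], [4, 2400], [5, 3000]]
y = [150_000, 230_000, 260_000, 340_000, 420_000]

reg = DecisionTreeRegressor(max_depth=2).fit(X, y)   # each leaf predicts a mean price
print(reg.predict([[3, 1600]]))                      # estimate for a new house
```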
Bayes Classification is a probabilistic method based on Bayes' theorem, often used for
classification tasks. The key idea is to calculate the probability that a given instance belongs to
a particular class and then classify the instance to the class with the highest probability. Bayes
classification is particularly useful for text classification tasks, such as spam detection, sentiment
analysis, and document categorization.
The most common implementation of Bayes classification is the Naïve Bayes classifier, which makes the "naïve" assumption that features are conditionally independent given the class. Using Bayes' theorem, the posterior probability of class C given features X = (x1, ..., xn) is proportional to P(C) · P(x1 | C) · ... · P(xn | C), and the classifier predicts the class with the highest posterior probability.
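A minimal sketch of a Naïve Bayes text classifier with scikit-learn (assumed available); the example messages and labels are invented for illustration.

```python
# Naïve Bayes for spam-style text classification on a toy corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = ["win money now", "meeting at noon", "cheap money offer", "lunch tomorrow?"]
labels = ["spam", "ham", "spam", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(messages)            # word-count features
clf = MultinomialNB().fit(X, labels)       # estimates P(class) and P(word | class)

print(clf.predict(vec.transform(["free money"])))   # picks the class with highest posterior
```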
Advantages:
● Simple and Fast: Naïve Bayes is easy to implement and efficient in computation.
● Performs Well on Small Datasets: It can achieve good results with small training
datasets.
● Handles High-Dimensional Data: Works well with text data (e.g., spam detection).
Disadvantages:
● Independence Assumption: Assumes features are conditionally independent given the class, which rarely holds exactly in practice.
● Zero-Frequency Problem: A feature value never seen with a class during training receives zero probability unless smoothing (e.g., Laplace smoothing) is applied.
● Poorly Calibrated Probabilities: The predicted class is often correct, but the estimated probabilities themselves can be unreliable.
Note: Bayes classification is a probabilistic approach that leverages Bayes' theorem to classify data based on prior probabilities and feature likelihoods.
Rule-Based Classification
Rule-based classification assigns class labels using a set of IF-THEN rules, where each rule has two parts:
● Antecedent (IF part): Specifies the conditions under which the rule applies. This is
generally a logical condition on one or more attributes of the data.
● Consequent (THEN part): Specifies the class label assigned to instances that satisfy
the antecedent conditions.
A rule-based classifier may consist of a single rule or multiple rules to cover different situations. Building and applying such a classifier typically involves the following steps:
1. Generate Rules: Rules are derived from the training dataset. Rules can be generated
using various approaches, including decision trees, association rule mining, and direct
rule learning techniques.
2. Rule Evaluation: Each rule is evaluated for its accuracy and coverage.
○ Coverage: The proportion of instances that satisfy the antecedent conditions of
the rule.
○ Accuracy: The proportion of instances that satisfy both the antecedent
conditions and the consequent class label.
3. Rule Pruning: To prevent overfitting, redundant or overly specific rules are removed.
This is similar to pruning in decision trees.
4. Conflict Resolution: When multiple rules apply to a single instance, conflict resolution
techniques, like rule ordering or rule weighting, determine which rule to apply. For
example, rules may be ordered by priority or specificity, with more specific rules taking
precedence.
Common approaches for obtaining classification rules include:
1. Direct Rule-Based Classifiers: Rules are directly extracted from the data, often using
techniques like decision trees or association rule mining (e.g., Apriori).
2. Decision Tree-Based Rules: Decision trees can be converted into a rule set by
extracting paths from the root to each leaf node, with each path becoming an "if-then"
rule.
3. Association Rule-Based Classifiers: These use association rules to classify data by
identifying frequently co-occurring patterns in the data. A common technique for this is to
use the Apriori algorithm to generate rules.
Advantages:
● Interpretability: Rules are intuitive and easy to understand, making the model
transparent.
● Flexibility: Rules can be updated or added as needed without retraining the whole
model.
● Explainability: Provides clear explanations for classifications, useful in domains
requiring clear justification.
Disadvantages:
● Overfitting: Rule-based classifiers can easily overfit the training data if too many rules
are created.
● Scalability: Managing a large number of rules can be computationally expensive and
complex.
● Conflict Handling: When multiple rules apply, determining which rule to use can be
challenging.
Suppose we have a dataset of customer information, and we want to classify customers into "High Value" and "Low Value" categories based on attributes like Income, Age, and Spending. Example rules might be:
1. IF (Income > 50,000) AND (Spending > 2,000) THEN Class = "High Value"
2. IF (Income <= 50,000) AND (Age > 30) THEN Class = "Low Value"
3. IF (Income <= 50,000) AND (Age <= 30) AND (Spending < 1,000) THEN Class = "Low
Value"
4. IF (Spending > 5,000) THEN Class = "High Value"
Each rule will classify a customer into either the "High Value" or "Low Value" category based on
their attributes.
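These rules can be written directly as an ordered rule list in code. The sketch below is a minimal Python version using the rule-ordering strategy (first matching rule wins); the default class and the sample customer are assumptions made for illustration.

```python
# An ordered rule list implementing the customer rules above.
def classify_customer(income, age, spending):
    rules = [
        (lambda: income > 50_000 and spending > 2_000, "High Value"),                  # Rule 1
        (lambda: income <= 50_000 and age > 30, "Low Value"),                          # Rule 2
        (lambda: income <= 50_000 and age <= 30 and spending < 1_000, "Low Value"),    # Rule 3
        (lambda: spending > 5_000, "High Value"),                                      # Rule 4
    ]
    for condition, label in rules:      # rule ordering: the first rule that fires is used
        if condition():
            return label
    return "Low Value"                  # assumed default class when no rule applies

print(classify_customer(income=60_000, age=28, spending=3_500))   # -> "High Value"
```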
When multiple rules match the same instance, conflict resolution techniques help in choosing
which rule to apply:
1. Rule Ordering: Rules are applied in a predefined sequence. When a rule is satisfied, it
is used to classify, and other matching rules are ignored.
2. Rule Priority: Rules are given priority scores or ranks. When conflicts occur, the rule
with the highest priority is chosen.
3. Rule Weighting: Rules may have weights based on accuracy or other metrics, with the
rule having the highest weight used for classification.
4. Rule Specificity: More specific rules are prioritized over more general rules. For
instance, a rule with three conditions may be considered more specific than a rule with
two conditions.
Applications of Rule-Based Classification
Rule-based classifiers are widely used in domains that require transparent, justifiable decisions, such as credit scoring, medical diagnosis, and fraud detection. As an example, consider a sample dataset with attributes for Age, Education Level, and Marital Status used to predict whether a person is Eligible for a Loan.
Generating Rules
1. IF (Age > 30) AND (Education Level = Bachelor or higher) THEN Eligible for Loan =
"Yes"
2. IF (Age < 30) AND (Education Level = High School) THEN Eligible for Loan = "No"
Each rule captures a pattern in the data, and the classifier uses these rules to predict whether
new applicants are eligible for a loan.
Model Evaluation
Classification models are commonly evaluated using metrics such as accuracy, precision, recall, the F1 score, and the area under the ROC curve (AUC). Another key tool is the Confusion Matrix: a table that summarizes the model's predictions against actual outcomes, providing insight into true positives, true negatives, false positives, and false negatives.
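A short sketch of computing a confusion matrix and related metrics with scikit-learn; the label vectors below are invented for illustration.

```python
# Confusion matrix and basic metrics for a toy set of predictions.
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))          # rows = actual classes, columns = predicted
print("accuracy:", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))
```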
1. Holdout Method: Split the dataset into a training set and a test set (e.g., 70-30 or 80-20
split). The model is trained on the training set and evaluated on the test set. This method
is simple but may yield variable results depending on the split.
2. Cross-Validation:
○ K-Fold Cross-Validation: Divides the data into k subsets (folds). The model is trained k times, each time using k−1 folds for training and the remaining fold for testing. The average of the k scores is taken as the final performance metric.
○ Leave-One-Out Cross-Validation (LOOCV): A special case of k-fold cross-validation where k equals the number of instances in the dataset. Each instance is used once for testing while all others are used for training.
○ Stratified K-Fold Cross-Validation: Ensures each fold has a representative
proportion of classes, making it suitable for imbalanced datasets.
3. Bootstrapping: A sampling technique that involves repeatedly sampling with
replacement from the original dataset to create multiple training sets. The model is
evaluated on the remaining instances each time. Bootstrapping is particularly useful
when the dataset is small.
Model Selection
Selecting the best model is not only about accuracy; it also considers factors such as interpretability, training and prediction cost, and robustness. For example, to choose between a Naïve Bayes classifier and an SVM, the workflow might be:
1. Data Splitting: Split the dataset into training and test sets (e.g., 80-20).
2. Model Training:
○ Train both Naïve Bayes and SVM models on the training set.
3. Cross-Validation: Use 5-fold cross-validation on each model to get average accuracy,
precision, recall, and F1 Score.
4. Evaluate Models:
○ Calculate performance metrics for each model on the test set.
○ Examine the confusion matrix to understand how each model handles false
positives and false negatives.
5. Model Selection:
○ If the SVM shows higher F1 Score and AUC than Naïve Bayes and meets
interpretability or computational requirements, select SVM.
○ If Naïve Bayes performs similarly with lower computation, it may be selected due
to simplicity.
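A minimal sketch of this comparison workflow with scikit-learn; the breast-cancer dataset, the F1 scoring choice, and the default hyperparameters are illustrative assumptions.

```python
# Comparing Naïve Bayes and an SVM with 5-fold cross-validation and a held-out test set.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)  # 1. split

for name, model in [("Naive Bayes", GaussianNB()), ("SVM", SVC())]:                        # 2. train
    cv_f1 = cross_val_score(model, X_train, y_train, cv=5, scoring="f1")                   # 3. cross-validate
    test_acc = model.fit(X_train, y_train).score(X_test, y_test)                           # 4. evaluate
    print(f"{name}: mean CV F1 = {cv_f1.mean():.3f}, test accuracy = {test_acc:.3f}")
# 5. select the model whose metrics and computational cost best fit the application
```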
Improving classification accuracy is essential to building robust models that make reliable
predictions. Several techniques can help enhance the performance of classification models,
from data preprocessing and feature engineering to advanced modeling methods. Here are
some of the most effective techniques for improving classification accuracy:
1. Data Preprocessing
Quality data is the foundation of any accurate model. Preprocessing steps that improve data
quality can significantly boost accuracy.
● Handling Missing Values: Replace or impute missing values to avoid data sparsity,
which can affect model performance. Imputation techniques include mean, median,
mode replacement, or advanced methods like k-nearest neighbors (KNN) imputation.
● Data Normalization or Standardization: Scaling features to a standard range (e.g., 0–1
or z-scores) can improve convergence and performance for algorithms sensitive to
feature scales, such as SVM, KNN, and neural networks.
● Dealing with Outliers: Outliers can skew model results, especially in models sensitive
to extremes (e.g., linear regression). Identifying and handling outliers through capping or
removal can lead to better generalization.
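A short sketch of these preprocessing steps with scikit-learn; the tiny array with missing values is invented for illustration.

```python
# Imputing missing values and standardizing feature scales.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X = np.array([[25.0, 50_000], [32.0, np.nan], [np.nan, 72_000], [41.0, 61_000]])

X_imputed = SimpleImputer(strategy="median").fit_transform(X)   # fill missing values
X_scaled = StandardScaler().fit_transform(X_imputed)            # z-score standardization
print(X_scaled.round(2))
```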
2. Feature Engineering
Good features can improve model accuracy by making patterns in data more accessible to the
algorithm.
● Feature Selection: Identify and retain only the most relevant features. This can reduce
noise and improve model performance. Methods for feature selection include:
○ Filter methods: Correlation analysis, chi-square tests.
○ Wrapper methods: Recursive Feature Elimination (RFE), forward and backward
selection.
○ Embedded methods: Regularization techniques like Lasso (L1) and Ridge (L2),
which penalize less relevant features.
● Feature Extraction and Transformation: Create new features by combining or
transforming existing ones. For example, adding polynomial or interaction terms can help
linear models capture nonlinear relationships.
● Dimensionality Reduction: Techniques like Principal Component Analysis (PCA) and
Linear Discriminant Analysis (LDA) can reduce dimensionality while retaining the most
informative features, improving model accuracy and efficiency.
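A short sketch of a filter-based feature selection step and PCA with scikit-learn; the dataset and the choices of k=10 features and 5 components are illustrative.

```python
# Feature selection (filter method) and dimensionality reduction (PCA).
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = load_breast_cancer(return_X_y=True)

X_selected = SelectKBest(score_func=f_classif, k=10).fit_transform(X, y)  # keep 10 best features
X_reduced = PCA(n_components=5).fit_transform(X)                          # project onto 5 components
print(X.shape, X_selected.shape, X_reduced.shape)
```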
3. Model Selection and Hyperparameter Tuning
Choosing the right algorithm and tuning its hyperparameters can significantly impact classification accuracy.
● Algorithm Choice: Some models may naturally perform better on specific datasets.
Experiment with different algorithms (e.g., Decision Trees, SVM, Random Forest,
Gradient Boosting, and Neural Networks) to see which provides the highest baseline
accuracy.
● Hyperparameter Tuning: Use techniques like Grid Search or Random Search to
optimize hyperparameters for maximum accuracy. Bayesian Optimization or genetic
algorithms can provide a more efficient search for the best parameters, especially for
complex models.
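A minimal sketch of grid search with scikit-learn; the SVM parameter grid values are illustrative starting points, not tuned recommendations.

```python
# Hyperparameter tuning with grid search and 5-fold cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.001]}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)   # best combination found by cross-validation
```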
4. Ensemble Learning
Ensemble methods combine the predictions of multiple models to improve accuracy and
robustness.
● Bagging (Bootstrap Aggregating): Trains multiple models independently on random
subsets of the training data. The final prediction is the average (for regression) or
majority vote (for classification). Random Forest is a popular bagging algorithm.
● Boosting: Sequentially builds a series of models where each model attempts to correct
the errors of the previous one. Gradient Boosting and AdaBoost are widely used
boosting methods.
● Stacking: Combines different models by training a “meta-model” that learns how to best
aggregate the base models' predictions. Stacking can significantly enhance
performance, especially when diverse algorithms are combined.
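A sketch comparing bagging, boosting, and stacking with scikit-learn; the dataset and hyperparameters are illustrative.

```python
# Bagging (Random Forest), boosting (Gradient Boosting), and stacking side by side.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

models = {
    "Bagging (Random Forest)": RandomForestClassifier(n_estimators=200, random_state=0),
    "Boosting (Gradient Boosting)": GradientBoostingClassifier(random_state=0),
    "Stacking": StackingClassifier(
        estimators=[("rf", RandomForestClassifier(random_state=0)),
                    ("gb", GradientBoostingClassifier(random_state=0))],
        final_estimator=LogisticRegression(max_iter=1000),   # meta-model over base predictions
    ),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))
```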
5. Cross-Validation and Resampling
Resampling methods can help reduce overfitting and provide a more accurate measure of model performance.
● K-Fold Cross-Validation: Splits data into k folds and trains k different models, each using k−1 folds for training and the remaining fold for validation. The final score is the average of the k scores, which helps in estimating model accuracy on unseen data.
● Stratified K-Fold Cross-Validation: Ensures each fold has the same proportion of
classes as the entire dataset, which is particularly useful for imbalanced data.
● Bootstrap Resampling: Trains multiple models on random samples of the dataset,
providing an average accuracy and confidence intervals for the model's performance.
6. Handling Class Imbalance
Imbalanced datasets can lead to biased models that favor the majority class. Techniques to address class imbalance include:
● Resampling Methods:
○ Oversampling: Duplicates or generates new instances of the minority class.
Synthetic Minority Over-sampling Technique (SMOTE) is a popular method
for oversampling.
○ Undersampling: Removes instances from the majority class to balance the
dataset, which can work well if the majority class has redundant data.
● Cost-Sensitive Learning: Adjusts the model to penalize misclassification of the minority
class more heavily, making the model pay more attention to it. This is often implemented
by adjusting the loss function in algorithms like decision trees, SVMs, and neural
networks.
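A short sketch of two of these options: cost-sensitive class weighting in scikit-learn, with SMOTE shown as a commented-out alternative that assumes the imbalanced-learn package; the synthetic 90/10 dataset is illustrative.

```python
# Handling class imbalance with class weighting (and optionally SMOTE).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)  # 90/10 imbalance

# Cost-sensitive learning: penalize minority-class errors more heavily.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Oversampling the minority class with SMOTE (requires the imbalanced-learn package):
# from imblearn.over_sampling import SMOTE
# X_resampled, y_resampled = SMOTE(random_state=0).fit_resample(X, y)
```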
7. Model Regularization
Regularization penalizes model complexity, which reduces overfitting and often improves accuracy on unseen data.
● Lasso (L1) Regularization: Adds a penalty equal to the absolute value of the magnitude
of coefficients. It forces some coefficients to zero, effectively selecting a simpler model.
● Ridge (L2) Regularization: Adds a penalty equal to the square of the coefficients'
magnitude, preventing coefficients from becoming too large.
● Elastic Net: Combines L1 and L2 penalties, balancing feature selection and
regularization, particularly useful when there are highly correlated features.
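A short sketch of L1, L2, and elastic net penalties applied to logistic regression in scikit-learn; the regularization strengths and solver choices are illustrative.

```python
# Lasso (L1), Ridge (L2), and Elastic Net regularization with logistic regression.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

lasso = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)    # drives weights to zero
ridge = LogisticRegression(penalty="l2", C=1.0).fit(X, y)                        # keeps weights small
enet = LogisticRegression(penalty="elasticnet", solver="saga",
                          l1_ratio=0.5, C=1.0, max_iter=5000).fit(X, y)          # mix of L1 and L2

print("non-zero L1 coefficients:", (lasso.coef_ != 0).sum())
```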
8. Neural Network-Specific Techniques
If using neural networks, several specific techniques can help improve classification accuracy:
● Batch Normalization: Normalizes the output of each layer, reducing internal covariate
shift and enabling faster convergence and better accuracy.
● Dropout: A form of regularization where a fraction of nodes in a layer are randomly
"dropped out" during each training iteration to prevent overfitting.
● Data Augmentation: In image or text classification, augmenting the training data by
applying transformations (e.g., rotation, translation) can improve generalization.
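A minimal sketch of batch normalization and dropout in a small Keras model (TensorFlow assumed available); the layer sizes, dropout rate, and 20-feature input are illustrative.

```python
# Batch normalization and dropout inside a small binary classifier.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.BatchNormalization(),   # normalize layer outputs for faster convergence
    tf.keras.layers.Dropout(0.3),           # randomly drop 30% of units to reduce overfitting
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```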
9. Active Learning
In cases where labeling data is expensive, active learning helps by focusing on the most
informative instances. The model actively selects samples that are difficult to classify and
requests additional information, which can lead to better accuracy with fewer labeled examples.
Suppose we’re building a spam classifier for emails using logistic regression. After training a
baseline model, we find it has an accuracy of 82%. We could apply some techniques to improve
this:
1. Data Preprocessing: Remove irrelevant information, normalize text lengths, and handle
missing values.
2. Feature Engineering: Extract new features like word count, presence of common spam
words, or sender domain.
3. Model Tuning: Use grid search to tune hyperparameters like the regularization strength.
4. Ensemble Learning: Combine logistic regression with decision trees or Naïve Bayes
using a voting classifier.
5. Cross-Validation: Perform 10-fold cross-validation to ensure stability across different
data splits.
After implementing these techniques, the spam classifier achieves 89% accuracy, with better
performance on minority classes like rare spam types.
Classification by Backpropagation
Classification by backpropagation trains an artificial neural network by repeatedly adjusting its weights to reduce the difference between predicted and actual class labels. Key concepts include:
1. Artificial Neural Network (ANN): ANNs are composed of layers of neurons (nodes)
connected by weights. Each neuron processes input data and passes it to subsequent
layers until it reaches the output layer, where the final classification occurs.
2. Forward Pass: In the forward pass, the input data moves through the network's layers,
producing an output. During this pass, the weighted sum of inputs is computed at each
neuron and passed through an activation function (e.g., ReLU, sigmoid) to introduce
non-linearity.
3. Loss Function: After the forward pass, the network's output is compared to the actual
target values using a loss function. Common loss functions for classification include
cross-entropy loss for multi-class classification and binary cross-entropy for binary
classification.
4. Backpropagation (Backward Pass): Backpropagation is a process of calculating the
gradient of the loss function with respect to each weight in the network. It uses the chain
rule to propagate errors backward through the network, layer by layer, adjusting weights
to minimize the error in future predictions.
5. Gradient Descent: The backpropagation algorithm relies on gradient descent or its
variants (e.g., stochastic gradient descent, Adam optimizer) to update the weights by
moving them in the direction that reduces the loss.
Steps in Training with Backpropagation:
1. Initialization: Initialize weights and biases randomly or with small values close to zero.
2. Forward Pass: Feed the input through the network layer by layer until reaching the
output layer. Compute the weighted sums and apply activation functions at each layer.
3. Loss Computation: Calculate the loss (error) between the predicted output and the
actual target output using the loss function.
4. Backward Pass (Backpropagation):
○ Compute the gradient of the loss function with respect to each weight in the
network.
○ Starting from the output layer, propagate these gradients backward to update
weights and biases of each neuron in each layer using the chain rule.
5. Weight Update: Adjust the weights and biases by subtracting the product of the learning
rate and the gradient. Repeat the forward and backward passes for a specified number
of epochs or until the loss converges to a minimum.
6. Iterate: Repeat the forward and backward passes for each sample (or mini-batch) over
multiple epochs until the network converges on a low-error solution.
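A minimal from-scratch sketch of this loop in NumPy for a one-hidden-layer network with sigmoid activations and squared-error loss; the tiny XOR-style dataset, layer sizes, and learning rate are assumptions made for illustration.

```python
# Forward pass, backward pass, and gradient-descent weight updates by hand.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # toy inputs
y = np.array([[0], [1], [1], [0]], dtype=float)               # toy binary targets

rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.5, (2, 4)), np.zeros((1, 4))          # input -> hidden
W2, b2 = rng.normal(0, 0.5, (4, 1)), np.zeros((1, 1))          # hidden -> output

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for epoch in range(5000):
    # Forward pass: weighted sums followed by the activation function.
    h = sigmoid(X @ W1 + b1)            # hidden activations
    y_hat = sigmoid(h @ W2 + b2)        # predicted probabilities

    # Backward pass: gradients of the squared-error loss via the chain rule.
    d_out = (y_hat - y) * y_hat * (1 - y_hat)    # delta at output layer
    d_hid = (d_out @ W2.T) * h * (1 - h)         # delta at hidden layer

    # Weight update: subtract learning rate times the gradient.
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_hid;  b1 -= lr * d_hid.sum(axis=0, keepdims=True)

print(np.round(y_hat, 2))   # should move toward [0, 1, 1, 0] as training converges
```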
Activation Functions
Activation functions introduce non-linearities into the network, enabling it to learn complex patterns. Common activation functions include:
● Sigmoid: Squashes inputs into the range (0, 1); often used for binary classification outputs.
● Tanh: Squashes inputs into the range (−1, 1).
● ReLU (Rectified Linear Unit): Outputs max(0, x); widely used in hidden layers because it mitigates vanishing gradients.
● Softmax: Converts a vector of scores into a probability distribution over classes; used in the output layer for multi-class classification.
Advantages of Classification by Backpropagation:
● Flexibility: Can classify data into multiple categories, making it ideal for complex, multi-class problems.
● Adaptability: Backpropagation can be applied to a wide range of architectures, from
simple feedforward neural networks to deep convolutional and recurrent networks.
● Non-linear Decision Boundaries: With multiple hidden layers and non-linear activation
functions, networks trained with backpropagation can capture complex, non-linear
relationships in data.
Example: Handwritten Digit Classification (MNIST)
1. Data: Use the MNIST dataset, which contains 60,000 training images and 10,000 testing images of digits (0-9).
2. Network Architecture:
○ Input Layer: 784 neurons (one for each pixel in a 28x28 image).
○ Hidden Layers: Two hidden layers with 128 and 64 neurons respectively, each
using the ReLU activation function.
○ Output Layer: 10 neurons (one for each digit), using the Softmax activation
function.
3. Training:
○ Forward Pass: Each image is passed through the network, producing a
probability distribution over the 10 possible digits.
○ Loss Calculation: Cross-entropy loss is used to calculate the difference
between predicted and actual labels.
○ Backpropagation: Calculate gradients of the loss with respect to weights and
biases, propagating backward to adjust them.
○ Weight Update: Update weights using stochastic gradient descent or a similar
optimizer.
4. Result: After training, the model is evaluated on the test set. If the network is trained
properly, it should achieve high accuracy (often above 95%) on the test set.
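A compact sketch of this MNIST setup using Keras (TensorFlow assumed available); the optimizer, batch size, and number of epochs are illustrative choices.

```python
# 784-128-64-10 network trained with backpropagation on MNIST.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0       # scale pixels to [0, 1]

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),       # 784 input values per image
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),     # probability over the 10 digits
])
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=32)     # forward pass + backpropagation per batch
print(model.evaluate(x_test, y_test))                    # loss and accuracy on the test set
```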
Common Challenges and Solutions:
● Overfitting: When the network fits the training data too closely, it performs poorly on unseen data.
○ Solution: Use regularization techniques (like L2 regularization), dropout, or data
augmentation.
● Vanishing Gradient: Gradients can become very small in deeper layers, slowing down
or stopping learning.
○ Solution: Use ReLU or other activation functions that mitigate vanishing
gradients.
● Computational Intensity: Backpropagation requires significant computational power,
especially in deep networks.
○ Solution: Use mini-batch gradient descent and optimize code with GPUs or
specialized hardware (e.g., TPUs).
Advantages:
● Highly effective for complex classification tasks.
● Can handle large amounts of data and learn intricate patterns.
● Adaptable to various architectures, including convolutional and recurrent neural
networks.
Limitations:
● Requires large amounts of labeled data and significant computational resources to train well.
● Sensitive to hyperparameters such as learning rate, network architecture, and weight initialization.
● Acts largely as a "black box", making individual predictions hard to interpret.
Summary
Classification by backpropagation trains a neural network by repeatedly performing a forward pass, measuring the loss, propagating gradients backward through the layers, and updating the weights until the network can accurately separate the classes.
K-Nearest Neighbor (K-NN) Classifier
The K-Nearest Neighbor (K-NN) classifier is a simple, yet effective, algorithm used in machine
learning for both classification and regression tasks. It is a lazy learning algorithm, meaning
that it doesn’t require training before making predictions. Instead, it memorizes the training
dataset and makes predictions based on the similarity of a new data point to the points in the
training set.
Key Characteristics of K-NN:
1. Instance-based Learning: K-NN stores the training dataset and makes decisions based
on the distance between the query point and its nearest neighbors.
2. No Model Building: Unlike other algorithms (like decision trees, SVMs, etc.), K-NN does
not build an explicit model during training. It only requires data storage and comparison
during prediction.
3. Distance Metric: K-NN uses a distance metric to measure the similarity between data points. The most common distance metrics are:
○ Euclidean Distance: d(x, y) = sqrt(Σ (xi − yi)²), the straight-line distance between two points.
○ Manhattan Distance: d(x, y) = Σ |xi − yi|, the sum of absolute differences.
○ Minkowski Distance: A generalization of both, d(x, y) = (Σ |xi − yi|^p)^(1/p).
How K-NN Works:
The K-NN algorithm works in two main phases: training and prediction.
1. Training Phase:
○ K-NN stores all the training data (i.e., it "memorizes" the data). No model is
created at this stage.
2. Prediction Phase:
○ To classify a new data point, K-NN checks the "K" nearest neighbors from the
training dataset.
○ The class label of the query point is determined by a majority vote (for
classification) or by averaging (for regression) the labels of the K nearest
neighbors.
3. For example:
○ In classification, the new data point is assigned the class label most frequent
among its K-nearest neighbors.
○ In regression, the predicted value is the average of the output values of its
K-nearest neighbors.
Steps in the K-NN Algorithm:
1. Select the number K: Choose the number of nearest neighbors to consider (typically an odd number to avoid ties in binary classification).
2. Distance Calculation: Calculate the distance between the new data point and all other
points in the training set.
3. Find the K Nearest Neighbors: Sort the distances in ascending order and select the top
K neighbors.
4. Majority Voting: For classification tasks, the predicted class label of the new data point
is the majority class of the K nearest neighbors.
5. Return the Predicted Label: Output the class label based on the majority vote.
It is common practice to experiment with different values of K using cross-validation to find the
optimal K.
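A short sketch of trying several values of K with cross-validation in scikit-learn; the iris dataset and the candidate K values are illustrative.

```python
# Searching for a good K with 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

for k in [1, 3, 5, 7, 9]:
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print(f"K={k}: mean CV accuracy = {score:.3f}")
```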
Advantages of K-NN:
● Simple and Intuitive: Easy to understand and implement.
● No Training Phase: New data can be added at any time without retraining a model.
● Handles Multi-Class Problems: The majority vote extends naturally to any number of classes.
Disadvantages of K-NN:
1. Computational Complexity: As K-NN stores all the training data, making predictions
can be slow, especially when the dataset is large.
○ The computational cost increases with the number of data points and the number
of features.
2. Memory Usage: K-NN requires storing the entire training set in memory, which can be
inefficient for large datasets.
3. Sensitivity to Irrelevant Features: K-NN may perform poorly if the dataset has many
irrelevant features, as the distance calculation will be affected by these features.
4. Curse of Dimensionality: As the number of features increases, the distance between
points becomes less informative (all points become equidistant in very high-dimensional
spaces). This is known as the curse of dimensionality.
5. Choice of Distance Metric: The choice of distance metric can heavily influence
performance. For example, Euclidean distance may not be suitable for all data types,
especially if the features are not on the same scale.
Let’s consider a simple example of classifying data points based on their distance from
neighbors.
Problem:
We have a small training dataset of points with two features (X1 and X2) and two class labels (A and B), and we want to classify a new point at (5, 5).
Solution:
Choose K: Let K=3 (we will consider the 3 nearest neighbors).
Calculate Distance: Compute the distance between the new point (5, 5) and each of the training points using Euclidean distance:
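A minimal sketch of this distance-and-vote computation in Python; because the original table is not reproduced here, the training points and labels below are hypothetical.

```python
# Euclidean distances from the query point, the 3 nearest neighbors, and the majority vote.
import math
from collections import Counter

training = [((1, 2), "A"), ((2, 3), "A"), ((3, 3), "A"),
            ((6, 5), "B"), ((7, 7), "B"), ((8, 6), "B")]
query = (5, 5)

distances = sorted(
    (math.dist(query, point), label) for point, label in training   # Euclidean distance
)
k_nearest = distances[:3]                                            # K = 3 nearest neighbors
print(k_nearest)
print("Predicted class:", Counter(label for _, label in k_nearest).most_common(1)[0][0])
```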