
INTRODUCTION TO ARTIFICIAL INTELLIGENCE & MACHINE LEARNING

UNIT- III
Bayesian and Computational Learning: Bayes theorem, concept learning, maximum likelihood,
minimum description length principle, Gibbs Algorithm, Naïve Bayes Classifier, Instance Based
Learning: K-Nearest Neighbour learning
Introduction to Machine Learning (ML): Definition, Evolution, Need, applications of ML in industry
and real world, classification; differences between supervised and unsupervised learning paradigms.

Bayesian Learning
Bayesian learning is a statistical approach that applies Bayes' theorem to update the probability of
a hypothesis as new evidence or data becomes available. It combines prior knowledge (prior
probability) with new data (likelihood) to form a posterior probability, which gives a more refined
belief about the hypothesis. This approach allows for the incorporation of uncertainty in model
parameters, making it useful in situations where data is noisy or incomplete.

Computational Learning
Computational learning refers to the study and application of algorithms that enable machines to
improve their performance over time through experience or data. It involves the development of
models and algorithms that can learn from data, whether it's supervised, unsupervised, or
reinforcement learning. Computational learning focuses on the efficiency and scalability of learning
algorithms, often involving optimization techniques, data structures, and statistical methods.

Bayes theorem
 Bayes’ Theorem is used to determine the conditional probability of an event.
 It is used to find the probability of an event, based on prior knowledge of conditions that
might be related to that event.

Terms Related to Bayes Theorem: Some important terms related to the concepts used in the formula are given below.
 Hypotheses: The events E1, E2, …, En occurring in the sample space are called the hypotheses.
 Prior Probability: The prior (a priori) probability is the initial probability of an event occurring before any new data is taken into account. P(Ei) is the prior probability of hypothesis Ei.
 Posterior Probability: The posterior probability is the updated probability of an event after considering new information. P(Ei|A) is the posterior probability of hypothesis Ei.
Bayes Theorem Statement

Bayes’ Theorem for a set of n events is defined as follows.

Let E1, E2, …, En be a set of events associated with the sample space S, in which all the events E1, E2, …, En have a non-zero probability of occurrence and together form a partition of S. Let A be an event from the space S for which we have to find the probability. Then, according to Bayes’ theorem,

P(Ek | A) = [P(A | Ek) ⋅ P(Ek)] / [P(A | E1)⋅P(E1) + P(A | E2)⋅P(E2) + … + P(A | En)⋅P(En)]

for k = 1, 2, 3, …, n
where,
• P(Ek) is the prior probability of hypothesis Ek, and the denominator is the total probability P(A), which is never equal to zero,
• P(A | Ek) is the probability of event A when event Ek happens,
• P(Ek | A) is the posterior probability of event Ek when event A has happened.

Example Scenario:
Imagine there's a medical test for a disease, and we want to know the probability that someone
actually has the disease, given that they tested positive for it.
Given:
P(Disease) = Probability that a person has the disease = 0.1 (10% of the population has the disease)
P(No Disease) = Probability that a person does not have the disease = 0.9 (90% of the population
does not have the disease)
P(Test Positive | Disease) = Probability of testing positive if you have the disease = 0.95 (95% true
positive rate)
P(Test Positive | No Disease) = Probability of testing positive if you do not have the disease (false
positive rate) = 0.05 (5% false positive rate)

What we want to find:


We want to find P(Disease | Test Positive), the probability that a person has the disease given that
they tested positive.

Using Bayes' Theorem:


P(Disease ∣ Test Positive) = [P(Test Positive ∣ Disease) ⋅ P(Disease)] / P(Test Positive)
Where
P(Test Positive) is the total probability of testing positive, which can be found using the law of total
probability:
P(TestPositive)= P(TestPositive∣Disease)⋅P(Disease)+P(TestPositive∣NoDisease)⋅P(NoDisease)

Step-by-step Calculation:
P(TestPositive)= P(TestPositive∣Disease)⋅P(Disease)+P(TestPositive∣NoDisease)⋅P(NoDisease)
P(TestPositive) = (0.95⋅0.1)+(0.05⋅0.9)=0.095+0.045=0.14
P(Disease | Test Positive) = (0.95 ⋅ 0.1) / 0.14 = 0.095 / 0.14 ≈ 0.679

Interpretation:
 The probability that a person actually has the disease, given that they tested positive, is about
67.9%.
 Even though the test is fairly accurate (95% true positive rate), the 5% false positive rate combined with the low prior probability of having the disease (10%) pulls the posterior probability well below 95%: roughly one in three people who test positive do not actually have the disease.
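The same calculation can be reproduced in a few lines of Python. The function below is a minimal sketch (the function and variable names are illustrative, not from any library) that applies Bayes' theorem to the disease-test numbers above.

def bayes_posterior(prior, true_positive_rate, false_positive_rate):
    """Posterior P(Disease | Test Positive) via Bayes' theorem."""
    # Law of total probability: overall chance of a positive test
    evidence = true_positive_rate * prior + false_positive_rate * (1 - prior)
    return (true_positive_rate * prior) / evidence

# Numbers from the example above
posterior = bayes_posterior(prior=0.1, true_positive_rate=0.95, false_positive_rate=0.05)
print(f"P(Disease | Test Positive) = {posterior:.3f}")  # ≈ 0.679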

Concept Learning
 It is the process of learning to categorize objects, events, or ideas based on shared features or
attributes.
 The goal of concept learning is to recognize patterns that define a particular class or category
and use those patterns to make decisions about new, unseen instances.
 Concept learning is an essential part of how we categorize the world around us.
 It allows us to use prior knowledge and observed features to group similar items together and
make decisions about new items.
 This process is used widely in machine learning (e.g., image recognition), cognitive science, and
everyday decision-making.

Key terms:
 Concept: A class or category of objects that share common features.
 Features: The attributes or properties that define the concept.
 Generalization: Learning a concept by identifying the common features across different
examples.
 Classification: Assigning new instances to a category based on the learned concept.

Example:
Let's say you are learning the concept of "Fruit" and are given a set of examples that include apple,
banana, and carrot. You must determine which features are shared by the fruits.
 Step 1: Observation of Examples
o Apple: A round, sweet, red or green fruit.
o Banana: A long, sweet, yellow fruit.
o Carrot: A long, orange vegetable, not sweet in the same way as fruits.
 Step 2: Identifying Common Features
o Fruit features (apple and banana): generally sweet taste; edible and grows on trees or plants; mostly have seeds inside.
o Non-fruit features (e.g., carrot): not as sweet; grows underground; does not have seeds in the same sense as fruits.
 Step 3: Generalizing
o Based on these observations, you can generalize the concept of "fruit" as a sweet, edible
item that typically grows on a tree or plant and contains seeds inside.
o With this concept in mind, you would classify apples and bananas as fruits and carrots as
not a fruit.

Maximum Likelihood Estimation (MLE)


 It is a statistical method used to estimate the parameters of a probability distribution or model, based on observed data.
 The goal of MLE is to find the values of the parameters that make the observed data most likely under the assumed model.

Key Idea:
In simple terms, Maximum Likelihood asks:
What parameters of the model maximize the likelihood of observing the given data?
Steps of Maximum Likelihood Estimation:
 Define the likelihood function: This function expresses the probability of observing the data
given a set of parameters.
 Maximize the likelihood: Find the parameter values that maximize this function.
 A typical illustration of this idea shows multiple attempts at fitting a parametric density estimate (a bell curve) over the same random sample of data.
 Poorly fitted probability density functions (drawn in red in such figures) assign a low likelihood to the observed data, while the best-fitting parametric density estimate (drawn in green) assigns the highest likelihood.
 The optimum bell curve is obtained by evaluating the maximum likelihood estimate corresponding to each candidate parametric density estimate and selecting the largest one.
 Curves that fit the distribution of the data poorly therefore have a lower maximum likelihood estimate, whereas the curve that fits the data best has the maximum likelihood estimate.
 This is how the maximum likelihood estimation method works.

Probability Density Function (PDF) tells us how likely different outcomes are for a continuous
variable, while Maximum Likelihood Estimation helps us find the best-fitting model for the
data we observe.
Maximum Likelihood Estimation assumes that all Probability Density Functions are a likely
candidate to being the best fitting curve. Hence, it is computationally expensive method.
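As a concrete illustration, the sketch below fits a normal distribution to synthetic data by maximum likelihood. It assumes NumPy is available and that the data is modelled as Gaussian; for this model the likelihood-maximizing parameters have a closed form, namely the sample mean and the (biased) sample standard deviation.

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=1000)  # synthetic observations

# Closed-form maximum likelihood estimates for a Gaussian model
mu_hat = data.mean()          # MLE of the mean
sigma_hat = data.std(ddof=0)  # MLE of the standard deviation (divides by n, not n-1)

# Log-likelihood of the data under the fitted model
log_likelihood = np.sum(
    -0.5 * np.log(2 * np.pi * sigma_hat**2)
    - (data - mu_hat) ** 2 / (2 * sigma_hat**2)
)
print(mu_hat, sigma_hat, log_likelihood)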
Minimum Description Length (MDL) Principle
 MDL principle is a concept in information theory and machine learning that aims to find the
model that best explains the data while also keeping the model as simple as possible.
 It is a way of balancing model fit and complexity to avoid over-fitting.
 The core idea behind MDL is that the best model is the one that allows you to describe both
the data and the model itself using the shortest possible total description length.
 The description length of a model consists of two parts:
• The length of the model: This refers to the complexity or the number of parameters of
the model.
• The length of the data given the model: This refers to how well the model compresses
or explains the data.
 MDL principle is a model selection criterion that helps in choosing the best model among a set of
candidate models, aiming to balance model complexity and fit to the data.
 It is based on the idea that the best model is the one that allows for the shortest total
description of both the model and the data it represents. In other words, it chooses the model
that provides the most compact explanation of the data.
 The MDL principle is rooted in information theory and statistical inference, and it can be thought
of as an extension of Occam’s Razor: the simpler the model, the better, as long as it fits the
data well.

Key Ideas of MDL:


1) Data Compression: MDL principle treats the task of model selection as a compression
problem. The goal is to minimize the total length of two parts:
• The model: How much information is needed to describe the model parameters.
• The data given the model: How much information is needed to describe the data,
assuming the model is true.
2) MDL Criterion: The total description length is the sum of:
• The description length of the model: How many bits or characters are required to
describe the model.
• The description length of the data given the model: How many bits or characters
are required to describe the data, assuming the model is correct.
The model that minimizes this total description length is the best model according to the MDL
principle.

Example of MDL in Action:


 Let’s consider a simple example where we want to fit a model to a set of data points.
 Suppose we are trying to model a relationship between X (input) and Y (output) using two
different models: a simple linear model and a more complex, higher-degree polynomial model.
Example 1: Simple Model vs. Complex Model
Consider two models for a dataset of points on a 2D plane:
Model 1: A linear regression model (a straight line).
Model 2: A polynomial regression model of higher degree (e.g., cubic polynomial).
 If you have data that roughly follows a straight line but not exactly, the linear regression
model may not fit the data perfectly, but it still gives a reasonable explanation with few
parameters (just two: slope and intercept).
 On the other hand, a cubic polynomial may fit the data much better (closer to the points),
but it requires more parameters (the coefficients of the polynomial) and might even "over-
fit" the data by capturing noise.
 MDL suggests that the linear model, even though it may not fit the data perfectly, is likely
the better model because it has a shorter description length.
 It explains the data reasonably well with fewer parameters, while the cubic model, though it
fits better, is overly complex and has a longer description.
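The trade-off can be made concrete with a rough two-part code length: bits to state the model parameters plus bits to encode the residuals. The scoring rule in the sketch below is a simplified stand-in for a full MDL derivation; the 0.5·n·log2(RSS/n) term and the fixed 32 bits per parameter are illustrative assumptions, not a standard formula.

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)  # roughly linear data

def description_length(x, y, degree, bits_per_param=32):
    """Two-part code: model bits + data-given-model bits (simplified)."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    n = len(y)
    model_bits = bits_per_param * (degree + 1)                # cost of stating the parameters
    data_bits = 0.5 * n * np.log2(np.sum(residuals**2) / n)   # grows as the fit gets worse
    return model_bits + data_bits

for degree in (1, 3):
    print(degree, round(description_length(x, y, degree), 1))
# The linear model (degree 1) typically wins here: its slightly larger residuals
# cost fewer bits than the extra parameters of the cubic model.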

Example 2: Data Compression


Consider a simple example where we want to encode a sequence of data, such as a binary string. If
the sequence is:
1111111111111111 (a string of 16 ones)
• An efficient model might represent this sequence as "16 ones" (short description), rather
than storing the string directly.
• The description length of the model is short (just a count), and the data can be
reconstructed easily.
Now, if the sequence is: 1100101011011010
• There is no simple pattern or regularity, so the description length would be longer because
you'd need to store the actual string or a more complex model to encode it.
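A tiny run-length encoder makes the contrast explicit: the regular string collapses to a single (symbol, count) pair, while the irregular string needs almost as many runs as it has characters. The function below is an illustrative sketch, not a formal MDL code.

def run_length_encode(s):
    """Describe a string as a list of (symbol, run length) pairs."""
    runs = []
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        runs.append((s[i], j - i))
        i = j
    return runs

regular = "1111111111111111"    # 16 ones
irregular = "1100101011011010"  # no simple pattern

print(run_length_encode(regular))         # [('1', 16)]: a very short description
print(len(run_length_encode(irregular)))  # 12 runs: nearly as long as the raw string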

In MDL,
the goal is to choose the model that minimizes the total description length, which is a tradeoff
between:
• How well the model fits the data.
• The complexity of the model itself.
 By following the MDL principle, over-fitting (choosing overly complex models) can be avoided while still capturing the structure in the data.

Applications of the Minimum Description Length Principle
The MDL principle has important applications across various fields, particularly in machine learning, data compression, and statistical modeling.

1. Model Selection in Machine Learning


 In machine learning, the MDL principle is used for model selection by helping to find the
simplest model that still adequately explains the data.
 The idea is to prevent over-fitting, where a more complex model fits the training data very well
but performs poorly on unseen data.
Example:
 In clustering, such as with the k-means algorithm, MDL can be used to select the optimal
number of clusters (k).
 If you have a range of possible k values, you can apply MDL to balance the complexity of the
clustering model (the number of clusters) and how well the model fits the data (the distance of
data points from their cluster centroids).
 A model with too many clusters may explain the data perfectly, but it will be overly complex,
whereas a model with too few clusters might fail to capture important structure in the data.

2. Regression and Feature Selection


 In regression models, especially when dealing with a large number of potential predictors, MDL
can help in feature selection.
 It suggests the smallest set of features (predictors) that can explain the variability in the
response variable without over-fitting.
Example:
 Suppose you are trying to predict house prices based on multiple features like square footage,
number of bedrooms, location, etc.
 Using MDL, you might choose a simpler regression model with fewer predictors if adding more
predictors does not significantly improve the fit of the model to the data.
 MDL helps to strike a balance between explanatory power and model complexity.

MDL principle is a powerful tool for balancing model accuracy with simplicity across many domains.
It can be used for tasks like model selection, feature selection, data compression, decision tree
pruning, time series forecasting, and even neural network design, helping to prevent over-fitting
and ensuring that models generalize well to new, unseen data.

Over-fitting
Over-fitting occurs when a model is too complex and learns the details or noise in the training data
to the extent that it negatively impacts the performance of the model on new, unseen data.
Essentially, the model "memorizes" the training data instead of generalizing from it.
Under-fitting
Under-fitting happens when a model is too simple to capture the underlying structure of the data.
It fails to learn the patterns in the training data adequately and, as a result, has poor performance
both on the training data and on unseen data.
Gibbs Algorithm
 Gibbs Sampling algorithm is a Markov Chain Monte Carlo (MCMC) technique that is used to
generate samples from complex multivariate distributions.
 It is especially helpful when you have a joint distribution that is difficult to sample from directly
but where conditional distributions of individual variables are easier to sample from.
 The idea behind Gibbs sampling is to iteratively update the values of each variable in the model
by sampling from its conditional distribution, given the current values of the other variables.
 The Problem: Sampling from a Multivariate Distribution
 In many problems, especially in Bayesian inference, we often need to sample from a multivariate
distribution P(X1,X2,...,Xn).
 However, this joint distribution is often complex, making direct sampling infeasible.
 Gibbs sampling algorithm addresses this by breaking the joint distribution into conditional
distributions.
 The central idea of Gibbs sampling is that although we may not be able to directly sample from
the joint distribution, we can sample from the conditional distribution of each variable, given
the current values of the other variables.

How Gibbs Sampling Works


 The algorithm generates samples from a target distribution by iteratively sampling from the conditional distributions of each variable in the system, one at a time, and updating them based on the current state of the other variables.
 Step-by-step, the algorithm works as follows (a small code sketch follows these steps):
1) Initialize all variables X1, X2, ..., Xn with arbitrary starting values.
2) For each variable Xi in turn, draw a new value from its conditional distribution P(Xi | all other variables), using the current values of the other variables.
3) After every variable has been updated once, record the current state as one sample.
4) Repeat steps 2 and 3 for many iterations; the early ("burn-in") samples are usually discarded.
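As a small illustration, the following Python sketch applies Gibbs sampling to a standard bivariate normal distribution with correlation rho, where both conditional distributions have a simple closed form. The choice of distribution, the parameter values, and the function name are illustrative assumptions, not part of these notes.

import numpy as np

def gibbs_bivariate_normal(rho, n_samples=5000, burn_in=500, seed=0):
    """Sample (x, y) from a standard bivariate normal with correlation rho."""
    rng = np.random.default_rng(seed)
    x, y = 0.0, 0.0                   # arbitrary starting state
    cond_std = np.sqrt(1.0 - rho**2)  # standard deviation of each conditional
    samples = []
    for i in range(n_samples + burn_in):
        # Sample each variable from its conditional, given the other's current value
        x = rng.normal(loc=rho * y, scale=cond_std)  # x | y ~ N(rho*y, 1 - rho^2)
        y = rng.normal(loc=rho * x, scale=cond_std)  # y | x ~ N(rho*x, 1 - rho^2)
        if i >= burn_in:              # discard the burn-in phase
            samples.append((x, y))
    return np.array(samples)

samples = gibbs_bivariate_normal(rho=0.8)
print(np.corrcoef(samples[:, 0], samples[:, 1])[0, 1])  # should be close to 0.8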
Gibbs sampling is a powerful and widely-used Markov Chain Monte Carlo (MCMC) algorithm that
is particularly useful for sampling from complex, high-dimensional probability distributions. Here
are the main advantages and disadvantages of the Gibbs sampling algorithm:

Advantages of Gibbs Sampling:


Simplicity: The algorithm is easy to implement, especially for problems where the conditional
distributions are easy to sample from.
Convergence to Target Distribution: Gibbs sampling converges to the target joint distribution as
long as the Markov chain is irreducible and aperiodic (i.e., it can visit all possible states in the
space and does so in a way that no state is revisited in a fixed cycle).
Efficiency for Conditional Distributions: It works efficiently when the conditional distributions of
each variable are easy to sample from. This is often the case in many practical applications like
Bayesian inference.
No Need for Full Joint Distribution: Gibbs sampling does not require knowledge of the full joint
distribution of all the variables, just the conditional distributions of each variable given the
others.
Wide Applicability: It is applicable to a variety of complex models, such as those in Bayesian
statistics, image processing, machine learning, and computational biology.
Parallelizable (in some cases): In certain applications, if the conditional distributions can be
sampled independently, Gibbs sampling can be parallelized to speed up the process.

Disadvantages of Gibbs Sampling:


Slow Convergence: Gibbs sampling can suffer from slow convergence, especially in high-
dimensional spaces. It can take many iterations to move to the stationary distribution, especially
if the conditional distributions are highly correlated.
Dependence on Initialization: The algorithm’s convergence can be sensitive to the initial values
of the variables. Poor initial values may lead to slower convergence or getting stuck in local
modes.
No Guarantee of Efficient Sampling: If the conditional distributions are difficult to sample from,
Gibbs sampling may not be efficient or practical. In some cases, specialized methods may be
required to sample from certain conditionals.
Burn-in Period: The initial iterations of the chain may not represent the target distribution well
and are discarded as part of the "burn-in" phase. Identifying an adequate burn-in period can be
challenging.
Potential for High Autocorrelation: The samples from Gibbs sampling can be highly auto-
correlated, meaning that successive samples are not independent, which can lead to inefficient
use of computational resources and slow mixing of the chain.
Curse of Dimensionality: For very high-dimensional problems, the performance of Gibbs
sampling can degrade as it becomes more difficult to explore the entire parameter space
effectively.
Dependency on the Problem Structure: Gibbs sampling assumes that the conditional
distributions are easy to compute and sample from. For certain models, especially complex ones
with intricate dependencies, this might not be the case.

Although Gibbs sampling is a versatile and simple algorithm with many practical applications, its efficiency can be affected by convergence issues, autocorrelation, and difficulty in sampling from certain conditionals. It is important to assess the nature of the problem at hand to determine whether Gibbs sampling is the most suitable approach.

Naïve Bayes Classifier


 It is a probabilistic classifier based on Bayes' Theorem, which assumes that the features used
for classification are conditionally independent given the class label.
 It is widely used for classification tasks due to its simplicity, speed, and effectiveness,
especially in situations where the data is large and high-dimensional.
 Naïve Bayes is a simple yet powerful model for many real-world classification problems,
especially when computational efficiency and interpretability are important.

Where is Naïve Bayes Used?


Naïve Bayes classifiers are particularly useful in situations where:
 The dataset has many features (high dimensionality).
 Features are conditionally independent or can be assumed to be independent, which is often
true for text classification tasks.
 You need a simple and interpretable model that can be trained quickly.

Assumption
The fundamental Naive Bayes assumption is that each feature makes an independent and equal contribution to the outcome. In particular (a small code sketch follows this list):
 Feature independence: This means that when we are trying to classify something, we assume
that each feature (or piece of information) in the data does not affect any other feature.
 Continuous features are normally distributed: If a feature is continuous, then it is assumed to
be normally distributed within each class.
 Discrete features have multinomial distributions: If a feature is discrete, then it is assumed to
have a multinomial distribution within each class.
 Features are equally important: All features are assumed to contribute equally to the prediction
of the class label.
 No missing data: The data should not contain any missing values.
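To make these assumptions concrete, here is a minimal from-scratch sketch of a Gaussian Naïve Bayes classifier. NumPy is assumed to be available, and the class name, the tiny dataset, and the variance-smoothing constant are illustrative. The model estimates a prior per class and a mean and variance per feature, then predicts the class with the largest log-posterior.

import numpy as np

class GaussianNaiveBayes:
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.priors_, self.means_, self.vars_ = {}, {}, {}
        for c in self.classes_:
            Xc = X[y == c]
            self.priors_[c] = len(Xc) / len(X)     # P(class)
            self.means_[c] = Xc.mean(axis=0)       # per-feature mean within the class
            self.vars_[c] = Xc.var(axis=0) + 1e-9  # per-feature variance (+ smoothing)
        return self

    def predict(self, X):
        predictions = []
        for x in X:
            scores = {}
            for c in self.classes_:
                # log P(class) + sum of log Gaussian likelihoods (independence assumption)
                log_likelihood = -0.5 * np.sum(
                    np.log(2 * np.pi * self.vars_[c])
                    + (x - self.means_[c]) ** 2 / self.vars_[c]
                )
                scores[c] = np.log(self.priors_[c]) + log_likelihood
            predictions.append(max(scores, key=scores.get))
        return np.array(predictions)

# Tiny illustrative dataset: two features, two classes
X = np.array([[1.0, 2.1], [1.2, 1.9], [3.0, 3.9], [3.2, 4.1]])
y = np.array([0, 0, 1, 1])
model = GaussianNaiveBayes().fit(X, y)
print(model.predict(np.array([[1.1, 2.0], [3.1, 4.0]])))  # expected: [0 1]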
Advantages of Naïve Bayes Classifier
1. Simple and Easy to Implement:
o Naïve Bayes is easy to understand and implement. It is especially useful for beginners in
machine learning because of its simplicity.
2. Fast and Efficient:
o It is computationally efficient, making it ideal for large datasets. The training phase is
very fast since it only requires the calculation of probabilities.
3. Works Well with Small Datasets:
o Naïve Bayes often performs well with small datasets, especially when there is a clear
distinction between classes.
4. Handles Multi-class Problems:
o Naïve Bayes can be used for classification problems with more than two classes, making
it versatile for many applications.
5. Good Performance with High-dimensional Data:
o It performs well in high-dimensional spaces, like text classification tasks (e.g., spam
filtering or sentiment analysis), because each feature is treated independently.
6. Works Well with Categorical and Continuous Data:
o The algorithm can be used with both categorical and continuous data. For continuous
data, it assumes that the features follow a Gaussian distribution.
7. Robust to Irrelevant Features:
o Naïve Bayes can tolerate irrelevant features to some extent because it relies on
conditional independence between features, so irrelevant features have little impact on
the model’s performance.
8. Probabilistic Interpretation:
o It provides the probability of each class, which is valuable when you need to make
decisions based on the likelihood of an event.

Disadvantages of Naïve Bayes Classifier


1. Assumption of Feature Independence:
o The main limitation of Naïve Bayes is that it assumes the features are conditionally
independent, which is rarely the case in real-world data. This can lead to suboptimal
performance if the independence assumption is strongly violated.
2. Poor Performance with Correlated Features:
o If features are highly correlated, Naïve Bayes may perform poorly because it assumes no
relationship between features. This could result in inaccurate probability estimations.
3. Needs a Large Amount of Training Data for Rare Events:
o In cases of sparse data (e.g., rare classes or features), Naïve Bayes can assign zero
probability to unseen features, which can affect the model's accuracy. This can be
mitigated by using Laplace smoothing, but it's still a limitation.
4. Difficulty with Complex Relationships:
o Naïve Bayes may struggle with datasets where the relationships between features are
complex or non-linear. It is best suited for simpler problems where features are either
independent or weakly dependent.
5. Assumes Gaussian Distribution for Continuous Data:
o When used with continuous data, Naïve Bayes assumes that the data follows a normal
(Gaussian) distribution, which may not be the case in real-world data. This can lead to
poor performance if the data is not normally distributed.
6. Limited Flexibility:
o Naïve Bayes has limited flexibility compared to more complex models, like decision trees
or support vector machines, because it makes strong assumptions about the data.
7. Can Struggle with Imbalanced Data:
o In cases where the classes are imbalanced, Naïve Bayes might favor the majority class,
leading to biased predictions.

Naïve Bayes is a great choice for quick, probabilistic classification problems, especially when dealing
with text data or problems where features are relatively independent. Its speed and simplicity make
it suitable for real-time applications, but its assumptions of independence and distribution might
hinder performance in more complex scenarios.

Applications of Naïve Bayes Classifier


 Spam Filtering: Classifies emails as spam or not spam based on the likelihood of certain words
occurring in spam and non-spam emails.
 Sentiment Analysis: Analyzes the sentiment of text (e.g., tweets, product reviews) to determine
whether the opinion is positive, negative, or neutral.
 Text Classification: Used to categorize documents into predefined categories such as news
articles (sports, politics, etc.).
 Medical Diagnosis: Assesses the likelihood of diseases based on observed symptoms (e.g.,
predicting if a patient has a particular disease).
 Recommendation Systems: Based on user preferences, Naïve Bayes can classify items to
recommend to users.
 Speech Recognition: Classifies spoken words or sentences into predefined categories or
transcriptions.

Instance-based learning (IBL)


 Instance-based learning (IBL) is a type of machine learning where learning happens by storing
instances (or examples) of the training data and making predictions based on these stored
instances rather than explicitly learning a model.
 The key idea is that future predictions are made by comparing a new input instance to previously
seen instances in the training data, using some form of similarity measure.

Key Concepts:

Storage of Instances:
 The algorithm stores the entire training dataset (or selected examples) in memory.
 No explicit model is built during training.

Similarity Measure:
 When a new instance is encountered, it is compared to the stored instances using a distance or
similarity metric (e.g., Euclidean distance, cosine similarity).

Prediction: The prediction for a new instance is based on the most similar stored instances. Common
strategies include:
• k-Nearest Neighbors (k-NN): The class of the majority of the k-nearest neighbors determines the
prediction for classification tasks. For regression tasks, it’s often the average or weighted
average of the k-nearest neighbors’ outputs.
• Case-Based Reasoning: Similar instances are retrieved and adapted to make predictions or
decisions, often used in expert systems.

Advantages:
• Simple and Intuitive: IBL is easy to understand and implement.
• No Training Phase: It doesn’t require complex training like other algorithms, since it simply
stores the data.

Disadvantages:
• High Memory Usage: It stores all instances, which can become computationally expensive with
large datasets.
• Slow Prediction Time: Since predictions are based on comparisons with stored instances, it can
be slower compared to other models, especially with large datasets.

IBL is particularly useful for problems where the data is sparse or non-linear, or where it is difficult to
build an explicit model, like in certain types of classification, regression, or recommendation
systems.

K-Nearest Neighbour (K-NN)


 K-Nearest Neighbour (K-NN) is a simple, non-parametric, and widely used Supervised machine
learning algorithm for classification and regression tasks.
 It operates on the principle that data points that are close to each other are likely to have similar
labels or outcomes.
 In classification, K-NN assigns a class to a data point based on the majority class of its K closest
neighbors.
 In regression, the output is the average of the values of the K nearest neighbors.

Step-by-Step Procedure for K-NN


1. Choose the number of neighbors (K):
• Select the number of nearest neighbors (K) to consider.
• This is typically an odd number to avoid ties in classification.
2. Calculate the distance:
• For each point in the dataset, calculate the distance between the query point (the point you
want to classify or predict) and all the other points in the training set.
• Common distance metrics include:
• Euclidean Distance (most commonly used)
• Manhattan Distance
• Minkowski Distance
3. Sort the distances:
• Sort the distances calculated from the query point to all other points in the training set in
ascending order.
4. Select K closest neighbors:
• Choose the K points with the smallest distance to the query point.
5. Vote for classification (or average for regression):
• For classification: Assign the class that is most common among the K nearest neighbors.
• For regression: Assign the average value of the K nearest neighbors.
6. Return the result:
• Output the predicted class label (in classification) or predicted value (in regression).
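The procedure above maps directly onto a short from-scratch implementation. The sketch below is illustrative (the function name, the toy dataset, and K = 3 are assumptions); it uses Euclidean distance and majority voting for classification.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 2: Euclidean distance from the query point to every training point
    distances = np.linalg.norm(X_train - query, axis=1)
    # Steps 3-4: sort the distances and keep the k closest neighbours
    nearest = np.argsort(distances)[:k]
    # Step 5: majority vote among the neighbours' labels
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])
print(knn_predict(X_train, y_train, np.array([2, 2]), k=3))  # expected: A
print(knn_predict(X_train, y_train, np.array([6, 5]), k=3))  # expected: B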

Example: Predicting Customer Segments in a Retail Store


Scenario: A retail store wants to segment its customers into different groups based on their
purchasing behavior.

Steps:
 Collect customer data such as age, spending patterns, purchase frequency, and product
categories bought.
 Choose K (e.g., K = 3).
 For a new customer, calculate the distance between this new customer's data and all other
customers in the training dataset using a distance metric like Euclidean distance.
 Identify the K nearest neighbors (most similar customers).
 Assign the customer to the most frequent segment among the K nearest neighbors.
 Result: The new customer will be classified into a customer segment (e.g., high spender, bargain
shopper).

Advantages of K-NN
 Simple to understand and implement.
 No training phase: It's a lazy learner, meaning it doesn't require an explicit training phase.
 Works well with smaller datasets and when the decision boundary is non-linear.

Disadvantages of K-NN
 Computationally expensive: It needs to compute distances for all data points during prediction,
which can be slow for large datasets.
 Sensitive to irrelevant features and noisy data: The presence of irrelevant or noisy features can
significantly affect the performance of K-NN.
 Choice of K: The algorithm's performance is highly sensitive to the choice of K.

Applications of K-Nearest Neighbour


 Image Recognition: K-NN can be used to classify images based on pixel values, for example,
recognizing handwritten digits.
 Recommendation Systems: K-NN can be used to recommend products by finding users or items
that are "neighbors" of the target user or item, based on similarity in preferences.
 Medical Diagnosis: K-NN can be used for disease prediction based on symptoms. For example,
classifying whether a patient is likely to have a certain disease based on historical data of
symptoms and diagnoses.
 Anomaly Detection: K-NN can be used to detect outliers in data by looking for data points that
are far away from their neighbors (in a high-dimensional feature space).
 Text Classification: K-NN can be used to classify documents (such as emails or news articles) into
categories based on their content.
Introduction to Machine Learning (ML):
Definition of Machine Learning
Machine Learning (ML) is a branch of artificial intelligence (AI) that focuses on developing algorithms
that allow computers to learn from data and make decisions or predictions without explicit
programming. In ML, systems use patterns and inferences from past experiences (data) to improve
performance on tasks.
Evolution of Machine Learning: The evolution of machine learning is given below:
 Early Beginnings (1950s-1960s):
o Machine learning origins traced back to the development of artificial intelligence
(AI).
o Pioneers like Alan Turing and John McCarthy laid the groundwork.
o Early algorithms focused on pattern recognition, such as decision trees and simple
neural networks.
 Perceptron and Early Neural Networks (1960s-1970s):
o Frank Rosenblatt introduced the perceptron, an early neural network model.
o The perceptron was limited and led to a period of reduced interest in neural
networks, known as the AI winter.
 Statistical Methods and Probabilistic Models (1980s):
o Statistical learning theory gained traction (e.g., Support Vector Machines and
Bayesian Networks).
o Machine learning began emphasizing probabilistic models and learning from data
rather than explicit programming.
 Back propagation and Neural Networks Resurgence (1986):
o The back propagation algorithm, popularized by Geoffrey Hinton and others,
revitalized interest in neural networks.
o Led to deeper networks, laying the foundation for modern deep learning techniques.
 Rise of Support Vector Machines and Ensemble Methods (1990s):
o Support Vector Machines (SVMs) became widely used for classification problems.
o Ensemble learning techniques like Random Forests and Boosting emerged,
enhancing predictive accuracy.
 Big Data and Computational Power (2000s):
o The rise of the internet, smartphones, and sensors generated massive amounts of
data.
o Increased computational power allowed more complex models to be trained
efficiently, especially for deep learning.
 Deep Learning Revolution (2010s-Present):
o Deep Learning techniques, especially Convolutional Neural Networks (CNNs) and
Recurrent Neural Networks (RNNs), transformed fields like image recognition and
natural language processing.
o Breakthroughs in areas like AlphaGo, self-driving cars, and language models (e.g.,
GPT series) gained widespread attention.
 Explainability and Ethics (2010s-Present):
o Focus on interpretable AI and ethical concerns regarding bias, transparency, and
accountability in machine learning models.
o Growing interest in creating models that can be understood by humans, especially in
critical sectors like healthcare and finance.
 Current Trends (2020s):
o Transformers and large language models (e.g., GPT-3&4, Gemini, Copilot) are
dominating natural language processing tasks.
o On-going research in reinforcement learning, unsupervised learning.
o Machine learning is being integrated into nearly every industry, from healthcare to
finance, entertainment, and autonomous systems.

Need of Machine Learning:


 Data Explosion: The growing volume of data generated by businesses, social media, sensors,
etc., creates a need for tools to analyse and make sense of it.
 Automation: ML can automate repetitive tasks, improving efficiency and reducing human error.
 Personalization: ML helps tailor services and recommendations (like Netflix suggestions or
personalized marketing) based on user behaviour.
 Complexity: In fields like healthcare and finance, ML aids in managing complex datasets and
making better decisions that humans may not easily achieve.

Applications of ML in industry and real world


 Healthcare: ML is used for disease diagnosis, predicting patient outcomes, drug discovery, and
personalized treatment plans.
 Finance: In finance, ML helps in fraud detection, algorithmic trading, and credit scoring.
 Retail: ML powers recommendation systems (e.g., Amazon’s product suggestions), inventory
management, and customer sentiment analysis.
 Automotive: Self-driving cars use ML for object detection, navigation, and decision-making.
 Manufacturing: Predictive maintenance, quality control, and production optimization are
achieved using ML models.
 Entertainment: Platforms like ‘Netflix’, ‘YouTube’, and ‘Spotify’, use ML to recommend content
tailored to individual preferences.
 Natural Language Processing (NLP): ML helps with speech recognition, sentiment analysis, and
language translation in applications like ‘Siri’, ‘Google Translate’, and chatbots.
Machine Learning is transforming how businesses and industries operate by offering automation,
smarter decision-making, and personalized experiences.

Classification
 Classification teaches a machine to sort things into categories.
 It learns by looking at examples with labels (like emails marked “spam” or “not spam”).
 After learning, it can decide which category new items belong to, like identifying if a new email is
spam or not.
 In simpler terms, classification involves assigning input data into predefined categories or
classes.

Key Characteristics of Classification:


 Labelled Data: The algorithm is trained on a dataset where each input is paired with a correct
label (i.e., output class).
 Discrete Output: The output is categorical (e.g., "spam" or "not spam", "cat" or "dog", etc.).
 Goal: The objective is to learn a mapping from input features to the correct class label so that
the model can predict the class of new, unseen data.
 For example, a classification model might be trained on a dataset of images labelled as either dogs or cats.
 It can then be used to predict the class of new, unseen images as dogs or cats based on their features such as colour, texture and shape.
 Such a model is often visualized as a two-dimensional scatter plot: the horizontal axis represents the combined values of colour and texture features, and the vertical axis represents the combined values of shape and size features.
 Each coloured dot in the plot represents an individual image, with the colour indicating whether the model predicts the image to be a dog or a cat.
 The shaded areas in the plot show the decision boundary, which is the line or region that the model uses to decide which category (dog or cat) an image belongs to.
 The model classifies images on one side of the boundary as dogs and on the other side as cats, based on their features.

How does Classification in Machine Learning Work?


 Classification involves training a model using a labelled dataset, where each input is paired with its correct output label.
 The model learns patterns and relationships in the data so that it can later predict the category or class of new, unseen inputs.
 The working principle of classification is summarized in the following steps:
1) Data Collection: You start with a dataset where each item is labelled with the correct class (for example, “cat” or “dog”).
2) Feature Extraction: The system identifies features (like colour, shape, or texture) that help distinguish one class from another. These features are what the model uses to make predictions.
3) Model Training: The classification algorithm uses the labelled data to learn how to map the features to the correct class. It looks for patterns and relationships in the data.
4) Model Evaluation: Once the model is trained, it is tested on new, unseen data to check how accurately it can classify the items. Depending on the problem and needs, different metrics (such as accuracy, precision, or recall) can be used to measure its performance.
5) Prediction: After being trained and evaluated, the model can be used to predict the class of new data based on the features it has learned.

 If the evaluation metric is not satisfactory, the ML algorithm or its hyperparameters can be adjusted and the model retrained.
 This iterative process continues until a satisfactory performance is achieved.
 In short, classification in machine learning is all about using existing labelled data to teach the
model how to predict the class of new, unlabelled data based on the patterns it has learned.
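The whole loop can be illustrated with a short scikit-learn sketch, assuming scikit-learn is installed; the Iris dataset and logistic regression are illustrative choices rather than part of these notes.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Steps 1-2: a labelled dataset with numeric features (already prepared in Iris)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 3: model training
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Step 4: model evaluation on unseen data
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Step 5: prediction for a new, unseen measurement
print("Predicted class:", model.predict([[5.1, 3.5, 1.4, 0.2]]))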

Common Algorithms Used for Classification:


 Logistic Regression: Used for binary and multiclass classification tasks.
 Decision Trees: A tree-like model that makes decisions based on feature values.
 Random Forests: An ensemble method that combines multiple decision trees to improve
performance.
 Support Vector Machines (SVM): Used for both linear and non-linear classification tasks.
 K-Nearest Neighbours (KNN): Classifies data based on the majority class of its nearest
neighbours.
 Naive Bayes: A probabilistic classifier based on Bayes' theorem.
 Neural Networks: Deep learning models used for complex classification tasks (e.g., image or
speech classification).

Applications of Classification in Industry and the Real World


Classification algorithms are widely used in many real-world applications across various domains,
including:
 Email spam filtering
 Credit risk assessment: Algorithms predict whether a loan applicant is likely to default by
analysing factors such as credit score, income, and loan history. This helps banks make informed
lending decisions and minimize financial risk.
 Medical diagnosis : Machine learning models classify whether a patient has a certain condition
(e.g., cancer or diabetes) based on medical data such as test results, symptoms, and patient
history. This aids doctors in making quicker, more accurate diagnoses, improving patient care.
 Image classification : Applied in fields such as facial recognition, autonomous driving, and
medical imaging.
 Sentiment analysis: Determining whether the sentiment of a piece of text is positive, negative,
or neutral. Businesses use this to understand customer opinions, helping to improve products
and services.
 Fraud detection : Algorithms detect fraudulent activities by analysing transaction patterns and
identifying anomalies crucial in protecting against credit card fraud and other financial crimes.
 Recommendation systems : Used to recommend products or content based on past user
behaviour, such as suggesting movies on Netflix or products on Amazon. This personalization
boosts user satisfaction and sales for businesses.

Classification of Machine Learning (Learning Paradigms)

Supervised Learning:
 Definition: In supervised learning, the algorithm learns from labelled training data. The goal is to
map input data to the correct output using examples that have known results.
 Process: The algorithm is provided with pairs of inputs and outputs, and it learns to predict the output for unseen inputs.
 Applications:
 Classification: Categorizing data into predefined labels (e.g., email spam detection,
image recognition).
Regression: Predicting continuous values (e.g., house price prediction, stock market
forecasting).
 Examples of Algorithms:
 Linear Regression, Logistic Regression
 Decision Trees, Random Forests
 Support Vector Machines (SVM)
 Neural Networks

Unsupervised Learning:
 Definition: In unsupervised learning, the algorithm works with data that does not have labelled
outputs. The system tries to find patterns or structure within the data without prior knowledge
of what the output should be.
 Process: The algorithm identifies hidden structures or relationships in the input data, such as clustering similar data points or reducing dimensionality; a short clustering sketch follows the list of algorithms below.
 Applications:
 Clustering: Grouping similar data points (e.g., customer segmentation, anomaly
detection).
 Dimensionality Reduction: Reducing the number of features while maintaining essential
information (e.g., principal component analysis - PCA).
 Examples of Algorithms:
 K-Means Clustering, Hierarchical Clustering
 Principal Component Analysis (PCA)
 DBSCAN, Gaussian Mixture Models
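As a quick illustration of the unsupervised paradigm, the sketch below clusters unlabelled points with k-means. It assumes scikit-learn is installed; the synthetic blobs and the choice of three clusters are illustrative.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Unlabelled data: only input features, no output labels
X, _ = make_blobs(n_samples=300, centers=3, random_state=7)

# Group similar points into 3 clusters without any supervision
kmeans = KMeans(n_clusters=3, n_init=10, random_state=7)
cluster_ids = kmeans.fit_predict(X)

print(cluster_ids[:10])         # cluster assignment of the first few points
print(kmeans.cluster_centers_)  # discovered cluster centres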
Reinforcement Learning:
 Definition: In reinforcement learning (RL), the algorithm learns by interacting with an
environment. The agent (the model) takes actions and receives feedback in the form of rewards
or penalties, based on the outcomes of those actions.
 Process: The goal is to learn a policy that maximizes the cumulative reward over time. RL is used when the correct actions are not explicitly known, and the agent must learn through trial and error.
 Applications:
 Game Playing: Algorithms like AlphaGo or chess engines.
 Robotics: Teaching robots how to move or manipulate objects.
 Self-Driving Cars: Making driving decisions based on environmental interactions.
 Examples of Algorithms:
 Q-learning, Deep Q-Network (DQN)
 Policy Gradient Methods, Actor-Critic Methods

Semi-Supervised Learning:
 Definition: Semi-supervised learning is a mix of supervised and unsupervised learning, where the
algorithm is provided with a small amount of labelled data and a large amount of unlabelled
data. It uses the labelled data to guide the learning of patterns from the unlabelled data.
 Applications: This method is used when labelled data is scarce or expensive to obtain, but there
is a lot of unlabelled data available.
 Examples of Algorithms: Semi-supervised SVM, Graph-based methods.

Self-Supervised Learning:
 Definition: A form of unsupervised learning where the model learns to predict part of the input
data using other parts. It's a method that generates pseudo-labels from the data itself, and it's
often used in natural language processing and computer vision.
 Applications: Commonly used for pre-training models that can later be fine-tuned with labelled
data (e.g., BERT for NLP, contrastive learning in computer vision).
 Examples of Algorithms: Contrastive Learning, Auto-encoders
Summary of Machine Learning Paradigms:
 Supervised Learning: Uses labelled data to learn and predict outputs (classification, regression).
 Unsupervised Learning: Finds hidden patterns or structures in unlabelled data (clustering,
dimensionality reduction).
 Reinforcement Learning: Learns through interactions with an environment and feedback (trial-
and-error learning).
 Semi-Supervised Learning: A hybrid method with a small amount of labelled data and a large
amount of unlabelled data.
 Self-Supervised Learning: A form of unsupervised learning that generates pseudo-labels from
the data itself.

Differences between supervised and unsupervised learning paradigms

The comparison below is organised aspect by aspect.

Definition
• Supervised Learning: Involves training a model on labelled data (input-output pairs) to predict outputs for new data.
• Unsupervised Learning: Involves training a model on unlabelled data to discover hidden patterns or structures.

Data Type
• Supervised Learning: Requires labelled data (both input features and corresponding output labels are provided).
• Unsupervised Learning: Uses unlabelled data (only input features are provided, no corresponding output labels).

Objective
• Supervised Learning: The goal is to predict a specific output for given inputs.
• Unsupervised Learning: The goal is to find patterns or structure in the data without predefined output labels.

Output
• Supervised Learning: Discrete class labels (for classification) or continuous values (for regression).
• Unsupervised Learning: Clusters, groups, or reduced representations of the data (e.g., fewer dimensions).

Training Process
• Supervised Learning: The model learns from labelled examples and uses them to predict outputs for new inputs.
• Unsupervised Learning: The model learns by finding structure or relationships in the input data.

Evaluation
• Supervised Learning: Performance is measured by comparing the predicted output with actual labels, using metrics like accuracy, precision, recall, and mean squared error (MSE).
• Unsupervised Learning: Performance evaluation is more challenging since there are no true labels; metrics like the silhouette score and clustering quality are often used.

Feedback Type
• Supervised Learning: Direct feedback (supervision) based on labelled data; the model is told what the correct output is.
• Unsupervised Learning: No supervision; the model must infer structure without guidance (no correct labels).

Complexity
• Supervised Learning: Typically easier to train and evaluate due to the availability of labelled data.
• Unsupervised Learning: Can be more challenging because the model has to figure out useful patterns without explicit guidance.

Examples of Tasks
• Supervised Learning: Classification (predicting discrete labels, e.g., spam detection, image classification) and Regression (predicting continuous values, e.g., stock price prediction).
• Unsupervised Learning: Clustering (grouping similar data points, e.g., customer segmentation) and Dimensionality Reduction (reducing the number of features, e.g., PCA for feature extraction).

Algorithms
• Supervised Learning: Classification: Logistic Regression, Decision Trees, Random Forests, Support Vector Machines (SVM), Naive Bayes, Neural Networks. Regression: Linear Regression, Polynomial Regression, Ridge/Lasso Regression.
• Unsupervised Learning: Clustering: K-Means, DBSCAN, Hierarchical Clustering, Gaussian Mixture Models. Dimensionality Reduction: PCA (Principal Component Analysis), t-SNE, Auto-encoders.

Data Pre-processing
• Supervised Learning: Requires labelled data (input-output pairs); preprocessing might involve feature scaling or encoding labels.
• Unsupervised Learning: Requires feature extraction and transformation techniques like PCA, and often involves handling large datasets without labels.

Model Interpretability
• Supervised Learning: Models often provide clear explanations based on input-output mappings (e.g., decision boundaries in classification).
• Unsupervised Learning: Models may be harder to interpret since they group data without explicit labels (e.g., cluster centres or reduced dimensions).

Example Applications
• Supervised Learning: Email spam detection (classify emails as "spam" or "not spam"), image recognition (recognize objects in images), speech recognition (convert spoken language to text), medical diagnosis (predict disease based on medical data), sales prediction (forecast future sales from historical data).
• Unsupervised Learning: Customer segmentation (group customers by purchasing behaviour for targeted marketing), anomaly detection (identify fraud or outliers in financial transactions), topic modelling (discover underlying topics in large text corpora), recommender systems (identify hidden relationships between users and items, e.g., collaborative filtering).
Assignment 3
(Submit within a week)

1) Define Machine Learning. Explain the need and its evolution.


2) Explain about Instance based learning with example.
3) Describe the features of Bayesian learning methods.
4) Write about Bayes theorem with example.
5) “Instance based learning is lazy learning”. Justify.
6) Explain about binary classification and related tasks.
7) Describe kNN Algorithm for data classification with appropriate example.
8) Compare and contrast Supervised Learning with Unsupervised Learning.
9) Discuss in detail about Naïve Bayes classification with relevant example.
10) Explain in detail about Gibbs Algorithm.
11) Explain about minimum description length principle.
12) Discuss about various applications of ML in industry and real world.
