AIML-Unit 3 Notes-Assignment 3
UNIT- III
Bayesian and Computational Learning: Bayes theorem, concept learning, maximum likelihood,
minimum description length principle, Gibbs Algorithm, Naïve Bayes Classifier, Instance Based
Learning - K-Nearest neighbour learning
Introduction to Machine Learning (ML): Definition, Evolution, Need, applications of ML in industry
and real world, classification; differences between supervised and unsupervised learning paradigms.
Bayesian Learning
Bayesian learning is a statistical approach that applies Bayes' theorem to update the probability of
a hypothesis as new evidence or data becomes available. It combines prior knowledge (prior
probability) with new data (likelihood) to form a posterior probability, which gives a more refined
belief about the hypothesis. This approach allows for the incorporation of uncertainty in model
parameters, making it useful in situations where data is noisy or incomplete.
Computational Learning
Computational learning refers to the study and application of algorithms that enable machines to
improve their performance over time through experience or data. It involves the development of
models and algorithms that can learn from data, whether it's supervised, unsupervised, or
reinforcement learning. Computational learning focuses on the efficiency and scalability of learning
algorithms, often involving optimization techniques, data structures, and statistical methods.
Bayes theorem
Bayes’ Theorem is used to determine the conditional probability of an event.
It is used to find the probability of an event, based on prior knowledge of conditions that
might be related to that event.
Terms Related to Bayes Theorem: Some important terms related to the concepts covered in the
formula.
Hypotheses: The events E1, E2, …, En in the sample space are called the hypotheses.
Priori Probability: Priori Probability is the initial probability of an event occurring before any new
data is taken into account. P(Ei) is the priori probability of hypothesis Ei.
Posterior Probability: Posterior Probability is the updated probability of an event after
considering new information. Probability P(Ei|A) is considered as the posterior probability of
hypothesis Ei.
Bayes Theorem Statement
Let E1, E2, …, En be a set of events associated with the sample space S, in which all the events
E1, E2, …, En have a non-zero probability of occurrence and together form a partition of S. Let A be
an event from the space S for which we have to find the probability. Then, according to Bayes’ theorem,
P(Ek | A) = [P(A | Ek) ⋅ P(Ek)] / [P(A | E1)⋅P(E1) + P(A | E2)⋅P(E2) + … + P(A | En)⋅P(En)], for k = 1, 2, 3, …, n
For two events A and B this takes the simpler form
P(A | B) = [P(B | A) ⋅ P(A)] / P(B)
where,
• P(A) and P(B) are the probabilities of events A and B, and P(B) is never equal to zero.
• P(A | B) is the probability of event A when event B happens.
• P(B | A) is the probability of event B when event A happens.
Example Scenario:
Imagine there's a medical test for a disease, and we want to know the probability that someone
actually has the disease, given that they tested positive for it.
Given:
P(Disease) = Probability that a person has the disease = 0.1 (10% of the population has the disease)
P(No Disease) = Probability that a person does not have the disease = 0.9 (90% of the population
does not have the disease)
P(Test Positive | Disease) = Probability of testing positive if you have the disease = 0.95 (95% true
positive rate)
P(Test Positive | No Disease) = Probability of testing positive if you do not have the disease (false
positive rate) = 0.05 (5% false positive rate)
Step-by-step Calculation:
P(TestPositive)= P(TestPositive∣Disease)⋅P(Disease)+P(TestPositive∣NoDisease)⋅P(NoDisease)
P(TestPositive) = (0.95⋅0.1)+(0.05⋅0.9)=0.095+0.045=0.14
P(Disease | Test Positive) = (0.95 ⋅ 0.1) / 0.14 = 0.095 / 0.14 ≈ 0.679
Interpretation:
The probability that a person actually has the disease, given that they tested positive, is about
67.9%.
Even though the test is fairly accurate (95% true positive rate), the false positive rate (5%) and the
low prior probability of having the disease (10%) pull the posterior well below 95%: roughly one in
three people who test positive do not actually have the disease.
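This calculation can be reproduced in a few lines of Python; the function and argument names below are illustrative only.

```python
# A minimal sketch of the medical-test example above, using the values from the notes.
def posterior(prior, true_positive_rate, false_positive_rate):
    """Return P(Disease | Test Positive) via Bayes' theorem."""
    p_positive = true_positive_rate * prior + false_positive_rate * (1 - prior)
    return (true_positive_rate * prior) / p_positive

print(posterior(prior=0.1, true_positive_rate=0.95, false_positive_rate=0.05))
# prints roughly 0.679, matching the calculation above
```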
Concept Learning
It is the process of learning to categorize objects, events, or ideas based on shared features or
attributes.
The goal of concept learning is to recognize patterns that define a particular class or category
and use those patterns to make decisions about new, unseen instances.
Concept learning is an essential part of how we categorize the world around us.
It allows us to use prior knowledge and observed features to group similar items together and
make decisions about new items.
This process is used widely in machine learning (e.g., image recognition), cognitive science, and
everyday decision-making.
Key terms:
Concept: A class or category of objects that share common features.
Features: The attributes or properties that define the concept.
Generalization: Learning a concept by identifying the common features across different
examples.
Classification: Assigning new instances to a category based on the learned concept.
Example:
Let's say you are learning the concept of "Fruit" and are given a set of examples that include apple,
banana, and carrot. You must determine which features are shared by the fruits.
Step 1: Observation of Examples
o Apple: A round, sweet, red or green fruit.
o Banana: A long, sweet, yellow fruit.
o Carrot: A long, orange vegetable, not sweet in the same way as fruits.
Step 2: Identifying Common Features
o Fruit features:
o Generally sweet taste (applies to apple and banana)
o Edible, grows on trees or plants
o Mostly have seeds inside (apple and banana)
o Non-fruit (e.g., carrot):
o Not as sweet
o Grows underground
o Does not have seeds in the same sense as fruits
Step 3: Generalizing
o Based on these observations, you can generalize the concept of "fruit" as a sweet, edible
item that typically grows on a tree or plant and contains seeds inside.
o With this concept in mind, you would classify apples and bananas as fruits and carrots as
not a fruit.
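The generalization step can be written down as an explicit rule. The sketch below is a simplification for illustration only; the feature names and the rule itself are assumptions that mirror the observations above.

```python
# Toy illustration of the learned "fruit" concept as a simple rule over features.
examples = [
    {"name": "apple",  "sweet": True,  "grows_on_plant": True,  "has_seeds": True},
    {"name": "banana", "sweet": True,  "grows_on_plant": True,  "has_seeds": True},
    {"name": "carrot", "sweet": False, "grows_on_plant": False, "has_seeds": False},
]

def is_fruit(item):
    # Learned concept: sweet, grows on a tree/plant, and contains seeds.
    return item["sweet"] and item["grows_on_plant"] and item["has_seeds"]

for item in examples:
    print(item["name"], "->", "fruit" if is_fruit(item) else "not a fruit")
```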
Maximum Likelihood Estimation (MLE)
Key Idea:
In simple terms, Maximum Likelihood Estimation asks: which parameters of the model maximize the
likelihood of observing the given data?
Steps of Maximum Likelihood Estimation:
Define the likelihood function: This function expresses the probability of observing the data
given a set of parameters.
Maximize the likelihood: Find the parameter values that maximize this function.
The figure shows multiple attempts at fitting a parametric density estimate (bell curve) over the
random sample data. The red bell curves indicate poorly fitted probability density functions, while
the green bell curve shows the best-fitting parametric density estimate over the data.
The optimum bell curve is obtained by checking the value in the maximum likelihood estimate plot
corresponding to each parametric density estimate: the red curves fit the data poorly, so their
maximum likelihood estimates are lower, while the green PDF curve fits the data best and therefore
has the highest maximum likelihood estimate. This is how the maximum likelihood estimation
method works.
A Probability Density Function (PDF) tells us how likely different outcomes are for a continuous
variable, while Maximum Likelihood Estimation helps us find the best-fitting model for the data we
observe.
Maximum Likelihood Estimation treats every candidate probability density function as a possible
best-fitting curve and evaluates each one, which makes it a computationally expensive method.
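For a normal distribution the maximum likelihood estimates have a closed form: the sample mean and the (biased) sample variance. The sketch below uses synthetic data purely for illustration.

```python
# Minimal sketch: maximum likelihood estimation for a Gaussian on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=1000)           # the "observed" sample

mu_hat = data.mean()                                       # MLE of the mean
sigma_hat = np.sqrt(((data - mu_hat) ** 2).mean())         # MLE of the std (divides by n)

# Log-likelihood of the data under the fitted parameters
log_likelihood = np.sum(-0.5 * np.log(2 * np.pi * sigma_hat**2)
                        - (data - mu_hat) ** 2 / (2 * sigma_hat**2))
print(mu_hat, sigma_hat, log_likelihood)                   # estimates close to 5.0 and 2.0
```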
Minimum Description Length (MDL) Principle
MDL principle is a concept in information theory and machine learning that aims to find the
model that best explains the data while also keeping the model as simple as possible.
It is a way of balancing model fit and complexity to avoid over-fitting.
The core idea behind MDL is that the best model is the one that allows you to describe both
the data and the model itself using the shortest possible total description length.
The description length of a model consists of two parts:
• The length of the model: This refers to the complexity or the number of parameters of
the model.
• The length of the data given the model: This refers to how well the model compresses
or explains the data.
MDL principle is a model selection criterion that helps in choosing the best model among a set of
candidate models, aiming to balance model complexity and fit to the data.
It is based on the idea that the best model is the one that allows for the shortest total
description of both the model and the data it represents. In other words, it chooses the model
that provides the most compact explanation of the data.
The MDL principle is rooted in information theory and statistical inference, and it can be thought
of as an extension of Occam’s Razor: the simpler the model, the better, as long as it fits the
data well.
In MDL,
the goal is to choose the model that minimizes the total description length, which is a tradeoff
between:
• How well the model fits the data.
• The complexity of the model itself.
By following the MDL principle, over-fitting (choosing overly complex models) can be
avoided while still capturing the structure in the data.
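One common way to make this trade-off concrete is a two-part code length (a BIC-style approximation). The sketch below applies it to polynomial curve fitting on synthetic data; the coding scheme and the data are assumptions used only for illustration.

```python
# Toy MDL-style model selection: total length = cost of the model + cost of the data given the model.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

def description_length(degree):
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    n, k = len(y), degree + 1
    data_bits = 0.5 * n * np.log2(residuals.var() + 1e-12)   # how well the model compresses the data
    model_bits = 0.5 * k * np.log2(n)                         # complexity of the model itself
    return data_bits + model_bits

for d in (1, 3, 9):
    print(f"degree {d}: relative description length = {description_length(d):.1f} (lower is better)")
# A moderate degree typically wins: a very high degree fits the noise and is
# penalized by the model-length term, mirroring how MDL avoids over-fitting.
```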
Applications of the Minimum Description Length Principle
The MDL principle has important applications across various fields, particularly in machine learning,
data compression, and statistical modeling.
MDL principle is a powerful tool for balancing model accuracy with simplicity across many domains.
It can be used for tasks like model selection, feature selection, data compression, decision tree
pruning, time series forecasting, and even neural network design, helping to prevent over-fitting
and ensuring that models generalize well to new, unseen data.
Over-fitting
Over-fitting occurs when a model is too complex and learns the details or noise in the training data
to the extent that it negatively impacts the performance of the model on new, unseen data.
Essentially, the model "memorizes" the training data instead of generalizing from it.
Under-fitting
Under-fitting happens when a model is too simple to capture the underlying structure of the data.
It fails to learn the patterns in the training data adequately and, as a result, has poor performance
both on the training data and on unseen data.
Gibbs Algorithm
Gibbs Sampling algorithm is a Markov Chain Monte Carlo (MCMC) technique that is used to
generate samples from complex multivariate distributions.
It is especially helpful when you have a joint distribution that is difficult to sample from directly
but where conditional distributions of individual variables are easier to sample from.
The idea behind Gibbs sampling is to iteratively update the values of each variable in the model
by sampling from its conditional distribution, given the current values of the other variables.
The Problem: Sampling from a Multivariate Distribution
In many problems, especially in Bayesian inference, we often need to sample from a multivariate
distribution P(X1,X2,...,Xn).
However, this joint distribution is often complex, making direct sampling infeasible.
Gibbs sampling algorithm addresses this by breaking the joint distribution into conditional
distributions.
The central idea of Gibbs sampling is that although we may not be able to directly sample from
the joint distribution, we can sample from the conditional distribution of each variable, given
the current values of the other variables.
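As a small illustration of this idea, the sketch below uses Gibbs sampling to draw from a bivariate normal distribution, where each conditional is itself a normal distribution; the target distribution and parameter values are assumptions chosen for illustration.

```python
# Gibbs sampling sketch for a standard bivariate normal with correlation rho.
import numpy as np

rng = np.random.default_rng(42)
rho = 0.8                       # correlation between the two variables
x, y = 0.0, 0.0                 # arbitrary starting point
samples = []

for _ in range(5000):
    # Sample x from P(x | y): normal with mean rho*y and variance 1 - rho^2
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))
    # Sample y from P(y | x): normal with mean rho*x and variance 1 - rho^2
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))
    samples.append((x, y))

samples = np.array(samples)
print("empirical correlation:", np.corrcoef(samples.T)[0, 1])   # close to 0.8
```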
Gibbs sampling is a versatile and simple algorithm with many practical applications, but its efficiency
can be affected by convergence issues, autocorrelation, and difficulty in sampling from certain
conditionals. It is important to assess the nature of the problem at hand to determine whether Gibbs
sampling is the most suitable approach.
Naïve Bayes Classifier
Assumptions
The fundamental Naïve Bayes assumption is that each feature makes an independent and equal
contribution to the outcome. In particular:
Feature independence: This means that when we are trying to classify something, we assume
that each feature (or piece of information) in the data does not affect any other feature.
Continuous features are normally distributed: If a feature is continuous, then it is assumed to
be normally distributed within each class.
Discrete features have multinomial distributions: If a feature is discrete, then it is assumed to
have a multinomial distribution within each class.
Features are equally important: All features are assumed to contribute equally to the prediction
of the class label.
No missing data: The data should not contain any missing values.
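Under these assumptions the classifier applies Bayes' theorem to pick the most probable class for each instance. A brief sketch using scikit-learn's GaussianNB is shown below (assuming scikit-learn is available; the Iris dataset is only an illustrative choice).

```python
# Gaussian Naïve Bayes sketch: continuous features assumed normal within each class.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GaussianNB()
model.fit(X_train, y_train)          # training = estimating per-class means and variances

print("accuracy:", model.score(X_test, y_test))
print("class probabilities for one sample:", model.predict_proba(X_test[:1]))
```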
Advantages of Naïve Bayes Classifier
1. Simple and Easy to Implement:
o Naïve Bayes is easy to understand and implement. It is especially useful for beginners in
machine learning because of its simplicity.
2. Fast and Efficient:
o It is computationally efficient, making it ideal for large datasets. The training phase is
very fast since it only requires the calculation of probabilities.
3. Works Well with Small Datasets:
o Naïve Bayes often performs well with small datasets, especially when there is a clear
distinction between classes.
4. Handles Multi-class Problems:
o Naïve Bayes can be used for classification problems with more than two classes, making
it versatile for many applications.
5. Good Performance with High-dimensional Data:
o It performs well in high-dimensional spaces, like text classification tasks (e.g., spam
filtering or sentiment analysis), because each feature is treated independently.
6. Works Well with Categorical and Continuous Data:
o The algorithm can be used with both categorical and continuous data. For continuous
data, it assumes that the features follow a Gaussian distribution.
7. Robust to Irrelevant Features:
o Naïve Bayes can tolerate irrelevant features to some extent because it relies on
conditional independence between features, so irrelevant features have little impact on
the model’s performance.
8. Probabilistic Interpretation:
o It provides the probability of each class, which is valuable when you need to make
decisions based on the likelihood of an event.
Naïve Bayes is a great choice for quick, probabilistic classification problems, especially when dealing
with text data or problems where features are relatively independent. Its speed and simplicity make
it suitable for real-time applications, but its assumptions of independence and distribution might
hinder performance in more complex scenarios.
Instance-Based Learning (IBL)
Key Concepts:
Storage of Instances:
The algorithm stores the entire training dataset (or selected examples) in memory.
No explicit model is built during training.
Similarity Measure:
When a new instance is encountered, it is compared to the stored instances using a distance or
similarity metric (e.g., Euclidean distance, cosine similarity).
Prediction: The prediction for a new instance is based on the most similar stored instances. Common
strategies include:
• k-Nearest Neighbors (k-NN): The class of the majority of the k-nearest neighbors determines the
prediction for classification tasks. For regression tasks, it’s often the average or weighted
average of the k-nearest neighbors’ outputs.
• Case-Based Reasoning: Similar instances are retrieved and adapted to make predictions or
decisions, often used in expert systems.
Advantages:
• Simple and Intuitive: IBL is easy to understand and implement.
• No Training Phase: It doesn’t require complex training like other algorithms, since it simply
stores the data.
Disadvantages:
• High Memory Usage: It stores all instances, which can become computationally expensive with
large datasets.
• Slow Prediction Time: Since predictions are based on comparisons with stored instances, it can
be slower compared to other models, especially with large datasets.
IBL is particularly useful for problems where the data is sparse or non-linear, or where it is difficult to
build an explicit model, like in certain types of classification, regression, or recommendation
systems.
K-Nearest Neighbour (K-NN) Learning
Example: Customer Segmentation with K-NN
Steps:
Collect customer data such as age, spending patterns, purchase frequency, and product
categories bought.
Choose K (e.g., K = 3).
For a new customer, calculate the distance between this new customer's data and all other
customers in the training dataset using a distance metric like Euclidean distance.
Identify the K nearest neighbors (most similar customers).
Assign the customer to the most frequent segment among the K nearest neighbors.
Result: The new customer will be classified into a customer segment (e.g., high spender, bargain
shopper).
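The same steps can be sketched with scikit-learn's KNeighborsClassifier; the customer features, segment labels, and K = 3 below are invented purely for illustration.

```python
# K-NN sketch for the customer-segmentation example (toy data only).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Features: [age, monthly spend, purchases per month]
X_train = np.array([
    [25, 120, 2], [40, 900, 8], [35, 850, 7],
    [22,  80, 1], [50, 950, 9], [30, 100, 2],
])
y_train = ["bargain shopper", "high spender", "high spender",
           "bargain shopper", "high spender", "bargain shopper"]

knn = KNeighborsClassifier(n_neighbors=3)    # K = 3, Euclidean distance by default
knn.fit(X_train, y_train)                    # lazy learner: it simply stores the data

new_customer = np.array([[38, 880, 7]])
print(knn.predict(new_customer))             # most frequent segment among the 3 nearest neighbours
```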
Advantages of K-NN
Simple to understand and implement.
No training phase: It's a lazy learner, meaning it doesn't require an explicit training phase.
Works well with smaller datasets and when the decision boundary is non-linear.
Disadvantages of K-NN
Computationally expensive: It needs to compute distances for all data points during prediction,
which can be slow for large datasets.
Sensitive to irrelevant features and noisy data: The presence of irrelevant or noisy features can
significantly affect the performance of K-NN.
Choice of K: The algorithm's performance is highly sensitive to the choice of K.
Classification
Classification teaches a machine to sort things into categories.
It learns by looking at examples with labels (like emails marked “spam” or “not spam”).
After learning, it can decide which category new items belong to, like identifying if a new email is
spam or not.
In simpler terms, classification involves assigning input data into predefined categories or
classes.
After training, the model is evaluated using a quality metric (for example, accuracy). If the quality
metric is not satisfactory, the ML algorithm or its hyperparameters can be adjusted and the model is
retrained.
This iterative process continues until a satisfactory performance is achieved.
In short, classification in machine learning is all about using existing labelled data to teach the
model how to predict the class of new, unlabelled data based on the patterns it has learned.
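A rough sketch of this train-evaluate-retrain loop is shown below; the dataset, the decision-tree model, and accuracy as the quality metric are illustrative choices only.

```python
# Sketch: train a classifier, evaluate a quality metric, retrain with adjusted hyperparameters.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)                     # labelled examples
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for max_depth in (1, 3, 5):                                    # hyperparameter adjustment
    model = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"max_depth={max_depth}: accuracy={acc:.3f}")
```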
ML Classification
Supervised Learning:
Definition: In supervised learning, the algorithm learns from labelled training data. The goal is to
map input data to the correct output using examples that have known results.
Process: The algorithm is provided with pairs of inputs and outputs, and it learns to predict the
output for unseen inputs, as shown in the figure below.
Applications:
Classification: Categorizing data into predefined labels (e.g., email spam detection,
image recognition).
Regression: Predicting continuous values (e.g., house price prediction, stock market
forecasting).
Examples of Algorithms:
Linear Regression, Logistic Regression
Decision Trees, Random Forests
Support Vector Machines (SVM)
Neural Networks
Unsupervised Learning:
Definition: In unsupervised learning, the algorithm works with data that does not have labelled
outputs. The system tries to find patterns or structure within the data without prior knowledge
of what the output should be.
Process: The algorithm identifies hidden structures or relationships in the input data, such as
clustering similar data points or reducing dimensionality, as shown in the figure below.
Applications:
Clustering: Grouping similar data points (e.g., customer segmentation, anomaly
detection).
Dimensionality Reduction: Reducing the number of features while maintaining essential
information (e.g., principal component analysis - PCA).
Examples of Algorithms:
K-Means Clustering, Hierarchical Clustering
Principal Component Analysis (PCA)
DBSCAN, Gaussian Mixture Models
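To make the contrast between the two paradigms concrete, the sketch below runs a supervised classifier (which is given the labels) and an unsupervised clustering algorithm (which sees only the inputs) on the same data; the dataset and algorithms are chosen only for illustration.

```python
# Supervised vs. unsupervised on the same inputs.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: the labels y guide the learning.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("supervised training accuracy:", clf.score(X, y))

# Unsupervised: only X is given; the algorithm must discover structure on its own.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("cluster assignments (first 10 points):", clusters[:10])
```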
Reinforcement Learning:
Definition: In reinforcement learning (RL), the algorithm learns by interacting with an
environment. The agent (the model) takes actions and receives feedback in the form of rewards
or penalties, based on the outcomes of those actions.
Process: The goal is to learn a policy that maximizes the cumulative reward over time. RL is used
when the correct actions are not explicitly known, and the agent must learn through trial and
error, as shown in the figure below.
Applications:
Game Playing: Algorithms like AlphaGo or chess engines.
Robotics: Teaching robots how to move or manipulate objects.
Self-Driving Cars: Making driving decisions based on environmental interactions.
Examples of Algorithms:
Q-learning, Deep Q-Network (DQN)
Policy Gradient Methods, Actor-Critic Methods
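As a rough illustration of this trial-and-error loop, the sketch below runs tabular Q-learning on a tiny made-up chain environment; the environment, reward scheme, and hyperparameter values are assumptions chosen purely for illustration.

```python
# Tabular Q-learning sketch on a toy chain: states 0..4, actions 0=left / 1=right,
# reward 1 only for reaching the final state.
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.2        # learning rate, discount, exploration rate
rng = np.random.default_rng(0)

def step(state, action):
    next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward

for _ in range(500):                          # episodes of trial and error
    state = 0
    while state != n_states - 1:
        if rng.random() < epsilon:            # explore
            action = int(rng.integers(n_actions))
        else:                                 # exploit (ties broken at random)
            best = np.flatnonzero(Q[state] == Q[state].max())
            action = int(rng.choice(best))
        next_state, reward = step(state, action)
        # Q-learning update: move Q(s, a) toward reward + gamma * max_a' Q(s', a')
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print(np.round(Q, 2))   # action 1 (right) ends up with the higher value in every non-terminal state
```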
Semi-Supervised Learning:
Definition: Semi-supervised learning is a mix of supervised and unsupervised learning, where the
algorithm is provided with a small amount of labelled data and a large amount of unlabelled
data. It uses the labelled data to guide the learning of patterns from the unlabelled data.
Applications: This method is used when labelled data is scarce or expensive to obtain, but there
is a lot of unlabelled data available.
Examples of Algorithms: Semi-supervised SVM, Graph-based methods.
Self-Supervised Learning:
Definition: A form of unsupervised learning where the model learns to predict part of the input
data using other parts. It's a method that generates pseudo-labels from the data itself, and it's
often used in natural language processing and computer vision.
Applications: Commonly used for pre-training models that can later be fine-tuned with labelled
data (e.g., BERT for NLP, contrastive learning in computer vision).
Examples of Algorithms: Contrastive Learning, Auto-encoders
Summary of ML Classification:
Supervised Learning: Uses labelled data to learn and predict outputs (classification, regression).
Unsupervised Learning: Finds hidden patterns or structures in unlabelled data (clustering,
dimensionality reduction).
Reinforcement Learning: Learns through interactions with an environment and feedback (trial-
and-error learning).
Semi-Supervised Learning: A hybrid method with a small amount of labelled data and a large
amount of unlabelled data.
Self-Supervised Learning: A form of unsupervised learning that generates pseudo-labels from
the data itself.