AI Unit-4
RCA-403
Syllabus
UNIT-I INTRODUCTION: - Introduction to Artificial Intelligence, Foundations and
History of Artificial Intelligence, Applications of Artificial Intelligence, Intelligent
Agents, Structure of Intelligent Agents, Computer Vision, Natural Language
Processing.
UNIT-II INTRODUCTION TO SEARCH: - Searching for solutions, uninformed
search strategies, informed search strategies, Local search algorithms and optimization
problems, Adversarial Search, Search for Games, Alpha - Beta pruning.
UNIT-III KNOWLEDGE REPRESENTATION & REASONING: - Propositional
logic, Theory of first order logic, Inference in First order logic, Forward &
Backward chaining, Resolution, Probabilistic reasoning, Utility theory, Hidden
Markov Models (HMM), Bayesian Networks.
UNIT-IV MACHINE LEARNING: - Supervised and unsupervised learning,
Decision trees, Statistical learning models, learning with complete data -
Naive Bayes models, Learning with hidden data – EM algorithm,
Reinforcement learning.
What is learning?
• Learning is the process of gathering information and knowledge from past
experience and data analysis, and applying this information to enhance system
performance.
• Learning represents changes in a system that enable it to do the same task
more efficiently the next time.
Machine learning, a branch of artificial intelligence, concerns the construction and
study of systems that can learn from data. For example, a machine learning system
could be trained on email messages to learn to distinguish between spam and non-
spam messages. After learning, it can then be used to classify new email messages
into spam and non-spam folders.
Machine Learning
• The process of learning begins with observations or data, such as examples, direct
experience, or instruction, in order to look for patterns in data and make better
decisions in the future based on the examples that we provide. The primary aim is
to allow computers to learn automatically, without human intervention or
assistance, and to adjust their actions accordingly.
Machine Learning Methods
Supervised machine learning algorithms
• Supervised machine learning algorithms can apply what has been
learned in the past to new data using labeled examples to predict future
events. Starting from the analysis of a known training dataset, the learning
algorithm produces an inferred function to make predictions about the
output values. The system is able to provide targets for any new input after
sufficient training. The learning algorithm can also compare its output with
the correct, intended output and find errors in order to modify the model
accordingly.
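As a minimal sketch of this fit-then-predict workflow (the tiny feature matrix, the labels, and the choice of a decision tree classifier are illustrative assumptions; scikit-learn is assumed to be installed):

# Supervised learning sketch: learn from labeled examples, then predict labels for new inputs.
from sklearn.tree import DecisionTreeClassifier

# Invented labeled training data: each row is a feature vector with a known target label.
X_train = [[0, 0], [0, 1], [1, 0], [1, 1]]
y_train = ["spam", "spam", "not spam", "not spam"]

model = DecisionTreeClassifier()
model.fit(X_train, y_train)      # the learning algorithm produces an inferred function

X_new = [[1, 1], [0, 0]]
print(model.predict(X_new))      # predicted targets for new, unseen inputs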
Machine Learning: Unsupervised ML
• Unsupervised machine learning algorithms are used when the information used for training is neither classified nor labeled. Instead of being given correct outputs, the system explores the data and infers a function to describe hidden structure in unlabeled data, for example by grouping similar examples together.
Machine Learning: Supervised vs. Unsupervised ML
Machine Learning: Reinforcement ML
You cannot apply a reinforcement learning model in every situation. Here are
some conditions in which you should not use a reinforcement learning model:
• When you have enough data to solve the problem with a supervised
learning method
• Keep in mind that Reinforcement Learning is computation-heavy and
time-consuming, in particular when the action space is large.
Applications of Reinforcement Learning
Here are the major challenges you will face while doing Reinforcement
Learning (a small sketch after this list makes the underlying notions of states, actions and rewards concrete):
• Feature/reward design, which can be very involved.
• Parameters may affect the speed of learning.
• Realistic environments can have partial observability.
• Too much Reinforcement may lead to an overload of states which
can diminish the results.
• Realistic environments can be non-stationary.
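The challenges above refer to states, actions, rewards and the size of the action space. As a standalone illustration (the chain environment, the parameter values and the episode count are all invented and not part of the original material), here is a minimal tabular Q-learning sketch that makes those terms concrete:

import random

# Tabular Q-learning sketch on an invented 5-state chain: states 0..4, actions -1 (left)
# and +1 (right), and a reward of 1 only for reaching state 4. All values are illustrative.
n_states = 5
actions = (-1, +1)
alpha, gamma, epsilon = 0.5, 0.9, 0.5   # learning rate, discount factor, exploration rate
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}

for episode in range(300):
    s = 0
    while s != n_states - 1:
        if random.random() < epsilon:
            a = random.choice(actions)                      # explore a random action
        else:
            a = max(actions, key=lambda act: Q[(s, act)])   # exploit the current estimates
        s_next = min(max(s + a, 0), n_states - 1)
        reward = 1.0 if s_next == n_states - 1 else 0.0
        # Q-learning update: Q(s,a) += alpha * (reward + gamma * max_a' Q(s',a') - Q(s,a))
        best_next = max(Q[(s_next, act)] for act in actions)
        Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
        s = s_next

# After training, the greedy action in every non-terminal state should be +1 (toward the reward).
print({s: max(actions, key=lambda act: Q[(s, act)]) for s in range(n_states - 1)})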
Reinforcement Learning vs. Supervised Learning
Decision Trees
• A decision tree whose outcome is a categorical variable, such as ‘fit’ or ‘unfit’, is called a classification tree; here the decision variable is categorical.
• In a regression tree, the decision or outcome variable is continuous, e.g. a number like 123.
• Working: Now that we know what a Decision Tree is, we will see how it works internally. There are many algorithms that construct Decision Trees, but one of the best known is the ID3 algorithm. ID3 stands for Iterative Dichotomiser 3. Before discussing the ID3 algorithm, we will go through a few definitions.
• Entropy: Entropy, also called Shannon entropy and denoted by H(S) for a finite set S, is the measure of the amount of uncertainty or randomness in the data:
H(S) = - Σ p(x) · log2 p(x), where the sum runs over the classes x in S and p(x) is the proportion of examples belonging to class x.
• Intuitively, it tells us about the predictability of a certain event. For example, consider a coin toss whose
probability of heads is 0.5 and probability of tails is 0.5. Here the entropy is the highest possible, since
there is no way of determining what the outcome might be. Alternatively, consider a coin which has heads
on both sides; the outcome of such a toss can be predicted perfectly since we know beforehand that
it will always be heads. In other words, this event has no randomness, hence its entropy is zero.
• Information Gain: Information gain, also called Kullback-Leibler divergence and denoted by IG(S, A) for a
set S, is the effective change in entropy after deciding on a particular attribute A. It measures the relative
change in entropy with respect to the independent variables:
IG(S, A) = H(S) - Σ (|Sv| / |S|) · H(Sv),
where the sum runs over the values v of attribute A and Sv is the subset of S for which A has value v. Here IG(S, A) is the information gain obtained by applying feature A, H(S) is the entropy of the entire set, and the second term calculates the entropy after applying the feature A.
• Let us understand this with the help of an example. Consider a piece of data collected over the course of 14 days, where
the features are Outlook, Temperature, Humidity and Wind, and the outcome variable is whether golf was played on the day.
Now, our job is to build a predictive model which takes the above 4 parameters and predicts whether golf will be played
on the day. We will build a decision tree to do that using the ID3 algorithm.
The ID3 algorithm proceeds as follows (a small sketch applying these steps follows the list):
1. Create a root node for the tree.
2. If all examples are positive, return a leaf node labelled ‘positive’.
3. Else, if all examples are negative, return a leaf node labelled ‘negative’.
4. Calculate the entropy of the current state, H(S).
5. For each attribute, calculate the entropy with respect to the attribute ‘x’, denoted by H(S, x).
6. Select the attribute which has the maximum value of IG(S, x).
7. Remove the attribute that offers the highest IG from the set of attributes.
8. Repeat until we run out of attributes, or the decision tree has all leaf nodes.
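Below is a small, self-contained Python sketch of these ideas (the helper names and the mini dataset are our own; the rows are illustrative and are not the actual 14-day data). It computes entropy, checks the coin-toss intuition, and then performs the attribute-selection step of ID3 (steps 4 to 6 above) on a tiny golf-style table:

from collections import Counter
from math import log2

def entropy(labels):
    # Shannon entropy H(S): sum over classes x of -p(x) * log2 p(x)
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, target="Play"):
    # IG(S, attr) = H(S) minus the weighted entropy of the subsets produced by splitting on attr
    n = len(rows)
    remainder = 0.0
    for v in set(r[attr] for r in rows):
        subset = [r[target] for r in rows if r[attr] == v]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy([r[target] for r in rows]) - remainder

# Coin-toss intuition from the entropy definition above
print(entropy(["H", "T"]))            # fair coin: 1.0, maximum uncertainty
print(entropy(["H", "H", "H", "H"]))  # two-headed coin: 0.0, no randomness

# Invented mini golf dataset (outcome variable: Play)
data = [
    {"Outlook": "Sunny",    "Wind": "Weak",   "Play": "No"},
    {"Outlook": "Sunny",    "Wind": "Strong", "Play": "No"},
    {"Outlook": "Overcast", "Wind": "Weak",   "Play": "Yes"},
    {"Outlook": "Overcast", "Wind": "Strong", "Play": "Yes"},
    {"Outlook": "Rain",     "Wind": "Weak",   "Play": "Yes"},
    {"Outlook": "Rain",     "Wind": "Strong", "Play": "No"},
]

# ID3 root selection: compute IG for every attribute and split on the one with the highest gain
gains = {attr: information_gain(data, attr) for attr in ("Outlook", "Wind")}
print(gains)
print("Split on:", max(gains, key=gains.get))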
Statistical learning models
• Statistical learning theory is a framework for machine learning drawing from the
fields of statistics and functional analysis. Statistical learning theory deals with the
problem of finding a predictive function based on data.
• Statistical learning focuses on calculating the probability of each hypothesis and
making predictions accordingly.
• Statistical learning theory has led to successful applications in fields such as
computer vision, speech recognition, bioinformatics etc.
• Maximum likelihood estimation (MLE) is a method of estimating the parameters
of a statistical model so the observed data is most probable. MLE attempts to find
the parameter values that maximize the likelihood function, given the
observations. The resulting estimate is called a maximum likelihood estimate,
which is also abbreviated as MLE.
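As a small numeric sketch (the coin flips below are invented data), the MLE of a coin's heads probability is simply the observed fraction of heads; checking nearby parameter values confirms that this value maximizes the likelihood:

from math import log

# MLE sketch: estimate the heads probability p of a coin from observed flips (invented data).
flips = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]      # 1 = heads, 0 = tails
p_mle = sum(flips) / len(flips)             # for a Bernoulli model the MLE is the sample mean
print("MLE of p:", p_mle)                   # 0.7

def log_likelihood(p, data):
    # log-likelihood of Bernoulli observations under parameter p
    return sum(log(p) if x == 1 else log(1 - p) for x in data)

for p in (0.5, 0.6, 0.7, 0.8):
    print(p, round(log_likelihood(p, flips), 3))   # highest at p = 0.7, the MLE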
• Machine learning is all about results; it is like working in a company
where your worth is characterized solely by your performance.
• Statistical modeling, on the other hand, is more about finding relationships
between variables and the significance of those relationships, while at the
same time also catering for prediction.
Naïve Bayes’ Model
The Naïve Bayes algorithm learns the probability of an object with certain features
belonging to a particular group/class. It is named after the statistician and
philosopher Thomas Bayes, whose “Bayes’ Theorem” is the basis of the Naïve
Bayes model. More formally, Bayes’ Theorem is stated as the following equation:
P(A|B) = P(B|A) · P(A) / P(B)
Where,
• P(A|B): probability (conditional probability) of occurrence of event A given the
event B is true.
• P(A) and P(B): probabilities of occurrence of event A and B respectively.
• P(B|A): Probability of occurrence of event B given the event A is true.
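As a quick numeric sketch (all the probabilities below are invented for illustration), Bayes' Theorem can be applied directly:

# Bayes' Theorem with invented numbers:
# A = "email is spam", B = "email contains the word 'offer'"
p_b_given_a = 0.60   # P(B|A): probability that a spam email contains "offer"
p_a = 0.20           # P(A): prior probability that an email is spam
p_b = 0.15           # P(B): probability that any email contains "offer"

p_a_given_b = p_b_given_a * p_a / p_b   # P(A|B) = P(B|A) * P(A) / P(B)
print(p_a_given_b)                      # 0.8, i.e. an 80% chance the email is spam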
• For example, a fruit may be considered to be an apple if it is
red, round, and about 3 inches in diameter. Even if these
features depend on each other or upon the existence of the
other features, all of these properties independently contribute
to the probability that this fruit is an apple and that is why it is
known as ‘Naive’.
Naïve Bayes’ Model
For a class c and a predictor (set of attributes) x, Bayes’ Theorem gives the posterior P(c|x) = P(x|c) · P(c) / P(x), where:
•P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
•P(c) is the prior probability of class.
•P(x|c) is the likelihood which is the probability of predictor given class.
•P(x) is the prior probability of predictor.
How does the Naive Bayes algorithm work?
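As a minimal sketch (the tiny categorical dataset below is invented for illustration), a Naive Bayes classifier estimates the prior P(c) and the per-feature likelihoods P(x_i|c) from counts, multiplies them together for each class, and predicts the class with the highest product:

from collections import Counter, defaultdict

# Naive Bayes sketch on an invented categorical dataset: features = (Outlook, Wind), class = Play.
data = [
    (("Sunny", "Weak"),      "No"),
    (("Sunny", "Strong"),    "No"),
    (("Overcast", "Weak"),   "Yes"),
    (("Rain", "Weak"),       "Yes"),
    (("Rain", "Strong"),     "No"),
    (("Overcast", "Strong"), "Yes"),
]

# Priors P(c) and per-feature likelihoods P(x_i|c) estimated from counts.
class_counts = Counter(label for _, label in data)
feature_counts = defaultdict(Counter)          # (feature_index, class) -> value counts
for features, label in data:
    for i, value in enumerate(features):
        feature_counts[(i, label)][value] += 1

def predict(features):
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / len(data)                              # prior P(c)
        for i, value in enumerate(features):
            score *= feature_counts[(i, c)][value] / n_c     # likelihood P(x_i|c)
        scores[c] = score                                    # proportional to P(c|x)
    return max(scores, key=scores.get), scores

print(predict(("Sunny", "Weak")))

Note that if a feature value never occurs with a class in the training data (for example "Sunny" together with "Yes" above), its likelihood is zero and the whole product collapses to zero; this is exactly the zero-frequency problem discussed under Cons below.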
Pros:
• It is easy and fast to predict the class of a test data set. It also
performs well in multi-class prediction.
• When the assumption of independence holds, a Naive Bayes
classifier performs better compared to other models like logistic
regression, and you need less training data.
• It performs well with categorical input variables compared
to numerical variable(s). For numerical variables, a normal
distribution is assumed (a bell curve, which is a strong
assumption).
Cons:
•If a categorical variable has a category (in the test data set) which was not observed in the
training data set, then the model will assign it a 0 (zero) probability and will be unable to
make a prediction. This is often known as the “Zero Frequency” problem. To solve this, we can use
a smoothing technique; one of the simplest smoothing techniques is called Laplace
estimation (see the small illustration after this list).
•On the other side, Naive Bayes is also known to be a bad estimator, so the probability
outputs from predict_proba are not to be taken too seriously.
•Another limitation of Naive Bayes is the assumption of independent predictors. In real
life, it is almost impossible to get a set of predictors which are completely
independent.
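As a tiny illustration (the counts are invented), Laplace (add-one) smoothing replaces the raw frequency estimate so that a category unseen during training never receives exactly zero probability:

# Zero-frequency problem and Laplace (add-one) smoothing, with invented counts
count_in_class = 0      # how often this category was seen with this class in training: never
total_in_class = 10     # training examples of this class
num_categories = 3      # number of possible values of the categorical variable

raw = count_in_class / total_in_class                                  # 0.0, wipes out the whole product
smoothed = (count_in_class + 1) / (total_in_class + num_categories)    # about 0.077, never zero
print(raw, smoothed)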
Applications of Naive Bayes Algorithms
• Real-time Prediction: Naive Bayes is an eager learning classifier and it is very fast. Thus, it can be used for making
predictions in real time.
• Multi-class Prediction: This algorithm is also well known for its multi-class prediction feature. Here we can predict the
probability of multiple classes of the target variable.
• Text classification / Spam Filtering / Sentiment Analysis: Naive Bayes classifiers, mostly used in text classification (due to
better results in multi-class problems and the independence rule), have a higher success rate as compared to other algorithms. As a
result, Naive Bayes is widely used in spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, to identify
positive and negative customer sentiments).
• Recommendation System: A Naive Bayes classifier and Collaborative Filtering together build a recommendation system
that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a
given resource or not.
Learning with hidden data - EM algorithm
• In the real-world applications of machine learning, it is very common that there
are many relevant features available for learning but only a small subset of them
are observable.
• So, for variables which are sometimes observable and sometimes not, we
can use the instances when the variable is observed for the purpose of learning,
and then predict its value in the instances when it is not observable.
• On the other hand, Expectation-Maximization algorithm can be used for the latent
variables (variables that are not directly observable and are actually inferred from
the values of the other observed variables) too in order to predict their values with
the condition that the general form of probability distribution governing those latent
variables is known to us.
• This algorithm is actually at the base of many unsupervised clustering algorithms
in the field of machine learning.
• It was explained, proposed and given its name in a paper published in 1977 by
Arthur Dempster, Nan Laird, and Donald Rubin. It is used to find the local
maximum likelihood parameters of a statistical model in the cases where latent
variables are involved and the data is missing or incomplete.
Algorithm:
• Given a set of incomplete data, consider a set of starting parameters.
• Expectation step (E – step): Using the observed available data of the dataset,
estimate (guess) the values of the missing data.
• Maximization step (M – step): Complete data generated after the expectation (E)
step is used in order to update the parameters.
• Repeat step 2 and step 3 until convergence.
• The essence of Expectation-Maximization algorithm is to use the available
observed data of the dataset to estimate the missing data and then using that data to
update the values of the parameters. Let us understand the EM algorithm in detail.
• Initially, a set of initial values of the parameters are considered. A set of incomplete
observed data is given to the system with the assumption that the observed data
comes from a specific model.
• The next step is known as “Expectation” – step or E-step. In this step, we use the
observed data in order to estimate or guess the values of the missing or incomplete
data. It is basically used to update the variables.
• The next step is known as “Maximization”-step or M-step. In this step, we use the
complete data generated in the preceding “Expectation” – step in order to update
the values of the parameters. It is basically used to update the hypothesis.
• Now, in the fourth step, it is checked whether the values are converging or not; if
yes, then stop, otherwise repeat step 2 and step 3, i.e. the “Expectation” step and the
“Maximization” step, until convergence occurs (a concrete sketch follows below).
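As a concrete sketch (the data points, the starting means, and the fixed equal variances are simplifying assumptions, not taken from the original material), EM for a two-component, one-dimensional Gaussian mixture alternates exactly these two steps: the E-step computes each point's expected, "soft" component membership, and the M-step re-estimates the component means from those memberships:

import math

# EM sketch: fit the means of a two-component, one-dimensional Gaussian mixture.
# The data, the starting means, the fixed equal variance and the equal mixing weights
# are all invented, simplifying assumptions for illustration.
data = [1.0, 1.2, 0.8, 4.9, 5.1, 5.3, 1.1, 5.0]
mu = [0.0, 6.0]          # initial guesses for the two hidden component means
sigma = 1.0

def pdf(x, m):
    # Gaussian density with mean m and standard deviation sigma
    return math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

for iteration in range(20):
    # E-step: for each point, the responsibility (soft membership) of component 0
    resp = [pdf(x, mu[0]) / (pdf(x, mu[0]) + pdf(x, mu[1])) for x in data]
    # M-step: re-estimate each mean as the responsibility-weighted average of the data
    mu[0] = sum(r * x for r, x in zip(resp, data)) / sum(resp)
    mu[1] = sum((1 - r) * x for r, x in zip(resp, data)) / sum(1 - r for r in resp)

print([round(m, 3) for m in mu])   # the means converge near the two cluster centres (about 1.0 and 5.1)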
Usage of EM algorithm
• It can be used to fill the missing data in a sample.
• It can be used as the basis of unsupervised learning of clusters.
• It can be used for the purpose of estimating the parameters of Hidden Markov
Model (HMM).
• It can be used for discovering the values of latent variables.
Advantages of EM algorithm
• It is always guaranteed that likelihood will increase with each iteration.
• The E-step and M-step are often pretty easy for many problems in terms of
implementation.
• Solutions to the M-step often exist in closed form.
Disadvantages of EM algorithm
• It has slow convergence.
• It converges to a local optimum only.
• It requires both the forward and backward probabilities (numerical optimization
requires only the forward probability).
Learning with hidden data - EM algorithm
How it works?
From the given data, EM learns a theory that specifies how examples are to be
classified and how to predict the feature values of each class. It starts from a
random classification of the data and repeats the following two steps until a clear, stable result is formed.
1. E-step: classify the data using the current theory, i.e., the E-step generates the expected
classification for each example.
2. M-step: generate the best theory using the current classification of the data, i.e., the M-
step generates the most likely theory given the classified data.