Machine Learning: B. Tech III Year - I Sem (R18)
Machine Learning
Lecture Notes
Prepared by
Mrs. Swapna
(Professor & HOD-CSM)
Dept. of CSE (AIML)
Course Objectives:
To explain machine learning techniques such as decision tree learning and Bayesian learning.
To understand computational learning theory.
To study pattern comparison techniques.
Course Outcomes:
Understand the concepts of computational intelligence such as machine learning.
Acquire the skill to apply machine learning techniques to address real-time problems in different areas.
Understand neural networks and their usage in machine learning applications.
UNIT - I
Introduction - Well-posed learning problems, designing a learning system, Perspectives and issues in
machine learning
Concept learning and the general to specific ordering – introduction, a concept learning task, concept
learning as search, find-S: finding a maximally specific hypothesis, version spaces and the candidate
elimination algorithm, remarks on version spaces and candidate elimination, inductive bias.
Decision Tree Learning – Introduction, decision tree representation, appropriate problems for decision
tree learning, the basic decision tree learning algorithm, hypothesis space search in decision tree
learning, inductive bias in decision tree learning, issues in decision tree learning.
UNIT - II
Artificial Neural Networks-1– Introduction, neural network representation, appropriate problems for
neural network learning, perceptrons, multilayer networks and the back-propagation algorithm.
Artificial Neural Networks-2- Remarks on the Back-Propagation algorithm, An illustrative example:
face recognition, advanced topics in artificial neural networks.
Evaluation Hypotheses – Motivation, estimation hypothesis accuracy, basics of sampling theory, a
general approach for deriving confidence intervals, difference in error of two hypotheses, comparing
learning algorithms.
UNIT - III
Bayesian learning – Introduction, Bayes theorem, Bayes theorem and concept learning, Maximum
Likelihood and least squared error hypotheses, maximum likelihood hypotheses for predicting
probabilities, minimum description length principle, Bayes optimal classifier, Gibbs algorithm, Naïve
Bayes classifier, an example: learning to classify text, Bayesian belief networks, the EM algorithm.
Computational learning theory – Introduction, probably learning an approximately correct hypothesis,
sample complexity for finite hypothesis space, sample complexity for infinite hypothesis spaces, the
mistake bound model of learning.
Instance-Based Learning- Introduction, k-nearest neighbour algorithm, locally weighted regression,
radial basis functions, case-based reasoning, remarks on lazy and eager learning.
UNIT- IV
Genetic Algorithms – Motivation, Genetic algorithms, an illustrative example, hypothesis space
search, genetic programming, models of evolution and learning, parallelizing genetic algorithms.
UNIT - V
Analytical Learning-1- Introduction, learning with perfect domain theories: PROLOG-EBG, remarks
on explanation-based learning, explanation-based learning of search control knowledge.
Analytical Learning-2-Using prior knowledge to alter the search objective, using prior knowledge to
augment search operators.
Combining Inductive and Analytical Learning – Motivation, inductive-analytical approaches to
learning, using prior knowledge to initialize the hypothesis.
TEXT BOOK:
1. Machine Learning – Tom M. Mitchell, McGraw-Hill (MGH).
REFERENCE BOOK:
1. Machine Learning: An Algorithmic Perspective, Stephen Marsland, Taylor & Francis.
The term "Machine Learning" was coined in 1959 by Arthur Samuel, an American pioneer in the fields of computer gaming and artificial intelligence, who stated that it gives computers the ability to learn without being explicitly programmed.
Ever since computers were invented, we have wondered whether they might be made to learn.
A successful understanding of how to make computers learn would open up many new uses of computers and new levels of competence and customization. Machine learning methods are broadly classified into three categories:
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
1. Supervised Learning
Supervised learning is a type of machine learning method in which we provide sample labeled data to
the machine learning system in order to train it, and on that basis, it predicts the output.
The goal of supervised learning is to map input data to output data. Supervised learning is based on supervision; it is similar to a student learning under the supervision of a teacher.
Example of supervised learning:
Supervised learning, as the name indicates, involves the presence of a supervisor acting as a teacher. Basically, supervised learning means we teach or train the machine using data that is well labeled, which means some data is already tagged with the correct answer. After that, the machine is provided with a new set of examples (data) so that the supervised learning algorithm analyses the training data (set of training examples) and produces a correct outcome from the labeled data.
For instance, suppose you are given a basket filled with different kinds of fruits. Now the first step is to
train the machine with all different fruits one by one like this:
If the shape of the object is rounded and has a depression at the top, is red in color, then it will be
labeled as –Apple.
If the shape of the object is a long curving cylinder having Green-Yellow color, then it will be labeled as
–Banana.
Now suppose that, after training, the machine is given a new fruit from the basket, say a banana, and asked to identify it.
Since the machine has already learned from the previous data, it will classify the fruit by its shape and color, confirm the fruit name as BANANA, and put it in the Banana category. Thus the machine learns from the training data (the basket of fruits) and then applies that knowledge to the test data (the new fruit).
Classification: A classification problem is when the output variable is a category, such as “Red” or
“blue” or “disease” and “no disease”.
Regression: A regression problem is when the output variable is a real value, such as “dollars” or
“weight”.
Supervised learning deals with or learns with “labeled” data. This implies that some data is already tagged
with the correct answer.
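To make the classification idea concrete, here is a minimal illustrative sketch (an assumption of these notes, not part of the syllabus) using the scikit-learn Python library; the fruit feature values and labels are invented for illustration.

# Minimal supervised-learning sketch: training a classifier on labeled fruit data.
# The feature values (roundness, red-ness) and the labels are invented for illustration.
from sklearn.tree import DecisionTreeClassifier

# Each training example: [roundness (0-1), red-ness (0-1)]; the label is the correct answer.
X_train = [[0.9, 0.8], [0.95, 0.9], [0.1, 0.2], [0.15, 0.1]]
y_train = ["Apple", "Apple", "Banana", "Banana"]

model = DecisionTreeClassifier()
model.fit(X_train, y_train)          # learn the mapping from features to labels

# A new, unseen fruit (long shape, yellow-green colour) is classified using what was learned.
print(model.predict([[0.2, 0.15]]))  # expected output: ['Banana']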
2) Unsupervised Learning
Unsupervised learning is a learning method in which a machine learns without any supervision. The algorithm learns from plain examples without any associated response, leaving it to the algorithm to determine the data patterns on its own. This type of algorithm tends to restructure the data into something else, such as new features that may represent a class or a new series of uncorrelated values.
Unsupervised learning can be further grouped into two types:
o Clustering
o Association
These methods structure the input data into new features or into groups of objects with similar patterns.
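As a rough illustration of clustering (assuming the scikit-learn library is available; the two-dimensional points below are invented), note that no labels are supplied, so the algorithm must discover the groups on its own:

# Minimal unsupervised-learning sketch: grouping unlabeled points with k-means clustering.
from sklearn.cluster import KMeans

# Unlabeled data: two loose groups of 2-D points (values invented for illustration).
X = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
     [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)   # the algorithm assigns each point to a discovered cluster
print(labels)                    # e.g. [0 0 0 1 1 1] - group identities, not predefined classes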
Reinforcement learning: Here you present the algorithm with examples that lack labels, as in unsupervised learning, but you accompany each example with positive or negative feedback according to the solution the algorithm proposes. Reinforcement learning is connected to applications for which the algorithm must make decisions (so the product is prescriptive, not just descriptive, as in unsupervised learning), and the decisions bear consequences. In the human world, it is just like learning by trial and error.
Errors help you learn because they carry a penalty (cost, loss of time, regret, pain, and so on), teaching you that a certain course of action is less likely to succeed than others. An interesting example of reinforcement learning occurs when computers learn to play video games by themselves.
Reinforcement learning is a feedback-based learning method in which a learning agent gets a reward for each right action and a penalty for each wrong action. The agent learns automatically from this feedback and improves its performance.
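The following toy sketch (purely illustrative, using only the Python standard library; the environment and reward function are invented) shows the reward/penalty idea: an agent tries actions, receives +1 or -1 feedback, and gradually prefers the action that earns rewards.

# Toy reinforcement-learning sketch: learning by trial and error from rewards and penalties.
import random

actions = ["left", "right"]
value = {a: 0.0 for a in actions}     # the agent's current estimate of each action's worth
alpha = 0.1                           # learning rate

def feedback(action):
    # Hypothetical environment: "right" is the correct action (+1 reward), "left" is penalised (-1).
    return 1.0 if action == "right" else -1.0

for step in range(200):
    # Explore occasionally; otherwise exploit the action currently believed to be best.
    if random.random() < 0.1:
        a = random.choice(actions)
    else:
        a = max(actions, key=lambda act: value[act])
    r = feedback(a)                           # reward (+1) or penalty (-1)
    value[a] += alpha * (r - value[a])        # move the estimate toward the observed feedback

print(value)   # after training, value["right"] approaches +1 while value["left"] stays low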
Definition: A computer program is said to learn from experience E with respect to some class
of tasks T and performance measure P, if its performance at tasks in T, as measured by P,
improves with experience E.
Example: playing checkers.
E = the experience of playing many games of checkers
T = the task of playing checkers.
P = the probability that the program will win the next game
Examples
1. Checkers game: A computer program that learns to play checkers might improve its
performance as measured by its ability to win at the class of tasks involving playing
checkers games, through experience obtained by playing games against itself.
The first design choice is to choose the type of training experience from which the
system will learn.
The type of training experience available can have a significant impact on the success or failure of the learner. There are three attributes that impact the success or failure of the learner:
1. Whether the training experience provides direct or indirect feedback regarding the
choices made by the performance system.
Indirect training examples consist of the move sequences and final outcomes of
various games played. The information about the correctness of specific moves early in
the game must be inferred indirectly from the fact that the game was eventually won or
lost.
Here the learner faces an additional problem of credit assignment, or determining the
degree to which each move in the sequence deserves credit or blame for the final
outcome. Credit assignment can be a particularly difficult problem because the game
can be lost even when early moves are optimal, if these are followed later by poor
moves.
Hence, learning from direct training feedback is typically easier than learning from
indirect feedback.
2. The degree to which the learner controls the sequence of training examples.
For example, the learner might rely on the teacher to select informative board states and provide the correct move for each. Alternatively, the learner might itself propose board states that it finds particularly confusing and ask the teacher for the correct move. Or the learner may have complete control over both the board states and (indirect) training classifications, as it does when it learns by playing against itself with no teacher present.
3. How well it represents the distribution of examples over which the final system
performance P must be measured
If its training experience E consists only of games played against itself, there is a danger
that this training experience might not be fully representative of the distribution of
situations over which it will later be tested.
In practice, it is often necessary to learn from a distribution of examples that is somewhat different from the one over which the final system will be evaluated.
2. Choosing the Target Function
The next design choice is to determine exactly what type of knowledge will be learned and
how this will be used by the performance program.
Let’s consider a checkers-playing program that can generate the legal moves from any board
state.
The program needs only to learn how to choose the best move from among these legal moves. Since we must learn to choose among the legal moves, the most obvious choice for the type of information to be learned is a program, or function, that chooses the best move for any given board state:
ChooseMove : B → M
which indicates that this function accepts as input any board from the set of legal board states B and produces as output some move from the set of legal moves M.
ChooseMove is a choice for the target function in the checkers example, but this function will turn out to be very difficult to learn given the kind of indirect training experience available to our system. An alternative target function is an evaluation function
V : B → R
which denotes that V maps any legal board state from the set B to some real value. We intend for this target function V to assign higher scores to better board states. If the system can successfully learn such a target function V, then it can easily use it to select the best move from any current board position.
Let us define the target value V(b) for an arbitrary board state b in B, as follows:
If b is a final board state that is won, then V(b) = 100
If b is a final board state that is lost, then V(b) = -100
If b is a final board state that is drawn, then V(b) = 0
If b is not a final state in the game, then V(b) = V(b'), where b' is the best final board state that can be achieved starting from b and playing optimally until the end of the game.
Let us choose a simple representation: for any given board state, the approximate target function V̂(b) will be calculated as a linear combination of the following board features:
x1: the number of black pieces on the board
x2: the number of red pieces on the board
x3: the number of black kings on the board
x4: the number of red kings on the board
x5: the number of black pieces threatened by red (i.e., which can be captured on red's next turn)
x6: the number of red pieces threatened by black
Thus, V̂(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6, where w0 through w6 are numerical coefficients, or weights, chosen by the learning algorithm.
In order to learn the target function V̂ we require a set of training examples, each describing a specific board state b and the training value Vtrain(b) for b.
For instance, the following training example describes a board state b in which black has won the game (note x2 = 0 indicates that red has no remaining pieces) and for which the target function value Vtrain(b) is therefore +100:
((x1 = 3, x2 = 0, x3 = 1, x4 = 0, x5 = 0, x6 = 0), +100)
1. Derive training examples from the indirect training experience available to the learner
2. Adjust the weights wi to best fit these training examples
A simple approach for estimating training values for intermediate board states is to assign the training value Vtrain(b) for any intermediate board state b to be V̂(Successor(b)):
Vtrain(b) ← V̂(Successor(b))
where V̂ is the learner's current approximation to V, and Successor(b) denotes the next board state following b for which it is again the program's turn to move.
To learn the weights, we define the best hypothesis as the one that minimizes the squared error E between the training values and the values predicted by the current hypothesis:
E ≡ Σ (Vtrain(b) − V̂(b))², where the sum is taken over all training examples (b, Vtrain(b)).
Several algorithms are known for finding weights of a linear function that minimize E. One such algorithm is called the least mean squares, or LMS, training rule. For each observed training example it adjusts the weights a small amount in the direction that reduces the error on this training example.
LMS weight update rule: For each training example (b, Vtrain(b)):
Use the current weights to calculate V̂(b)
For each weight wi, update it as
wi ← wi + η (Vtrain(b) − V̂(b)) xi
Here η is a small constant (e.g., 0.1) that moderates the size of the weight update.
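A minimal sketch of the LMS rule in Python (illustrative only; it assumes the six-feature board representation described above, and the training example values are invented):

# LMS weight update sketch for the linear checkers evaluation function
#   V_hat(b) = w0 + w1*x1 + ... + w6*x6
eta = 0.1                                  # learning rate

def v_hat(weights, features):
    # features = (x1, ..., x6); weights = [w0, w1, ..., w6]
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))

def lms_update(weights, features, v_train):
    error = v_train - v_hat(weights, features)       # (Vtrain(b) - V_hat(b))
    weights[0] += eta * error * 1.0                   # x0 is taken to be 1 for the bias weight w0
    for i, x in enumerate(features, start=1):
        weights[i] += eta * error * x                 # wi <- wi + eta * error * xi
    return weights

# One invented training example: black has 3 pieces and 1 king, red has none -> Vtrain = +100
weights = [0.0] * 7
weights = lms_update(weights, (3, 0, 1, 0, 0, 0), 100)
print(weights)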
1. The Performance System is the module that must solve the given performance task by
using the learned target function(s). It takes an instance of a new problem (new game)
as input and produces a trace of its solution (game history) as output.
2. The Critic takes as input the history or trace of the game and produces as output a set
of training examples of the target function
3. The Generalizer takes as input the training examples and produces an output
hypothesis that is its estimate of the target function. It generalizes from the specific
training examples, hypothesizing a general function that covers these examples and
other cases beyond the training examples.
4. The Experiment Generator takes as input the current hypothesis and outputs a new
problem (i.e., initial board state) for the Performance System to explore. Its role is to
pick new practice problems that will maximize the learning rate of the overall system.
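These four modules can be pictured as the following rough skeleton (a sketch of the decomposition only, not an actual checkers implementation; all class and method names are invented for illustration):

# Skeleton of the final design: four cooperating modules (names invented for illustration).
class PerformanceSystem:
    def play(self, initial_board, hypothesis):
        """Solve a new problem (play one game) using the learned target function; return the game trace."""
        ...

class Critic:
    def make_training_examples(self, game_trace):
        """Turn the game history into (board state, Vtrain) training examples."""
        ...

class Generalizer:
    def fit(self, training_examples):
        """Produce a hypothesis (e.g. LMS-fitted weights) that generalizes beyond the examples."""
        ...

class ExperimentGenerator:
    def next_problem(self, hypothesis):
        """Propose a new initial board state intended to be informative for further learning."""
        ...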
Although machine learning is being used in every industry and helps organizations make more informed and data-driven choices that are more effective than classical methodologies, it still has many problems that cannot be ignored. Here are some common issues in Machine Learning that professionals face while building ML skills and creating applications from scratch.
The major issue that arises while using machine learning algorithms is a lack of quality as well as quantity of data. Although data plays a vital role in the processing of machine learning algorithms, many data scientists report that inadequate, noisy, and unclean data is extremely taxing for machine learning algorithms. For example, a simple task requires thousands of sample data points, and an advanced task such as speech or image recognition needs millions of sample data examples. Further, the data itself may be problematic:
o Noisy Data- It is responsible for an inaccurate prediction that affects the decision as well as
accuracy in classification tasks.
o Incorrect data- It is also responsible for faulty programming and results obtained in machine
learning models. Hence, incorrect data may affect the accuracy of the results also.
o Generalizing of output data- Sometimes, it is also found that generalizing output data becomes
complex, which results in comparatively poor future actions.
As we have discussed above, data plays a significant role in machine learning, and it must be of good
quality as well. Noisy data, incomplete data, inaccurate data, and unclean data lead to less accuracy in
classification and low-quality results. Hence, data quality can also be considered as a major common
problem while processing machine learning algorithms.
To make sure our trained model generalizes well, we have to ensure that the sample training data is representative of the new cases to which we need to generalize. The training data must cover all the cases that have already occurred as well as those that are occurring.
Further, if we use non-representative training data in the model, it results in less accurate predictions. A machine learning model is said to be ideal if it predicts well for generalized cases and provides accurate decisions. If there is too little training data, there will be sampling noise in the model, producing a non-representative training set; such a model will be biased toward one class or group and will not be accurate in its predictions.
Hence, we should use representative data for training to protect against bias and make accurate predictions without any drift.
Overfitting is one of the most common issues faced by Machine Learning engineers and data scientists. Whenever a machine learning model is trained on a large amount of noisy and inaccurate data, it starts capturing that noise from the training data set, which negatively affects the performance of the model. Let's understand this with a simple example: suppose the training data set contains 1000 mangoes, 1000 apples, 1000 bananas, and 5000 papayas. Then there is a considerable probability of identifying an apple as a papaya, because we have a massive amount of biased data in the training data set; hence the prediction is negatively affected.
Underfitting:
Underfitting is just the opposite of overfitting. Whenever a machine learning model is trained with too little data, it produces incomplete and inaccurate predictions and destroys the accuracy of the machine learning model. Underfitting occurs when our model is too simple to capture the underlying structure of the data, or when the data is too small in quantity.
As we know, generalized output data is mandatory for any machine learning model; hence, regular monitoring and maintenance become compulsory. Different results for different actions require changes to the data; hence editing of code, as well as resources for monitoring the model, also becomes necessary.
A machine learning model operates under a specific context, and when that context changes it can produce bad recommendations; this is known as concept drift. For example, at a specific time a customer may be looking for certain gadgets, but the customer's requirements change over time while the machine learning model keeps showing the same recommendations even though the customer's expectations have changed. This phenomenon is called data drift. It generally occurs when new data is introduced or the interpretation of the data changes. However, we can overcome this by regularly monitoring and updating the data according to the expectations.
Although Machine Learning and Artificial Intelligence are continuously growing in the market, these fields are still relatively young compared to others. The absence of skilled resources in the form of manpower is also an issue. Hence, we need manpower with in-depth knowledge of mathematics, science, and technology for developing and managing scientific substance for machine learning.
The machine learning process itself is very complex, which is another major issue faced by machine learning engineers and data scientists. Machine Learning and Artificial Intelligence are very new technologies, still in an experimental phase and continuously changing over time. There is a great deal of hit-and-trial experimentation; hence the probability of error is higher than expected. Further, the process also includes analyzing the data, removing data bias, training the data, applying complex mathematical calculations, and so on, making the procedure more complicated and quite tedious.
Data bias is also a big challenge in Machine Learning. These errors occur when certain elements of the dataset are heavily weighted or given more importance than others. Biased data leads to inaccurate results, skewed outcomes, and other analytical errors. However, we can resolve this error by determining where the data is actually biased in the dataset and then taking the necessary steps to reduce it.
Lack of explainability basically means the outputs cannot be easily comprehended, as the model is programmed in specific ways to deliver outputs for certain conditions. Hence, a lack of explainability is also found in machine learning algorithms, which reduces the credibility of the algorithms.
Slow implementation is also very commonly seen in machine learning models. Machine learning models can be highly efficient in producing accurate results but are often time-consuming. Slow programs, excessive requirements, and overloaded data take more time than expected to provide accurate results. This requires continuous maintenance and monitoring of the model to keep delivering accurate results.
Although machine learning models are intended to give the best possible outcome, if we feed garbage data as input, then the result will also be garbage. Hence, we should use relevant features in our training sample. A machine learning model is said to be good if its training data has a good set of features with few to no irrelevant features.
CONCEPT LEARNING
Learning involves acquiring general concepts from specific training examples. Example:
People continually learn general concepts or categories such as "bird," "car," "situations in
which I should study more in order to pass the exam," etc.
Each such concept can be viewed as describing some subset of objects or events defined
over a larger set
Alternatively, each concept can be thought of as a Boolean-valued function defined over this
larger set. (Example: A function defined over all animals, whose value is true for birds and
false for other animals).
Consider the example task of learning the target concept "Days on which Aldo enjoys
his favorite water sport”
Table: Positive and negative training examples for the target concept EnjoySport.
The task is to learn to predict the value of EnjoySport for an arbitrary day, based on the values of its other attributes.
If some instance x satisfies all the constraints of hypothesis h, then h classifies x as a positive
example (h(x) = 1).
The hypothesis that Aldo enjoys his favorite sport only on cold days with high humidity is represented by the expression
(?, Cold, High, ?, ?, ?)
Notation
The set of items over which the concept is defined is called the set of instances, which is
denoted by X.
Example: X is the set of all possible days, each represented by the attributes: Sky, AirTemp,
Humidity, Wind, Water, and Forecast
The concept or function to be learned is called the target concept, which is denoted by c.
c can be any Boolean valued function defined over the instances X
c : X → {0, 1}
Example: The target concept corresponds to the value of the attribute EnjoySport
(i.e., c(x) = 1 if EnjoySport = Yes, and c(x) = 0 if EnjoySport = No).
Instances for which c(x) = 1 are called positive examples, or members of the target concept.
Instances for which c(x) = 0 are called negative examples, or non-members of the target
concept.
The ordered pair (x, c(x)) to describe the training example consisting of the instance x and
its target concept value c(x).
The goal of the learner is to find a hypothesis h such that h(x) = c(x) for all x in X.
Given:
Instances X: Possible days, each described by the attributes
Sky (with possible values Sunny, Cloudy, and Rainy),
AirTemp (with values Warm and Cold),
Humidity (with values Normal and High),
Wind (with values Strong and Weak),
Water (with values Warm and Cool),
Forecast (with values Same and Change).
Determine:
A hypothesis h in H such that h(x) = c(x) for all x in X.
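As a small illustrative sketch (assuming the tuple representation of hypotheses, where "?" accepts any value and the null constraint, written "0" here in place of Ø, accepts none), h(x) = 1 exactly when every constraint of h is satisfied by x:

# A hypothesis is a tuple of six constraints, one per attribute: "?" = any value, "0" = no value.
def satisfies(hypothesis, instance):
    """Return 1 if instance x satisfies every constraint of hypothesis h (i.e. h(x) = 1), else 0."""
    for constraint, value in zip(hypothesis, instance):
        if constraint == "0" or (constraint != "?" and constraint != value):
            return 0
    return 1

h = ("?", "Cold", "High", "?", "?", "?")                     # "enjoys sport only on cold, humid days"
x = ("Sunny", "Cold", "High", "Strong", "Cool", "Change")    # one possible day
print(satisfies(h, x))   # prints 1, so h classifies this day as a positive example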
Any hypothesis found to approximate the target function well over a sufficiently large set of
training examples will also approximate the target function well over other unobserved
examples.
Concept learning can be viewed as the task of searching through a large space of
hypotheses implicitly defined by the hypothesis representation.
The goal of this search is to find the hypothesis that best fits the training examples.
Example:
Consider the instances X and hypotheses H in the EnjoySport learning task. The attribute Sky
has three possible values, and AirTemp, Humidity, Wind, Water, Forecast each have two
possible values, the instance space X contains exactly
3 · 2 · 2 · 2 · 2 · 2 = 96 distinct instances.
FIND-S Algorithm
To illustrate this algorithm, assume the learner is given the sequence of training examples from the EnjoySport task. FIND-S begins by initializing h to the most specific hypothesis in H:
h0 = <Ø Ø Ø Ø Ø Ø>
Observing the first training example, it is clear that this hypothesis is too specific. None of the "Ø" constraints in h are satisfied by this example, so each is replaced by the next more general constraint that fits the example:
h1 = <Sunny Warm Normal Strong Warm Same>
The second training example forces the algorithm to further generalize h, this time
substituting a "?" in place of any attribute value in h that is not satisfied by the new
example
h2 = <Sunny Warm ? Strong Warm Same>
Upon encountering the third training example, the algorithm makes no change to h. The FIND-S algorithm simply ignores every negative example.
h3 = <Sunny Warm ? Strong Warm Same>
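A minimal FIND-S sketch in Python (using the tuple representation shown earlier; the four EnjoySport training examples follow Mitchell's table, and note that the fourth positive example generalizes the Water and Forecast attributes further):

# FIND-S: start with the most specific hypothesis and minimally generalize it on each positive example.
def find_s(examples):
    h = ["0"] * 6                                  # h0 = <0,0,0,0,0,0>, the most specific hypothesis
    for x, label in examples:
        if label != "Yes":                         # negative examples are simply ignored
            continue
        for i, value in enumerate(x):
            if h[i] == "0":
                h[i] = value                       # replace a null constraint with the observed value
            elif h[i] != value:
                h[i] = "?"                         # generalize a conflicting constraint to "?"
    return h

training_data = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"),   "Yes"),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"),   "Yes"),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), "No"),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), "Yes"),
]
print(find_s(training_data))   # ['Sunny', 'Warm', '?', 'Strong', '?', '?']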
Unanswered by FIND-S
Representation
Definition (version space): The version space, denoted VS_H,D, with respect to hypothesis space H and training examples D, is the subset of hypotheses from H consistent with the training examples in D:
VS_H,D = { h ∈ H | Consistent(h, D) }
The version space is represented by its most general and least general members. These
members form general and specific boundary sets that delimit the version space within the
partially ordered hypothesis space.
Definition: The general boundary G, with respect to hypothesis space H and training data D,
is the set of maximally general members of H consistent with D
Definition: The specific boundary S, with respect to hypothesis space H and training data D,
is the set of minimally general (i.e., maximally specific) members of H consistent with D.
• If d is a negative example
• Moves from general to specific
• Remove from S any hypothesis inconsistent with d
• For each hypothesis g in G that is not consistent with d
• Remove g from G
• Add to G all minimal specializations h of g such that
• h is consistent with d, and some member of S is more specific than h
• Remove from G any hypothesis that is less general than another hypothesis in G
An Illustrative Example
After processing these four examples, the boundary sets S4 and G4 delimit the version space
of all hypotheses consistent with the set of incrementally observed training examples.
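Because this hypothesis space is small, the version space can even be computed by brute force: enumerate every hypothesis and keep those consistent with all training examples. A rough, illustrative sketch (far less efficient than CANDIDATE-ELIMINATION, and written to be self-contained):

# Brute-force version space: VS = { h in H | h is consistent with every training example }.
from itertools import product

domains = [["Sunny", "Cloudy", "Rainy"], ["Warm", "Cold"], ["Normal", "High"],
           ["Strong", "Weak"], ["Warm", "Cool"], ["Same", "Change"]]

training_data = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"),   "Yes"),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"),   "Yes"),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), "No"),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), "Yes"),
]

def satisfies(h, x):
    # "?" matches anything, "0" matches nothing, otherwise the values must be equal.
    return all(c != "0" and (c == "?" or c == v) for c, v in zip(h, x))

def consistent(h, examples):
    return all(satisfies(h, x) == (label == "Yes") for x, label in examples)

# Enumerate every syntactically distinct hypothesis and keep the consistent ones.
hypotheses = product(*[d + ["?", "0"] for d in domains])
version_space = [h for h in hypotheses if consistent(h, training_data)]
print(len(version_space))   # 6 hypotheses remain - exactly those delimited by S4 and G4
for h in version_space:
    print(h)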
1. What if the target concept is not contained in the hypothesis space how the output is
predicted
2. Can we avoid this difficulty by using a hypothesis space that includes every possible
hypothesis by making more generalize
3. How does the size of this hypothesis space influence the ability of the algorithm to
generalize and make predictions
4. How does the size of the hypothesis space influence the number of training examples
that must be observed and how given hypothesis is correct prediction
Suppose the target concept is not contained in the hypothesis space H, then obvious
solution is to enrich the hypothesis space to include every possible hypothesis.
Consider the EnjoySport example in which the hypothesis space is restricted to include
only conjunctions of attribute values. Because of this restriction, the hypothesis space is
unable to represent even simple disjunctive target concepts such as
"Sky = Sunny or Sky = Cloudy."
Given the following three training examples of this disjunctive target concept, the algorithm would find that there are zero hypotheses in the version space. If the CANDIDATE-ELIMINATION algorithm is applied, it ends up with an empty version space. After the first two training examples:
S2 = <? Warm Normal Strong Cool Change>
This new hypothesis is overly general, and it incorrectly covers the third (negative) training example.
The solution to the problem of assuring that the target concept is in the hypothesis space H is
to provide a hypothesis space capable of representing every teachable concept that is
representing every possible subset of the instances X.
The set of all subsets of a set X is called the power set of X
In the EnjoySport learning task the size of the instance space X of days described by the
six attributes is 96 instances.
Thus, there are 2^96 distinct target concepts that could be defined over this instance space, and the learner might be called upon to learn any one of them.
Inductive Reasoning: Maximilian is a shelter dog. He is happy. Therefore, all shelter dogs are happy.
Deductive Reasoning: All shelter dogs are happy. Maximilian is a shelter dog. Therefore, Maximilian is happy.
• Used to generate a decision tree from a given data set by employing a top-down,
greedy search, to test each attribute at every node of the tree.
It is a tool that has applications spanning several different areas. Decision trees can be used for
classification as well as regression problems. The name itself suggests that it uses a flowchart like a tree
structure to show the predictions that result from a series of feature-based splits. It starts with a root node
and ends with a decision made by leaves.
o Decision Tree is a Supervised learning technique that can be used for both classification and Regression
problems, but mostly it is preferred for solving Classification problems. It is a tree-structured classifier,
where internal nodes represent the features of a dataset, branches represent the decision
rules and each leaf node represents the outcome.
o In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node. Decision nodes are
used to make any decision and have multiple branches, whereas Leaf nodes are the output of those
decisions and do not contain any further branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a problem/decision based on
given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which expands on further
branches and constructs a tree-like structure.
o Example: Suppose there is a candidate who has a job offer and wants to decide whether he should accept
the offer or Not. So, to solve this problem, the decision tree starts with the root node (Salary attribute by
ASM). The root node splits further into the next decision node (distance from the office) and one leaf node
based on the corresponding labels. The next decision node further gets split into one decision node (Cab
facility) and one leaf node. Finally, the decision node splits into two leaf nodes (Accepted offers and Declined
offer). Consider the below diagram:
Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies the randomness in the data. For a Boolean classification, entropy can be calculated as:
Entropy(S) = -p+ log2(p+) - p- log2(p-)
where p+ is the proportion of positive examples and p- is the proportion of negative examples in S.
While implementing a decision tree, the main issue is how to select the best attribute for the root node and for the sub-nodes. To solve this problem there is a technique called the Attribute Selection Measure, or ASM. Using this measure, we can easily select the best attribute for the nodes of the tree, for example by information gain.
Information Gain
INFORMATION GAIN MEASURES THE EXPECTED REDUCTION IN ENTROPY
Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of a dataset based on an
attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and the node/attribute having the highest information gain is split first. It can be calculated using the formula:
Gain(S, A) = Entropy(S) - Σv∈Values(A) (|Sv| / |S|) Entropy(Sv)
where Values(A) is the set of all possible values of attribute A, and Sv is the subset of S for which attribute A has value v.
Given a collection S, containing positive and negative examples of some target concept, the entropy of S relative to this Boolean classification is
Entropy(S) = -p+ log2(p+) - p- log2(p-)
Where,
p+ is the proportion of positive examples in S
p- is the proportion of negative examples in S.
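A small sketch computing entropy and information gain in Python (illustrative; the counts correspond to Mitchell's 14-day PlayTennis data, with 9 positive and 5 negative examples, and the Wind attribute splitting them into Weak = 6+/2- and Strong = 3+/3-):

# Entropy(S) = -p+ log2(p+) - p- log2(p-);  Gain(S, A) = Entropy(S) - sum(|Sv|/|S| * Entropy(Sv)).
from math import log2

def entropy(pos, neg):
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:                               # 0 * log2(0) is taken to be 0
            p = count / total
            result -= p * log2(p)
    return result

def information_gain(total_pos, total_neg, partitions):
    """partitions: list of (pos, neg) counts, one pair per value of the attribute."""
    total = total_pos + total_neg
    remainder = sum((p + n) / total * entropy(p, n) for p, n in partitions)
    return entropy(total_pos, total_neg) - remainder

# PlayTennis data: 9 Yes / 5 No overall; Wind splits into Weak (6+, 2-) and Strong (3+, 3-).
print(round(entropy(9, 5), 3))                                # 0.940
print(round(information_gain(9, 5, [(6, 2), (3, 3)]), 3))     # 0.048 for the Wind attribute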
Decision trees classify instances by sorting them down the tree from the root to some
leaf node, which provides the classification of the instance.
Each node in the tree specifies a test of some attribute of the instance, and each branch
descending from that node corresponds to one of the possible values for this attribute.
An instance is classified by starting at the root node of the tree, testing the attribute
specified by this node, then moving down the tree branch corresponding to the value of
the attribute in the given example. This process is then repeated for the subtree rooted
at the new node.
An Illustrative Example
To illustrate the operation of ID3, consider the learning task represented by the training
examples of below table.
Here the target attribute PlayTennis, which can have values yes or no for different days.
Consider the first step through the algorithm, in which the topmost node of the decision
tree is created.
ID3 determines the information gain for each candidate attribute (i.e., Outlook,
Temperature, Humidity, and Wind), then selects the one with highest information gain.
Here, the attribute with maximum information gain is Outlook. So, the decision tree built so far -
According to the information gain measure, the Outlook attribute provides the best
prediction of the target attribute, PlayTennis, over the training examples. Therefore,
Outlook is selected as the decision attribute for the root node, and branches are created
below the root for each of its possible values i.e., Sunny, Overcast, and Rain.
finding the best attribute for splitting the data with Outlook=Sunny
Here, the attribute with maximum information gain is Humidity. So, the decision tree built so far -
Here, when Outlook = Sunny and Humidity = High, it is a pure class of category "no". And When
Outlook = Sunny and Humidity = Normal, it is again a pure class of category "yes". Therefore, we
don't need to do further calculations.
Here, the attribute with maximum information gain is Wind. So, the decision tree built so far -
1. ID3's hypothesis space of all decision trees is a complete space of finite discrete-valued
functions, relative to the available attributes. Because every finite discrete-valued
function can be represented by some decision tree
ID3 avoids one of the major risks of methods that search incomplete hypothesis spaces: that the hypothesis space might not contain the target function.
2. ID3 maintains only a single current hypothesis as it searches through the space of decision trees. This contrasts, for example, with the earlier version space CANDIDATE-ELIMINATION method, which maintains the set of all hypotheses consistent with the available training examples.
3. ID3 in its pure form performs no backtracking in its search. Once it selects an attribute
to test at a particular level in the tree, it never backtracks to reconsider this choice.
In the case of ID3, corresponds to the decision tree it selects along the single search
path it explores.
4. ID3 uses all training examples at each step in the search to make statistically based
decisions regarding how to refine its current hypothesis.
Occam's razor is the problem-solving principle that the simplest solution tends to be the right one. When presented with competing hypotheses to solve a problem, one should select the solution with the fewest assumptions.
Occam's razor: “Prefer the simplest hypothesis that fits the data”.
Given a collection of training examples, there are typically many decision trees consistent with
these examples. Which of these decision trees does ID3 choose?
Approximate inductive bias of ID3: Shorter trees are preferred over larger trees
Consider an algorithm that begins with the empty tree and searches breadth first through
progressively more complex trees.
First considering all trees of depth 1, then all trees of depth 2, etc.
Once it finds a decision tree consistent with the training data, it returns the smallest
consistent tree at that search depth (e.g., the tree with the fewest nodes).
Let us call this breadth-first search algorithm BFS-ID3.
BFS-ID3 finds a shortest decision tree and thus exhibits the bias "shorter trees are preferred over longer trees".
A closer approximation to the inductive bias of ID3: Shorter trees are preferred over longer
trees. Trees that place high information gain attributes close to the root are preferred over
those that do not.
Preference bias – The inductive bias of ID3 is a preference for certain hypotheses over others
(e.g., preference for shorter hypotheses over larger hypotheses), with no hard restriction on the
hypotheses that can be eventually enumerated. This form of bias is called a preference bias or
a search bias.
CANDIDATE-ELIMINATION Algorithm:
It searches this space completely, finding every hypothesis consistent with the training
data.
Its inductive bias is solely a consequence of the expressive power of its hypothesis representation.
Restriction bias – The bias of the CANDIDATE ELIMINATION algorithm is in the form of a
categorical restriction on the set of hypotheses considered. This form of bias is typically called
a restriction bias or a language bias.
A preference bias is more desirable than a restriction bias, because it allows the learner
to work within a complete hypothesis space that is assured to contain the unknown target
function.
In contrast, a restriction bias that strictly limits the set of potential hypotheses is
generally less desirable, because it introduces the possibility of excluding the unknown
target function altogether.
The ID3 algorithm grows each branch of the tree just deeply enough to perfectly classify the
training examples but it can lead to difficulties when there is noise in the data, or when
the number of training examples is too small to produce a representative sample of the
true target function. This algorithm can produce trees that overfit the training examples.
The horizontal axis of this plot indicates the total number of nodes in the decision tree,
as the tree is being constructed. The vertical axis indicates the accuracy of predictions
made by the tree.
The solid line shows the accuracy of the decision tree over the training examples. The
broken line shows accuracy measured over an independent set of test example
The accuracy of the tree over the training examples increases monotonically as the tree
is grown. The accuracy measured over the independent test examples first increases,
then decreases.
How can it be possible for tree h to fit the training examples better than h', but for it to perform
more poorly over subsequent examples?
1. Overfitting can occur when the training examples contain random errors or noise
2. When small numbers of examples are associated with leaf nodes.
The impact of reduced-error pruning on the accuracy of the decision tree is illustrated in below
figure
The additional line in figure shows accuracy over the test examples as the tree is pruned.
When pruning begins, the tree is at its maximum size and lowest accuracy over the test
set. As pruning proceeds, the number of nodes is reduced and accuracy over the test set
increases.
The available data has been split into three subsets: the training examples, the validation
examples used for pruning the tree, and a set of test examples used to provide an
unbiased estimate of accuracy over future unseen examples. The plot shows accuracy
over the training and test sets.
Given the above rule, rule post-pruning would consider removing the preconditions
(Outlook = Sunny) and (Humidity = High)
It would select whichever of these pruning steps produced the greatest improvement in
estimated rule accuracy, then consider pruning the second precondition as a further
pruning step.
No pruning step is performed if it reduces the estimated rule accuracy.
The gain ratio measure penalizes such attributes by incorporating a term called split information, which is sensitive to how broadly and uniformly the attribute splits the data:
SplitInformation(S, A) = - Σi=1..c (|Si| / |S|) log2(|Si| / |S|)
GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
where S1 through Sc are the c subsets of examples resulting from partitioning S by the c-valued attribute A.
The data which is available may contain missing values for some attributes
Example: Medical diagnosis
<Fever = true, Blood-Pressure = normal, …, Blood-Test = ?, …>
Sometimes values truly unknown, sometimes low priority (or cost too high)
In some learning tasks the instance attributes may have associated costs.
For example: In learning to classify medical diseases, the patients described in terms
ofattributes such as Temperature, BiopsyResult, Pulse, BloodTestResults, etc.
These attributes vary significantly in their costs, both in terms of monetary cost and
cost to patient comfort
Decision trees should use low-cost attributes where possible, relying on high-cost attributes only when they are needed to produce reliable and accurate classifications.
The term "Artificial neural network" refers to a biologically inspired sub-field of artificial intelligence
modeled after the brain. An Artificial neural network is usually a computational network based on
biological neural networks that construct the structure of the human brain. Similar to a human brain has
neurons interconnected to each other, artificial neural networks also have neurons that are linked to each
other in various layers of the networks. These neurons are known as nodes.
The term "Artificial Neural Network" is derived from Biological neural networks that develop the
structure of a human brain. Similar to the human brain that has neurons interconnected to one another,
artificial neural networks also have neurons that are interconnected to one another in various layers of the
networks. These neurons are known as nodes.
The given figure illustrates the typical diagram of Biological Neural Network.
The typical Artificial Neural Network looks something like the given figure.
Dendrites from Biological Neural Network represent inputs in Artificial Neural Networks, cell nucleus
represents Nodes, synapse represents Weights, and Axon represents Output.
An Artificial Neural Network is an attempt, within the field of Artificial Intelligence, to mimic the network of neurons that makes up the human brain, so that computers can understand things and make decisions in a human-like manner. The artificial neural network is designed by programming computers to behave simply like interconnected brain cells.
There are roughly 100 billion neurons in the human brain. Each neuron is connected to other neurons through somewhere between 1,000 and 100,000 association points. In the human brain, data is stored in a distributed manner, and we can extract more than one piece of this data from memory in parallel when necessary. We can say that the human brain is made up of incredibly amazing parallel processors.
We can understand the artificial neural network with an example, consider an example of a digital logic gate
that takes an input and gives an output. "OR" gate, which takes two inputs. If one or both the inputs are
"On," then we get "On" in output. If both the inputs are "Off," then we get "Off" in output. Here the output
depends upon input. Our brain does not perform the same task. The outputs to inputs relationship keep
changing because of the neurons in our brain, which are "learning."
To understand the architecture of an artificial neural network, we have to understand what a neural network consists of: a large number of artificial neurons, termed units, arranged in a sequence of layers. Let us look at the various types of layers available in an artificial neural network.
Input Layer:
As the name suggests, it accepts inputs in several different formats provided by the programmer.
Hidden Layer:
The hidden layer presents in-between input and output layers. It performs all the calculations to find hidden
features and patterns.
Output Layer:
The input goes through a series of transformations using the hidden layer, which finally results in output
that is conveyed using this layer.
The artificial neural network takes input and computes the weighted sum of the inputs and includes a bias.
This computation is represented in the form of a transfer function.
The weighted total is then passed as input to an activation function, which produces the output. Activation functions choose whether a node should fire or not; only the nodes that fire pass their values on to the output layer. There are several distinct activation functions available that can be applied depending on the type of task being performed.
Artificial neural networks have parallel processing capability and can perform more than one task simultaneously.
After training, an ANN may produce output even with inadequate data; the loss of performance here depends upon the significance of the missing data.
The corruption of one or more cells of an ANN does not prevent it from generating output, and this feature makes the network fault-tolerant.
There is no particular guideline for determining the structure of artificial neural networks. The appropriate
network structure is accomplished through experience, trial, and error.
It is the most significant issue of ANN. When ANN produces a testing solution, it does not provide insight
concerning why and how. It decreases trust in the network.
Hardware dependence:
Artificial neural networks require processors with parallel processing power, in accordance with their structure. Therefore, the realization of the network is equipment-dependent.
ANNs can work only with numerical data, so problems must be converted into numerical values before being introduced to the ANN. The representation mechanism chosen here directly impacts the performance of the network and relies on the user's abilities.
In addition, the network is trained down to a specific value of the error, and this value does not necessarily give us optimum results.
PERCEPTRON
One type of ANN system is based on a unit called a perceptron. A perceptron is a single-layer neural network. It takes a vector of real-valued inputs, computes a linear combination of these inputs, and outputs 1 if the result is greater than some threshold and -1 otherwise:
o(x1, ..., xn) = 1 if w0 + w1x1 + w2x2 + ... + wnxn > 0, and -1 otherwise
where each wi is a real-valued constant, or weight, that determines the contribution of input xi to the perceptron output. The quantity -w0 is a threshold that the weighted combination of inputs w1x1 + ... + wnxn must surpass in order for the perceptron to output a 1.
Perceptrons can represent all of the primitive Boolean functions AND, OR, NAND (~ AND), and
NOR (~OR)
Example: Representation of AND functions
The learning problem is to determine a weight vector that causes the perceptron to produce the correct
+ 1 or - 1 output for each of the given training examples.
Drawback:
The perceptron rule finds a successful weight vector when the training examples are linearly separable, but it can fail to converge if the examples are not linearly separable.
Weights w1 = 0.6, w2 = 0.6, Threshold = 1 and Learning Rate n = 0.5 are given
This is not greater than the threshold of 1, so the output = 0. Here the target is same as calculated output.
This is not greater than the threshold of 1, so the output = 0. Here the target does not match with calculated
output.
Now,
Weights w1 = 0.6, w2 = 1.1, Threshold = 1 and Learning Rate n = 0.5 are given
This is not greater than the threshold of 1, so the output = 0. Here the target is same as calculated output.
This is greater than the threshold of 1, so the output = 1. Here the target is same as calculated output.
This is not greater than the threshold of 1, so the output = 0. Here the target does not match with calculated
output.
Now,
Weights w1 = 1.1, w2 = 1.1, Threshold = 1 and Learning Rate n = 0.5 are given
This is not greater than the threshold of 1, so the output = 0. Here the target is same as calculated output.
This is greater than the threshold of 1, so the output = 0. Here the target is same as calculated output.
This is greater than the threshold of 1, so the output = 1. Here the target is same as calculated output.
This is greater than the threshold of 1, so the output = 1. Here the target is same as calculated output.
----------------------------------------------------------------------------------------------------------------------------------
Problem 2
Weights w1 = 1.2, w2 = 0.6, Threshold = 1 and Learning Rate n = 0.5 are given
This is not greater than the threshold of 1, so the output = 0, Here the target is same as calculated output.
This is not greater than the threshold of 1, so the output = 0. Here the target is same as calculated output.
This is greater than the threshold of 1, so the output = 1. Here the target does not match with the calculated
output.
Now,
After updating weights are w1 = 0.7, w2 = 0.6 Threshold = 1 and Learning Rate n = 0.5
This is not greater than the threshold of 1, so the output = 0. Here the target is same as calculated output.
This is not greater than the threshold of 1, so the output = 0. Here the target is same as calculated output.
This is not greater than the threshold of 1, so the output = 0. Here the target is same as calculated output.
This is greater than the threshold of 1, so the output = 1. Here the target is same as calculated output.
Hence the final weights are w1= 0.7 and w2 = 0.6, Threshold = 1 and Learning Rate n = 0.5.
------------------------------------------------------------------------------------------------------------------------------
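The hand-worked updates above can be automated with a short sketch of the perceptron training rule (illustrative only; it assumes Problem 2 is training the logical AND function with the weights, threshold, and learning rate given above):

# Perceptron training rule sketch: w_i <- w_i + n * (target - output) * x_i, with a fixed threshold.
def train_perceptron(samples, w1, w2, threshold=1.0, n=0.5, epochs=10):
    for _ in range(epochs):
        updated = False
        for x1, x2, target in samples:
            weighted_sum = w1 * x1 + w2 * x2
            output = 1 if weighted_sum > threshold else 0
            if output != target:                      # update the weights only when the output is wrong
                w1 += n * (target - output) * x1
                w2 += n * (target - output) * x2
                updated = True
        if not updated:                               # stop once every example is classified correctly
            break
    return w1, w2

# AND function training data (x1, x2, target), assumed to match Problem 2 above.
and_data = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)]
print(train_perceptron(and_data, w1=1.2, w2=0.6))     # (0.7, 0.6)

Running this sketch reproduces the final weights w1 = 0.7 and w2 = 0.6 found in Problem 2.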
ANN learning is well-suited to problems in which the training data corresponds to noisy,
complex sensor data, such as inputs from cameras and microphones.
A multi-layer perceptron defines the most complex architecture of artificial neural networks; it is substantially formed from multiple layers of perceptrons. TensorFlow, a very popular deep learning framework released by Google, can be used to build such a neural network. The pictorial representation of multi-layer perceptron learning is shown below.
MLP networks are used for supervised learning. A typical learning algorithm for MLP networks is the back-propagation algorithm.
A multilayer perceptron (MLP) is a feed-forward artificial neural network that generates a set of outputs from a set of inputs. An MLP is characterized by several layers of nodes connected as a directed graph between the input and output layers. MLP uses back propagation for training the network. MLP is a deep learning method.
Here the speech recognition task involves distinguishing among 10 possible vowels, all spoken in the context of "h_d" (i.e., "hid," "had," "head," "hood," etc.).
The network input consists of two parameters, F1 and F2, obtained from a spectral analysis of the sound. The 10 network outputs correspond to the 10 possible vowel sounds. The network prediction is the output whose value is highest.
The plot on the right illustrates the highly nonlinear decision surface represented by the learned network. Points shown on the plot are test examples distinct from the examples used to train the network.
In simple terms, after each feed-forward passes through a network, this algorithm does the backward pass
to adjust the model’s parameters based on weights and biases. A typical supervised learning algorithm
attempts to find a function that maps input data to the right output. Back propagation works with a multi-
layered neural network and learns internal representations of input to output mapping.
1. Input layer
2. Hidden layer
3. Output layer
The actual performance of backpropagation on a specific problem is dependent on the input data.
The back propagation algorithm can be quite sensitive to noisy data.
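A compact, illustrative back-propagation sketch (assuming NumPy; the network sizes, learning rate, and the XOR training data are chosen only for demonstration):

# Minimal multi-layer perceptron trained with back propagation (illustrative sketch using NumPy).
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # inputs
T = np.array([[0], [1], [1], [0]], dtype=float)               # XOR targets

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer with 4 units and one output unit; small random initial weights.
W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros(4)
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros(1)
eta = 0.5

for epoch in range(10000):
    # Forward pass: input layer -> hidden layer -> output layer.
    H = sigmoid(X @ W1 + b1)
    O = sigmoid(H @ W2 + b2)

    # Backward pass: propagate the error and compute gradients (squared-error loss, sigmoid units).
    delta_out = (O - T) * O * (1 - O)                 # error term for the output units
    delta_hidden = (delta_out @ W2.T) * H * (1 - H)   # error term for the hidden units

    # Update every weight in the direction that reduces the error.
    W2 -= eta * H.T @ delta_out
    b2 -= eta * delta_out.sum(axis=0)
    W1 -= eta * X.T @ delta_hidden
    b1 -= eta * delta_hidden.sum(axis=0)

O = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
print(np.round(O.ravel(), 2))   # the outputs should approach [0, 1, 1, 0] after training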
The key idea behind the delta rule is to use gradient descent to search the hypothesis space
of possible weight vectors to find the weights that best fit the training examples.
To understand the delta training rule, consider the task of training an unthresholded perceptron, that is, a linear unit for which the output o is given by
o(x) = w0 + w1x1 + w2x2 + ... + wnxn
To derive a weight learning rule for linear units, we begin by specifying a measure of the training error of a hypothesis (weight vector) relative to the training examples:
E(w) = (1/2) Σd∈D (td - od)²
Where,
D is the set of training examples,
td is the target output for training example d,
od is the output of the linear unit for training example d.
E(w) is simply half the squared difference between the target output td and the linear unit output od, summed over all training examples.
To understand the gradient descent algorithm, it is helpful to visualize the entire hypothesis
space of possible weight vectors and their associated E values as shown in below figure.
Here the axes w0 and wl represent possible values for the two weights of a simple linear unit.
The w0, wl plane therefore represents the entire hypothesis space.
The vertical axis indicates the error E relative to some fixed set of training examples.
The arrow shows the negated gradient at one particular point, indicating the direction inthe w0,
wl plane producing steepest descent along the error surface.
The error surface shown in the figure thus summarizes the desirability of every weight vector
in the hypothesis space
Gradient descent search determines a weight vector that minimizes E by starting with an arbitrary
initial weight vector, then repeatedly modifying it in small steps.
At each step, the weight vector is altered in the direction that produces the steepest descent along the
error surface depicted in above figure. This process continues until the global minimum error is
reached.
How do we calculate the direction of steepest descent along the error surface?
The direction of steepest descent can be found by computing the derivative of E with respect to each component of the vector w. This vector derivative is called the gradient of E with respect to w, written as
∇E(w) = [∂E/∂w0, ∂E/∂w1, ..., ∂E/∂wn]
Since the gradient specifies the direction of steepest increase of E, the training rule for gradient descent is
w ← w + Δw, where Δw = -η ∇E(w)
Here η is a positive constant called the learning rate, which determines the step size in the gradient descent search. The negative sign is present because we want to move the weight vector in the direction that decreases E.
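A minimal sketch of batch gradient descent for a linear unit, i.e. the rule Δwi = η Σd (td - od) xid (illustrative only; the training data below is invented and generated from known weights so the result can be checked):

# Gradient descent (delta rule) for a linear unit: w_i <- w_i + eta * sum_d (t_d - o_d) * x_id.
def gradient_descent(examples, n_weights, eta=0.05, epochs=1000):
    w = [0.0] * n_weights                          # includes w0, with x0 fixed at 1
    for _ in range(epochs):
        delta = [0.0] * n_weights
        for x, t in examples:
            xs = [1.0] + list(x)                   # prepend x0 = 1 for the bias weight w0
            o = sum(wi * xi for wi, xi in zip(w, xs))
            for i, xi in enumerate(xs):
                delta[i] += eta * (t - o) * xi     # accumulate the update over all examples
        w = [wi + di for wi, di in zip(w, delta)]  # one step in the direction of steepest descent
    return w

# Invented training data generated by t = 1 + 2*x1 - 3*x2; gradient descent recovers these weights.
data = [((0, 0), 1), ((1, 0), 3), ((0, 1), -2), ((1, 1), 0), ((2, 1), 2)]
print([round(wi, 2) for wi in gradient_descent(data, n_weights=3)])   # approximately [1.0, 2.0, -3.0]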
Example of Feed Forward Network and back propagation in Real time: Face recognition
Face recognition using neural networks is about improving the performance of face detection by using neural technology. The fundamental part of face recognition is carried out by different multilayer neural networks that work on detecting the face, and the back-propagation algorithm is used to correct the error when a face goes undetected by the machine.
For example, in the 1995 paper titled “Human and machine recognition of faces: A survey,” the
authors describe three face recognition tasks:
Face Matching: Find the best match for a given face.
Face Similarity: Find faces that are most similar to a given face.
Face Transformation: Generate new faces that are similar to a given face.
With face detection, you can get the information you need to perform tasks like embellishing
selfies and portraits, or generating avatars from a user's photo.
Recognize and locate facial features: get the coordinates of the eyes, ears, cheeks, nose, and mouth of every face detected.
Get the contours of facial features: get the contours of detected faces and of their eyes, eyebrows, lips, and nose.
Recognize facial expressions: determine whether a person is smiling or has their eyes closed.
Track faces across video frames: get an identifier for each unique detected face. The identifier is consistent across invocations, so you can perform image manipulation on a particular person in a video stream.
Process video frames in real time: face detection is performed on the device and is fast enough to be used in real-time applications, such as video manipulation.
Face alignment
Face alignment is a computer vision technology for identifying the geometric structure of
human faces in digital images. Given the location and size of a face, it automatically
determines the shape of the face components such as eyes and nose.
Feature extraction refers to the process of transforming raw data into numerical features
that can be processed while preserving the information in the original data set.
Feature matching refers to finding corresponding features from two similar images based on
a search distance algorithm. One of the images is considered the source and the other the target,
and the feature matching technique is used to either find or derive and transfer attributes from
the source to the target image.
---------------------------------------------------------------------------------------------------------------------
The basic BACKPROPAGATION algorithm defines E in terms of the sum of squared errors
of the network; other definitions have been suggested in order to incorporate other constraints,
for example adding a weight-decay penalty term γ Σ w²ji to E, where γ is a constant and the wji are
the network weights being penalized so that large weights (and the overfitting they cause) are reduced.
1. Weight-update method
Direction: choosing a direction in which to alter the current weight vector (e.g. the negated
gradient in BACKPROPAGATION), combined with Distance: choosing a distance to move
(e.g. the learning rate η).
Gradient descent is computationally efficient but provides a slow rate of convergence. This is where
line search comes into play: it provides a much better rate of convergence at a slight
increase in computational cost.
The conjugate gradient method is a mathematical technique that can be useful for the
optimization of both linear and non-linear systems. This technique is generally used as an
iterative algorithm; however, it can also be used as a direct method, in which case it produces a
numerical solution. Generally this method is used for very large systems where it is not practical
to solve them directly with gradient descent backpropagation.
3. Recurrent Networks
A recurrent neural network (RNN) is a type of artificial neural network which uses sequential
data or time series data, which are dynamic in nature. These deep learning algorithms are
commonly used for ordinal or temporal problems, such as language translation and natural language
processing.
Like feedforward and convolutional neural networks (CNNs), recurrent neural networks utilize
training data to learn. They are distinguished by their "memory", as they take information from
prior inputs to influence the current input and output within a specific time step.
Dynamic neural networks can be considered an improvement on static neural networks: by adding more
decision logic, the network can learn dynamically from the input and generate better quality results.
The network structure can also be modified in the hidden layer, adding and pruning units for better
results and accuracy.
-------------------------------------------------------------------------------------------------------------------
1. Boolean functions – Every boolean function can be represented exactly by some network
with two layers of units, although the number of hidden units required grows exponentially
in the worst case with the number of network inputs
Hypothesis space is the n-dimensional Euclidean space of the n network weights and
hypothesis space is continuous.
As it is continuous, E is differentiable with respect to the continuous parameters of the
hypothesis, results in a well-defined error gradient that provides a very useful structure for
organizing the search for the best hypothesis.
However, one can roughly characterize it as smooth interpolation between the data points,
linking the input, hidden, and output layers.
BACKPROPAGATION can define new hidden layer features that are not explicit in the
input representation, but which capture properties of the input instances that are most
relevant to learning the target function.
5. Generalization, overfitting, and stopping criterion: generalizing the weights reduces error,
but overfitting the training data makes the error-reduction process harder, so a stopping
criterion is needed to decide when the error is good enough (ideally at the global minimum).
One choice is to continue training until the error E on the training examples falls below
some predetermined threshold.
To see the dangers of minimizing the error over the training data, consider how the
error E varies with the number of weight iterations.
The error measured over a separate set of validation examples first decreases,
then increases, even as the error over the training examples continues to decrease.
When the training data set grows, overfitting increases the complexity of the error surface,
and reducing the error may take many iterations of weight updates, which becomes tedious.
The estimation is based on sample data and error frequency in different instances
This is made clear by distinguishing between the true error of a model and the estimated or
sample error.
The sample error (errorS(h)) of hypothesis h with respect to target function f and data
sample S is
errorS(h) ≡ (1/n) Σ x∈S δ(f(x), h(x))
where n is the number of examples in S, and the quantity δ(f(x), h(x)) is 1 if f(x) ≠ h(x)
and 0 otherwise.
The true error (errorD(h)) of hypothesis h with respect to target function f and distribution
D is the probability that h will misclassify an instance drawn at random according to D:
errorD(h) ≡ Pr x∈D [f(x) ≠ h(x)]
Suppose we wish to estimate the true error for some discrete valued hypothesis h, based on its
observed sample error over a sample S, where
The sample S contains n examples drawn independently of one another, and independently
of h, according to the probability distribution D
n ≥ 30
Hypothesis h commits r errors over these n examples (i.e., errorS(h) = r/n).
Example:
Suppose the data sample S contains n = 40 examples and that hypothesis h commits r =
12 errors over this data.
The sample error is errors(h) = r/n = 12/40 = 0.30
Given no other information, true error is errorD (h) = errors(h), i.e., errorD (h)
=0.30
With approximately 95% probability, the true error errorD(h) lies in the interval
errorS(h) ± 1.96 √(errorS(h)(1 − errorS(h))/n) = 0.30 ± 1.96 × √(0.30 × 0.70/40) ≈ 0.30 ± 0.14.
3. A different constant, zN, is used to calculate the N% confidence interval. The general
expression for approximate N% confidence intervals for errorD(h) is
errorS(h) ± zN √(errorS(h)(1 − errorS(h))/n)
where the constant zN is chosen according to the desired confidence level N%:
N%:  50%   68%   80%   90%   95%   98%   99%
zN:  0.67  1.00  1.28  1.64  1.96  2.33  2.58
Example:
Suppose the data sample S contains n = 40 examples and that hypothesis h commits r =
12 errors over this data.
The sample error is errors(h) = r/n = 12/40 = 0.30
The 68% confidence interval estimate for errorD(h) is 0.30 ± 1.00 × √(0.30 × 0.70/40) ≈ 0.30 ± 0.07.
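A small helper that evaluates the N% confidence interval formula above; the zN constants are the standard values, and the calls reproduce the r = 12, n = 40 example:

```python
import math

# z_N constants for common two-sided confidence levels
Z = {0.50: 0.67, 0.68: 1.00, 0.80: 1.28, 0.90: 1.64, 0.95: 1.96, 0.98: 2.33, 0.99: 2.58}

def confidence_interval(r, n, level=0.95):
    """N% confidence interval for the true error, given r errors over n examples (n >= 30)."""
    e = r / n                                            # sample error errorS(h)
    half_width = Z[level] * math.sqrt(e * (1 - e) / n)   # zN * sqrt(e(1-e)/n)
    return e - half_width, e + half_width

print(confidence_interval(12, 40, 0.95))   # roughly (0.16, 0.44)
print(confidence_interval(12, 40, 0.68))   # roughly (0.23, 0.37)
```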
The application of sampling theory is concerned not only with the proper
selection of observations from the population that will
constitute the random sample; it also involves the use of
probability theory, along with prior knowledge about the
population parameters, to analyze the data from the random sample
and develop conclusions from the analysis
If the experiment were rerun, generating a new set of n coin tosses, we might expect the
number of heads r to vary somewhat from the value measured in the first experiment,
yielding a somewhat different estimate for p.
The Binomial distribution describes, for each possible value of r (i.e., from 0 to n), the
probability of observing exactly r heads in a sample of n independent tosses of a coin whose true
probability of heads is p:
P(r) = n! / (r!(n − r)!) · p^r (1 − p)^(n − r)
Null hypothesis :- In inferential statistics (where we make predictions, or "inferences", from data),
the null hypothesis is a general statement or default position that there is no relationship between
two measured phenomena, or no association among groups.
In other words, it is a basic assumption made based on domain or problem knowledge.
Ex: a company's production is 50 units per day.
Alternative hypothesis :-
The alternative hypothesis is the hypothesis used in hypothesis testing that is contrary to the null
hypothesis. It is usually taken to be that the observations are the result of a real effect (with some
evidence)
Level of significance: Refers to the degree of significance in which we accept or reject the null-
hypothesis. 100% accuracy is not possible for accepting or rejecting a hypothesis, so we therefore
select a level of significance that is usually 5%.
This is normally denoted with alpha(maths symbol ) and generally it is 0.05 or 5% , which means
your output should be 95% confident to give similar kind of result in each sample.
One tailed test :- A test of a statistical hypothesis , where the region of rejection is on
only one side of the sampling distribution , is called a one-tailed test.
Two-tailed test :- A two-tailed test is a statistical test in which the critical area of a distribution
is two-sided and tests whether a sample is greater than or less than a certain range of values. If the
sample being tested falls into either of the critical areas, the alternative hypothesis is accepted
instead of the null hypothesis.
Some of widely used hypothesis testing type( not in syllabus)
1. T Test
Bayes Theorem is also widely used in the field of machine learning. Including its use in a
probability framework for fitting a model to a training dataset, referred to as maximum a
posteriori or MAP for short, and in developing models for classification predictive modeling
problems such as the Bayes Optimal Classifier and Naive Bayes.
Joint Probability: Probability of two (or more) simultaneous events, e.g. P(A and B) or P(A, B).
The conditional probability is the probability of one event given the occurrence of another event,
often described in terms of events A and B from two dependent random variables e.g. X and Y.
Conditional Probability: Probability of one (or more) event given the occurrence of another
event, e.g. P(A given B) or P(A | B).
The joint probability can be calculated using the conditional probability; for example:
P(A, B) = P(A | B) × P(B), and note that the joint probability is symmetric: P(A, B) = P(B, A).
The conditional probability can be calculated using the joint probability; for example:
P(A | B) = P(A, B) / P(B).
Bayes theorem is a theorem in probability and statistics, named after the Reverend Thomas
Bayes, that helps in determining the probability of an event that is based on some event that has
already occurred. Bayes theorem has many applications such as Bayesian inference, and in the
healthcare sector - to determine the chances of developing health problems with an increase in
age and many others.
It can be helpful to think about the calculation from these different perspectives and help to map
your problem onto the equation.
Firstly, in general, the result P(A|B) is referred to as the posterior probability and P(A) is
referred to as the prior probability.
P(A|B): Posterior probability.
P(A): Prior probability.
Sometimes P(B|A) is referred to as the likelihood and P(B) is referred to as the evidence.
P(B|A): Likelihood.
P(B): Evidence.
This allows Bayes Theorem to be restated as:
Posterior = Likelihood × Prior / Evidence, i.e. P(A|B) = P(B|A) × P(A) / P(B)
Notations
P(h) prior probability of h, reflects any background knowledge about the
chance that h is correct
P(D) prior probability of D, probability that D will be observed
P(D|h) probability of observing D given a world in which h holds
P(h|D) posterior probability of h, reflects confidence that h holds after D
has been observed
Example
Consider a medical diagnosis problem in which there are two alternative
hypotheses:
(1) that the patient has particular form of cancer, and (2) that the patient
does not. The available data is from a particular laboratory test with two
possible outcomes: + (positive) and - (negative).
We have prior knowledge that over the entire population of people only
.008 have this disease. Furthermore, the lab test is only an imperfect
indicator of the disease.
The test returns a correct positive result in only 98% of the cases in which
the disease is actually present and a correct negative result in only 97% of
the cases in which the disease is not present. In other cases, the test
returns the opposite result.
The above situation can be summarized by the following probabilities:
P(cancer) = 0.008, P(¬cancer) = 0.992
P(+ | cancer) = 0.98, P(− | cancer) = 0.02
P(+ | ¬cancer) = 0.03, P(− | ¬cancer) = 0.97
Suppose a new patient is observed for whom the lab test returns a positive
(+) result. Should we diagnose the patient as having cancer or not?
Comparing P(+ | cancer)P(cancer) = 0.98 × 0.008 = 0.0078 with
P(+ | ¬cancer)P(¬cancer) = 0.03 × 0.992 = 0.0298 shows that hMAP = ¬cancer.
The exact posterior probabilities can also be determined by normalizing the above
quantities so that they sum to 1, giving P(cancer | +) ≈ 0.21 and P(¬cancer | +) ≈ 0.79.
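The same calculation as a short script, using the probabilities summarized above:

```python
# Posterior computation for the cancer-diagnosis example above.
p_cancer = 0.008                     # prior P(cancer)
p_not_cancer = 1 - p_cancer          # prior P(not cancer) = 0.992
p_pos_given_cancer = 0.98            # test correctly positive in 98% of diseased cases
p_pos_given_not_cancer = 0.03        # test wrongly positive in 3% of healthy cases

# Unnormalized quantities P(+|h) P(h) for each hypothesis
q_cancer = p_pos_given_cancer * p_cancer              # 0.0078
q_not_cancer = p_pos_given_not_cancer * p_not_cancer  # about 0.0298

# Normalize so the posteriors sum to 1
p_cancer_given_pos = q_cancer / (q_cancer + q_not_cancer)
print(round(p_cancer_given_pos, 3))   # about 0.21, so the MAP hypothesis is "not cancer"
```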
What is the relationship between Bayes theorem and the problem of concept learning?
Let’s choose P(h) and P(D|h) to be consistent with the following assumptions:
The training data D is noise free (i.e., di = c(xi))
The target concept c is contained in the hypothesis space H
Do not have a priori reason to believe that any hypothesis is more probable
than any other.
What values should we specify for P(h)?
Given no prior knowledge that one hypothesis is more likely than another,
it is reasonable to assign the same prior probability to every hypothesis h in
H.
Assume the target concept is contained in H and require that these prior
probabilities sum to 1; this gives P(h) = 1/|H| for every h in H.
P(D|h) is the probability of observing the target values D = (d1 . . .dm) for
the fixed set of instances (x1 . . . xm), given a world in which hypothesis h
holds
Since we assume noise-free training data, the probability of observing
classification di given h is just 1 if di = h(xi) and 0 if di ≠ h(xi). Therefore,
P(D|h) = 1 if h is consistent with D (di = h(xi) for every di), and 0 otherwise.
Given these choices for P(h) and for P(D|h) we now have a fully-defined problem
for the above BRUTE-FORCE MAP LEARNING algorithm.
where VSH,D is the subset of hypotheses from H that are consistent with D (the version space).
To summarize, Bayes theorem implies that the posterior probability P(h|D) under our
assumed P(h) and P(D|h) is
P(h|D) = 1/|VSH,D| if h is consistent with D, and P(h|D) = 0 otherwise.
A straightforward Bayesian analysis will show that under certain assumptions any
learning algorithm that minimizes the squared error between the output hypothesis
predictions and the training data will output a maximum likelihood (ML) hypothesis
Given that the noise ei obeys a Normal distribution with zero mean and unknown
variance σ², each di must also obey a Normal distribution around the true
target value f(xi). Because we are writing the expression for P(D|h), we assume h is
the correct description of f.
Maximize the less complicated logarithm instead, which is justified because the logarithm
is a monotonic function.
Thus, the above equation shows that the maximum likelihood hypothesis hML is the one
that minimizes the sum of the squared errors between the observed training values
di and the hypothesis predictions h(xi):
hML = argmin h∈H Σi (di − h(xi))²
A brute-force way would be to first collect the observed frequencies of
1's and 0's for each possible value of x and then train the neural network
to output the target frequency for each x.
Use Equation (4) to substitute for P(di | h, xi) in Equation (5) to obtain
Equation (7) describes the quantity that must be maximized in order to obtain the
maximum likelihood hypothesis in our current problem setting
Example:
FIND-S outputs a consistent hypothesis, it will output a MAP hypothesis
under the probability distributions P(h) and P(D|h) defined above.
Are there other probability distributions for P(h) and P(D|h) under which
FIND-S outputs MAP hypotheses? Yes.
Because FIND-S outputs a maximally specific hypothesis from the version
space, its output hypothesis will be a MAP hypothesis relative to any prior
probability distribution that favors more specific hypotheses.
Note
This equation (1) can be interpreted as a statement that short hypotheses are
preferred, assuming a particular representation scheme for encoding hypotheses
and data
-log2P(h): the description length of h under the optimal encoding for the
hypothesis space H, LCH (h) = −log2P(h), where CH is the optimal code for
hypothesis space H.
-log2P(D | h): the description length of the training data D given hypothesis h, under its
optimal encoding, where CH and CD|h are the optimal encodings for H and for D given h respectively.
More generally, the MDL principle allows arbitrary codes C1 and C2 to represent the hypothesis and
the data given the hypothesis.
Bayes Optimal Classifier is a probabilistic model that finds the most probable prediction using
the training data and space of hypotheses to make a prediction for a new data instance.
To develop some intuitions consider a hypothesis space containing three hypotheses, hl, h2, and
h3. Suppose that the posterior probabilities of these hypotheses given the training data are .4, .3,
and .3 respectively. Thus, hl is the MAP hypothesis. Suppose a new instance x is encountered,
which is classified positive by hl, but negative by h2 and h3.
Taking all hypotheses into account, the probability that x is positive is .4 (the probability
associated with h1), and the probability that it is negative is therefore .6.
The most probable classification (negative) in this case is different from the classification
generated by the MAP hypothesis. In general, the most probable classification of the new
instance is obtained by combining the predictions of all hypotheses, weighted by their posterior
probabilities.
If the possible classification of the new example can take on any value vj from some set V, then
the probability P(vj|D) that the correct classification for the new instance is vj is just
P(vj|D) = Σ hi∈H P(vj|hi) P(hi|D)
and the Bayes optimal classification is the value vj for which P(vj|D) is maximum.
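A tiny sketch of the Bayes optimal combination for the three-hypothesis example above; since each hypothesis here predicts a single class deterministically, P(vj|hi) is simply 1 or 0:

```python
# Bayes optimal classification for the h1, h2, h3 example above.
posteriors = {'h1': 0.4, 'h2': 0.3, 'h3': 0.3}      # P(hi | D)
predictions = {'h1': '+', 'h2': '-', 'h3': '-'}      # classification of x by each hi

def bayes_optimal(posteriors, predictions, classes=('+', '-')):
    """Return the class vj maximizing sum_i P(vj | hi) * P(hi | D)."""
    totals = {v: sum(p for h, p in posteriors.items() if predictions[h] == v) for v in classes}
    return max(totals, key=totals.get), totals

print(bayes_optimal(posteriors, predictions))   # ('-', {'+': 0.4, '-': 0.6})
```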
2.Naive Bayes. Assume that variables in the input data are conditionally independent.
1.Gibbs Algorithm
Gibbs sampling (also called alternating conditional sampling) is a Markov Chain Monte
Carlo algorithm for high-dimensional data such as image processing and micro arrays.
2.Naive Bayes:
Problem: If the weather is sunny, then the Player should play or not?
Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
Weather Yes No
Overcast 5 0
Rainy 2 2
Sunny 3 2
Total 10 4
Likelihood table:
Weather    No          Yes
Overcast   0           5           5/14 = 0.36
Rainy      2           2           4/14 = 0.29
Sunny      2           3           5/14 = 0.35
All        4/14 = 0.29 10/14 = 0.71
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.30, P(Yes) = 0.71, P(Sunny) = 0.35
So P(Yes|Sunny) = 0.30 * 0.71 / 0.35 ≈ 0.60
P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.5, P(No) = 0.29, P(Sunny) = 0.35
So P(No|Sunny) = 0.5 * 0.29 / 0.35 ≈ 0.41
Since P(Yes|Sunny) > P(No|Sunny), the player should play on a sunny day.
P(vj) is the prior probability that the data belongs to class vj.
Example: the target class is whether the data belongs to a class (yes or no), or (like or dislike).
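The same Naive Bayes calculation as a short script, using the counts from the frequency table above:

```python
# Naive Bayes calculation for "should the player play when the weather is Sunny?"
p_sunny_given_yes = 3 / 10     # from the frequency table: Sunny & Yes = 3, total Yes = 10
p_yes = 10 / 14                # prior P(Yes) ~ 0.71
p_sunny_given_no = 2 / 4       # Sunny & No = 2, total No = 4
p_no = 4 / 14                  # prior P(No) ~ 0.29
p_sunny = 5 / 14               # P(Sunny) ~ 0.35

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny    # ~ 0.60
p_no_given_sunny = p_sunny_given_no * p_no / p_sunny       # ~ 0.41
print(p_yes_given_sunny > p_no_given_sunny)                # True -> the player should play
```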
Bayesian belief network is key computer technology for dealing with probabilistic events and to
solve a problem which has uncertainty. We can define a Bayesian network as:
"A Bayesian network is a probabilistic graphical model which represents a set of variables and
their conditional dependencies using a directed acyclic graph."
It is also called a Bayes network, belief network, decision network, or Bayesian model.
Real world applications are probabilistic in nature, and to represent the relationship between
multiple events, we need a Bayesian network. It can also be used in various tasks
including prediction, anomaly detection, diagnostics, automated insight, reasoning, time
series prediction, and decision making under uncertainty.
The generalized form of Bayesian network that represents and solves decision problems under
uncertain knowledge is known as an influence diagram.
A Bayesian network graph is made up of nodes and arcs (directed links), and has two components:
o the causal component (the directed acyclic graph structure)
o the actual numbers (the conditional probability tables)
Each node in the Bayesian network has condition probability distribution P(Xi |Parent(Xi) ),
which determines the effect of the parent on that node.
Let's understand the Bayesian network through an example by creating a directed acyclic graph:
Example: Harry installed a new burglar alarm at his home to detect burglary. The alarm reliably
responds at detecting a burglary but also responds for minor earthquakes. Harry has two
neighbors David and Sophia, who have taken a responsibility to inform Harry at work when they
hear the alarm. David always calls Harry when he hears the alarm, but sometimes he got
confused with the phone ringing and calls at that time too. On the other hand, Sophia likes to
listen to high music, so sometimes she misses to hear the alarm. Here we would like to compute
the probability of Burglary Alarm.
Problem:
Calculate the probability that alarm has sounded, but there is neither a burglary, nor an
earthquake occurred, and David and Sophia both called
Solution:
o Burglary (B)
o Earthquake(E)
o Alarm(A)
o David Calls(D)
o Sophia calls(S)
P(E = False) = 0.999, which is the probability that an earthquake has not occurred.
The Conditional probability of David that he will call depends on the probability of Alarm.
The Conditional probability of Sophia that she calls is depending on its Parent Node "Alarm."
From the formula of the joint distribution, we can write the problem statement as:
P(S, D, A, ¬B, ¬E) = P(S|A) · P(D|A) · P(A|¬B, ¬E) · P(¬B) · P(¬E)
= 0.75 × 0.91 × 0.001 × 0.998 × 0.999
= 0.00068045.
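A short script reproducing the joint-probability calculation; note that only P(E = False) = 0.999 is stated explicitly above, so the remaining conditional-probability-table entries below are the values this standard worked example assumes:

```python
# Joint probability P(S, D, A, not B, not E) for the burglar-alarm network.
# Only P(E=False)=0.999 is given above; the other table entries are assumed from the
# usual version of this worked example.
p_not_b = 0.998                  # P(Burglary = False)
p_not_e = 0.999                  # P(Earthquake = False)
p_a_given_not_b_not_e = 0.001    # P(Alarm | no burglary, no earthquake)
p_d_given_a = 0.91               # P(David calls | Alarm)
p_s_given_a = 0.75               # P(Sophia calls | Alarm)

joint = p_s_given_a * p_d_given_a * p_a_given_not_b_not_e * p_not_b * p_not_e
print(round(joint, 8))           # 0.00068045
```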
----------------------------------------------------------------------------------------------------
Expectation-Maximization Algorithm
Expectation-Maximization algorithm can be used for the latent variables (variables that are
not directly observable and are actually inferred from the values of the other observed
variables) too in order to predict their values with the condition that the general form of
probability distribution governing those latent variables is known to us. This algorithm is
actually at the base of many unsupervised clustering algorithms in the field of machine
learning.
It was explained, proposed and given its name in a paper published in 1977 by Arthur
Dempster, Nan Laird, and Donald Rubin. It is used to find the local maximum likelihood
parameters of a statistical model in the cases where latent variables are involved and the data
is missing or incomplete.
Algorithm:
1. Given a set of incomplete data, consider a set of starting parameters.
2. Expectation step (E – step): Using the observed available data of the dataset, estimate
(guess) the values of the missing data.
3. Maximization step (M – step): Complete data generated after the expectation (E) step is
used in order to update the parameters.
4. Repeat step 2 and step 3 until convergence.
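A minimal EM sketch for a mixture of two 1-D Gaussians, showing the E-step (estimate responsibilities for the hidden component labels) and the M-step (re-estimate the parameters); the equal fixed mixing weights, the synthetic data, and the iteration count are illustrative assumptions:

```python
import numpy as np

def em_two_gaussians(x, iterations=50):
    """Minimal EM for a mixture of two 1-D Gaussians (equal mixing weights kept fixed)."""
    mu = np.array([x.min(), x.max()])        # starting guesses for the two means
    sigma = np.array([x.std(), x.std()])     # starting guesses for the two std deviations
    for _ in range(iterations):
        # E-step: responsibility of each component for each point (the "guessed" hidden labels)
        dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / sigma
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the expected (completed) data
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return mu, sigma

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])
print(em_two_gaussians(data))    # means close to 0 and 5
```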
Usage of EM algorithm –
It can be used to fill the missing data in a sample.
It can be used as the basis of unsupervised learning of clusters.
It can be used for the purpose of estimating the parameters of Hidden Markov Model (HMM).
It can be used for discovering the values of latent variables.
Advantages of EM algorithm –
It is always guaranteed that likelihood will increase with each iteration.
The E-step and M-step are often pretty easy for many problems in terms of implementation.
Disadvantages of EM algorithm –
It has slow convergence.
It makes convergence to the local optima only.
Probably Approximately Correct (PAC) framework, we identify classes of hypotheses that can and
cannot be learned from a polynomial number of training examples and we define a natural measure of
complexity for hypothesis spaces that allows bounding the number of training examples required for
learning. Within the mistake bound framework, we examine the number of training errors that will be
made by a learner before it determines the correct hypothesis
1. Sample complexity. How many training examples are needed for a learner to converge (with
high probability) to a successful hypothesis?
2. Computational complexity. How much computational effort is needed for a learner to converge
(with high probability) to a successful hypothesis?
3. Mistake bound. How many training examples will the learner misclassify before converging to
a successful hypothesis?
Error of a Hypothesis
This is made clear by distinguishing between the true error of a model and the estimated or
sample error.
The sample error (errorS(h)) of hypothesis h with respect to target function f and data
sample S is errorS(h) ≡ (1/n) Σ x∈S δ(f(x), h(x)),
where n is the number of examples in S, and the quantity δ(f(x), h(x)) is 1 if f(x) ≠ h(x)
and 0 otherwise.
The true error (errorD(h)) of hypothesis h with respect to target function f and
distribution D is the probability that h will misclassify an instance drawn at random
according to D: errorD(h) ≡ Pr x∈D [f(x) ≠ h(x)].
Suppose we wish to estimate the true error for some discrete valued hypothesis h, based on its
observed sample error over a sample S, where
The sample S contains n examples drawn independently of one another, and independently of h,
according to the probability distribution D
n ≥ 30
Hypothesis h commits r errors over these n examples (i.e., errorS(h) = r/n).
Under these conditions, statistical theory allows to make the following assertions:
2. Given no other information, the most probable value of errorD (h) is errors(h)
3. With approximately 95% probability, the true error errorD(h) lies in the interval
errorS(h) ± 1.96 √(errorS(h)(1 − errorS(h))/n)
Suppose the data sample S contains n = 40 examples and that hypothesis h commits r = 12
errors over this data.
The sample error is errors(h) = r/n = 12/40 = 0.30
Given no other information, true error is errorD (h) = errors(h), i.e., errorD (h) =0.30
The 95% confidence interval estimate for errorD(h) is 0.30 ± 1.96 × √(0.30 × 0.70/40) ≈ 0.30 ± 0.14.
The difference between the true error and the sample error is caused by sampling error, which can
be of several types: population-specific error (surveying the wrong people), selection error,
sample-frame error (the wrong frame or window selected for the sample), and non-response error
(when respondents fail to respond), as well as poor data collection methods and bias.
We begin by specifying the problem setting that defines the PAC learning model, then consider the
questions of how many training examples and how much computation are required in order to learn
various classes of target functions within this PAC model.
PAC-learnability is largely determined by the number of training examples required by the learner. The
growth in the number of required training examples with problem size, called the sample complexity of
the learning problem
a general bound on the sample complexity for a very broad class of learners, called consistent learners.
A learner is consistent if it outputs hypotheses that perfectly fit the training data.
The learner L considers some set H of possible hypotheses when attempting to learn the target concept.
For example, H might be the set of all hypotheses describable by conjunctions of the attributes. After
observing a sequence of training examples of the target concept c, L must output some hypothesis h
from H, which is its estimate of c.
To be fair, we evaluate the success of L by the performance of h over new instances drawn randomly
from X according to D, the same probability distribution used to generate the training data.
Error of a Hypothesis
Figure 7.1 shows this definition of error in graphical form. The concepts c and h are depicted by the sets
of instances within X that they label as positive. The error of h with respect to c is the probability that a
randomly drawn instance will fall into the region where h and c disagree
Our aim is to characterize classes of target concepts that can be reliably learned from a reasonable
number of randomly drawn training examples and a reasonable amount of computation
First, unless we provide training examples corresponding to every possible instance in X (an unrealistic
assumption), there may be multiple hypotheses consistent with the provided training examples, and the
learner cannot be certain to pick the one corresponding to the target concept. Second, given that the
training examples are drawn randomly, there will always be some nonzero probability that the training
examples encountered by the learner will be misleading
To accommodate these two difficulties, we weaken our demands on the learner in two ways. First, we
will not require that the learner output a zero-error hypothesis; we will require only that its error be
bounded by some constant, ε, that can be made arbitrarily small. Second, we will not require that the
learner succeed for every sequence of randomly drawn training examples; we require only that its
probability of failure be bounded by some constant, δ, that can also be made arbitrarily small. In short,
we require only that the learner probably learn a hypothesis that is approximately correct.
The growth in the number of required training examples with problem size, called the sample
complexity of the learning problem
we present a general bound on the sample complexity for a very broad class of learners, called
consistent learners. A learner is consistent if it outputs hypotheses that perfectly fit the training
data
The significance of the version space here is that every consistent learner outputs a hypothesis
belonging to the version space, regardless of the instance space X, hypothesis space H, or
training data D. The reason is simply that by definition the version space VSH,D contains every
hypothesis in H that is consistent with D.
Sample Complexity Results for Infinite Hypothesis Spaces can be explained with concept of
shattering coefficient
The Shattering Coefficient: Let C be a concept class over an instance space X, i.e. a set of
functions from X to {0, 1} (where both C and X may be infinite).
The ability to shatter a set of instances is closely related to the inductive bias of a hypothesis
space.
Shatter or a shattered set in the case of a dataset, means points in the feature space can be
selected or separated from each other using hypotheses in the space such that the labels of
examples in the separate groups are correct
the mistake bound model of learning, in which the learner is evaluated by the total number of mistakes
it makes before it converges to the correct hypothesis. As in the PAC setting, we assume the learner
receives a sequence of training examples.
However, here we demand that upon receiving each example x, the learner must predict the target
value c(x), before it is shown the correct target value by the trainer. The question considered is "How
many mistakes will the learner make in its predictions before it learns the target concept?' This question
is significant in practical settings where learning must be done while the system is in actual use, rather
than during some off-line training stage.
For example, if the system is to learn to predict which credit card purchases should be approved and
which are fraudulent, based on data collected during use, then we are interested in minimizing the total
number of mistakes it will make before converging to the correct target function. Here the total number
of mistakes can be even more important than the total number of training examples.
This mistake bound learning problem may be studied in various specific settings. For example, we might
count the number of mistakes made before PAC learning the target concept. In the examples below, we
consider instead the number of mistakes made before learning the target concept exactly.
To illustrate, consider again the hypothesis space H consisting of conjunctions of up to n boolean literals
l1 , l2…ln, and their negations Recall the FIND-S algorithm , which incrementally computes the maximally
specific hypothesis consistent with the training examples. A straightforward implementation of FIND-S
for the hypothesis space H is as follow
FIND-S converges in the limit to a hypothesis that makes no errors, provided C ⊆ H and provided the
training data is noise-free. FIND-S begins with the most specific hypothesis (which classifies every
instance a negative example), then incrementally generalizes this hypothesis as needed to cover
observed positive training examples. For the hypothesis representation used here, this generalization
step consists of deleting unsatisfied literals.
Therefore, to calculate the number of mistakes it will make, we need only count the number of mistakes
it will make misclassifying truly positive examples as negative.
How many such mistakes can occur before FIND-S learns c exactly? Consider the first positive example
encountered by FIND-S. The learner will certainly make a mistake classifying this example, because its
initial hypothesis labels every instance negative. However, the result will be that half of the 2n terms
in its initial hypothesis will be eliminated, leaving only n terms. For each subsequent positive example
that is mistakenly classified by the current hypothesis, at least one more of the remaining n terms must
be eliminated from the hypothesis. Therefore, FIND-S will make at most n + 1 mistakes before learning c
exactly.
---------------------------------------------------------------------------------------------------------------------------------------
The Machine Learning systems which are categorized as instance-based learning are the
systems that learn the training examples by heart and then generalizes to new instances based
on some similarity measure. It is called instance-based because it builds the hypotheses from
the training instances. It is also known as memory-based learning or lazy-learning. The time
complexity of this algorithm depends upon the size of training data.
The most basic instance-based method is the K- Nearest Neighbor Learning. This
algorithm assumes all instances correspond to points in the n-dimensional space Rn.
The value f̂(xq) returned by this algorithm as its estimate of f(xq) is just the
most common value of f among the k training examples nearest to xq.
If k = 1, then the 1-Nearest Neighbor algorithm assigns to f̂(xq) the value f(xi),
where xi is the training instance nearest to xq.
For larger values of k, the algorithm assigns the most common value among the k
nearest training examples.
Below figure illustrates the operation of the k-Nearest Neighbor algorithm for the
case where the instances are points in a two-dimensional space and where the
target function is Boolean valued.
Below figure shows the shape of this decision surface induced by 1- Nearest
Neighbor over the entire instance space. The decision surface is a combination of
convex polyhedra surrounding each of the training examples.
For every training example, the polyhedron indicates the set of query points
whose classification will be completely determined by that training example.
Query points outside the polyhedron are closer to some other training
example. This kind of diagram is often called the Voronoi diagram of the set of
training example
Example of k-Nearest Neighbor:
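A minimal sketch of the k-Nearest Neighbor classifier described above; the toy data points and labels are illustrative assumptions:

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_query, k=3):
    """Assign x_query the most common label among its k nearest training examples (Euclidean distance)."""
    distances = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(distances)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

# toy 2-D data: points around (0,0) labelled '-', points around (3,3) labelled '+'
X = np.array([[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]], dtype=float)
y = np.array(['-', '-', '-', '+', '+', '+'])
print(knn_classify(X, y, np.array([2.5, 2.5]), k=3))   # '+'
```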
We can understand the concept of regression analysis using the below example:
Example: Suppose there is a marketing company A, who does various advertisement every year and get sales on that.
The below list shows the advertisement made by the company in the last 5 years and the corresponding sales:
Now, the company wants to do the advertisement of $200 in the year 2019 and wants to know the prediction
about the sales for this year. So to solve such type of prediction problems in machine learning, we need regression
analysis.
The linear regression model provides a sloped straight line representing the relationship between the variables.
Consider the below image:
y= a0+a1x+ ε
Y= Dependent Variable (Target Variable)
X= Independent Variable (predictor Variable)
a0= intercept of the line (Gives an additional degree of freedom)
a1 = Linear regression coefficient (scale factor to each input value).
ε = random error
The values for x and y variables are training datasets for Linear Regression model representation.
For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the average of the
squared errors between the predicted values and the actual values. It can be written as:
MSE = (1/N) Σi (yi − (a0 + a1 xi))²
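A small illustration of fitting y = a0 + a1·x by least squares and evaluating the MSE cost; the advertisement/sales figures below are made-up placeholders, not the company data from the example above:

```python
import numpy as np

# Fit y = a0 + a1*x by least squares and report the MSE cost (illustrative data).
x = np.array([90., 120., 150., 100., 130.])        # advertisement spend
y = np.array([1000., 1300., 1800., 1200., 1380.])  # sales

a1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope
a0 = y.mean() - a1 * x.mean()                                               # intercept
predictions = a0 + a1 * x
mse = np.mean((y - predictions) ** 2)               # Mean Squared Error cost
print(a0, a1, mse)
```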
The phrase "locally weighted regression" is called local because the function is
approximated based only on data near the query point, weighted because the
contribution of each training example is weighted by its distance from the
query point, and regression because this is the term used widely in the
statistical learning community for the problem of approximating real-valued
functions.
Given a new query instance xq, the general approach in locally weighted
regression is to construct an approximation f̂ that fits the training examples in
the neighborhood surrounding xq. This approximation is then used to calculate
the value f̂(xq), which is output as the estimated target value for the query
instance.
Consider locally weighted regression in which the target function f is approximated near xq using a
linear function of the form f̂(x) = w0 + w1 a1(x) + … + wn an(x), where ai(x) denotes the value of the
ith attribute of instance x.
Gradient descent methods are used to choose weights that minimize the squared error
summed over the set D of training examples.
We need to modify this procedure to derive a local approximation rather than a global one.
The simple way is to redefine the error criterion E so that it emphasizes fitting the training
examples near xq, for example by minimizing the squared error over just the k nearest neighbors of
xq, or by weighting each example's squared error by a kernel function K of its distance from xq.
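A sketch of locally weighted regression in the spirit of the description above, using a Gaussian distance kernel and a closed-form weighted least-squares fit instead of gradient descent; the data and the kernel width tau are illustrative assumptions:

```python
import numpy as np

def locally_weighted_prediction(X, y, x_query, tau=1.0):
    """Fit a linear model weighted by a Gaussian kernel around x_query and return f(x_query)."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])                     # add bias column
    xq = np.concatenate([[1.0], np.atleast_1d(x_query)])
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))  # kernel weights
    W = np.diag(w)
    theta = np.linalg.pinv(Xb.T @ W @ Xb) @ Xb.T @ W @ y              # weighted least squares
    return xq @ theta

rng = np.random.default_rng(2)
X = np.linspace(0, 6, 60).reshape(-1, 1)
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 60)
print(locally_weighted_prediction(X, y, np.array([3.0]), tau=0.5))   # roughly sin(3) ~ 0.14
```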
The radial basis function network uses radial basis functions as its activation functions. Like
other kinds of neural networks, radial basis function networks have input layers, hidden layers
and output layers. However, radial basis function networks often also include a nonlinear
activation function of some kind. Output weights can be trained using gradient descent.
2. Reuse- Suggesting a solution based on the experience and adapting it to meet the demands of the
new situation.
On the plus side, remembering past experiences helps learners avoid repeating previous mistakes, and
the reasoned can discern what features of a problem are significant and focus on them.
On the negative side, critics claim that the main premise of CBR is based on anecdotal evidence and that
adapting the elements of one case to another may be complex and potentially lead to inaccuracies
Eager learner: an eager learning method constructs its general, explicit approximation of the target
function while the training examples are being presented, rather than delaying generalization until a
new query is received (as lazy, instance-based methods do).
Genetic Algorithms(GAs) are adaptive heuristic search algorithms that belong to the larger part of
evolutionary algorithms. Genetic algorithms are based on the ideas of natural selection and genetics.
These are intelligent exploitation of random search provided with historical data to direct the search
into the region of better performance in solution space. They are commonly used to generate high-
quality solutions for optimization problems and search problems.
Genetic algorithms simulate the process of natural selection which means those species who can
adapt to changes in their environment are able to survive and reproduce and go to next generation. In
simple words, they simulate “survival of the fittest” among individual of consecutive generation for
solving a problem. Each generation consist of a population of individuals and each individual
represents a point in search space and possible solution. Each individual is represented as a string of
character/integer/float/bits. This string is analogous to the Chromosome.
• Initial population
• Fitness function
• Selection
• Crossover
• Mutation
• The process begins with a set of individuals which is called a Population. Each individual is a
solution to the problem you want to solve.
• An individual is characterized by a set of parameters (variables) known as Genes. Genes are
joined into a string to form a Chromosome (solution).
• In a genetic algorithm, the set of genes of an individual is represented using a string, in terms of
an alphabet. Usually, binary values are used (string of 1s and 0s). We say that we encode the
genes in a chromosome
• The fitness function determines how fit an individual is (the ability of an individual to compete
with other individuals). It gives a fitness score to each individual. The probability that an
individual will be selected for reproduction is based on its fitness score.
Selection
• The idea of selection phase is to select the fittest individuals and let them pass their genes to
the next generation.
• Two pairs of individuals (parents) are selected based on their fitness scores. Individuals with
high fitness have more chance to be selected for reproduction.
Crossover
• Crossover is the most significant phase in a genetic algorithm. For each pair of parents to be
mated, a crossover point is chosen at random from within the genes.
Mutation
• In certain new offspring formed, some of their genes can be subjected to a mutation with a low
random probability. This implies that some of the bits in the bit string can be flipped.
In certain new offspring formed, some of their genes can be subjected to a mutation with a low random
probability. This implies that some of the bits in the bit string can be flipped.
Termination
The algorithm terminates if the population has converged (does not produce offspring which are
significantly different from the previous generation). Then it is said that the genetic algorithm has
provided a set of solutions to our problem.
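A toy genetic algorithm illustrating the population, fitness, selection, crossover, and mutation phases described above; the "one-max" fitness function (count of 1 bits) and all parameter values are illustrative assumptions:

```python
import random

# Toy genetic algorithm: evolve a 20-bit chromosome toward all 1s ("one-max" fitness).
def fitness(chrom):
    return sum(chrom)                          # number of 1s in the bit string

def genetic_algorithm(pop_size=30, length=20, generations=60, mutation_rate=0.01):
    population = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: pick parents with probability proportional to fitness
        weights = [fitness(c) + 1 for c in population]
        parents = random.choices(population, weights=weights, k=pop_size)
        # Crossover: single random crossover point per pair of parents
        offspring = []
        for i in range(0, pop_size, 2):
            point = random.randint(1, length - 1)
            a, b = parents[i], parents[i + 1]
            offspring += [a[:point] + b[point:], b[:point] + a[point:]]
        # Mutation: flip each bit with a small probability
        population = [[1 - bit if random.random() < mutation_rate else bit for bit in c]
                      for c in offspring]
    return max(population, key=fitness)

random.seed(0)
print(genetic_algorithm())    # a chromosome with (almost) all 1s
```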
Application areas
Transport: Genetic algorithms are used in the traveling salesman problem to develop transport
plans that reduce the cost of travel and the time taken. They are also used to develop an efficient
way of delivering products.
DNA Analysis: They are used in DNA analysis to establish the DNA structure using
spectrometric information.
Multimodal Optimization: They are used to provide multiple optimum solutions in multimodal
optimization problems.
Aircraft Design: They are used to develop parametric aircraft designs. The parameters of the
aircraft are modified and upgraded to provide better designs.
Economics: They are used in economics to describe various models such as the game theory,
asset pricing, and schedule optimization.
Baldwin Effect:
Baldwin proposed that individual learning can explain evolutionary phenomena that appear to require
Lamarckian inheritance of acquired characteristics. The ability of individuals to learn can guide the
evolutionary process. In effect, learning smooths the fitness landscape, thus facilitating evolution.
These genetic algorithms do not depend on each other, as a result, they can run in parallel, taking
advantage of a multicore CPU. Each algorithm has its own set of individual, as a result these individuals
may differ from individuals of another algorithm, because they have different mutation/crossover history.
A parallel genetic algorithm may take a little more time than a non-parallel one; that is because it uses
several computation threads which, in turn, cause the operating system to perform context switching
more frequently.
Distributed genetic algorithm is actually a parallel genetic algorithm that has its independent algorithms
running on separate machines. Moreover, in this case each of these algorithms may be in turn a parallel
genetic algorithm! Distributed genetic algorithm also implements the 'island model', and each 'island' is
even more isolated from the others. If each machine runs a parallel genetic algorithm we may call this an
'archipelago model', because we have groups of islands. It actually does not matter what a single genetic
algorithm is, because a distributed genetic algorithm is about having multiple machines running
independent genetic algorithms in order to solve the same task. Image 2 illustrates this.
When we were discussing the parallel genetic algorithm we introduced the term 'crossover between
algorithms'. A distributed genetic algorithm enables us to perform crossover between separate machines.
In the case of a distributed genetic algorithm, we have a kind of 'master mind' that controls the overall
progress and coordinates these machines. It also controls crossover between machines, selecting how
machines will be paired together to perform crossover. In general, the process is the same as in the case
of a parallel genetic algorithm, except that individuals are moved over the network from one machine to
another.
Sequential Covering is a popular algorithm based on Rule-Based Classification used for learning a
disjunctive set of rules. The basic idea here is to learn one rule, remove the data that it covers, then
repeat the same process. In this way, it covers all the rules involved with it in a
sequential manner during the training phase.
The Sequential Learning algorithm takes care of to some extent, the low coverage problem in the
Learn-One-Rule algorithm covering all the rules in a sequential manner.
Step 3 – The rule becomes ‘desirable’ when it covers a majority of the positive examples.
Step 4 – When this rule is obtained, delete all the training data associated with that rule.
(i.e. when the rule is applied to the dataset, it covers most of the training data, and has to be removed)
Step 5 – The new rule is added to the bottom of decision list, ‘R’. (Fig.3)
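A rough sketch of sequential covering with a greedy learn-one-rule step, following the idea above (learn one rule, remove the data it covers, repeat); the attribute-value representation and the tiny weather dataset are illustrative assumptions, not the exact Learn-One-Rule algorithm:

```python
def learn_one_rule(examples, attributes):
    """Greedily grow a conjunction of attribute=value tests that covers mostly positive examples."""
    rule, covered = {}, examples
    while any(not e['label'] for e in covered):               # rule still covers negatives
        best = None
        for a in attributes:
            if a in rule:
                continue
            for v in {e[a] for e in covered}:
                subset = [e for e in covered if e[a] == v]
                precision = sum(e['label'] for e in subset) / len(subset)
                if best is None or precision > best[0]:
                    best = (precision, a, v, subset)
        if best is None or best[0] <= sum(e['label'] for e in covered) / len(covered):
            break                                             # no test improves the rule
        rule[best[1]], covered = best[2], best[3]
    return rule, covered

def sequential_covering(examples, attributes):
    """Learn a disjunctive set of rules: learn one rule, remove what it covers, repeat."""
    rules, remaining = [], list(examples)
    while any(e['label'] for e in remaining):
        rule, covered = learn_one_rule(remaining, attributes)
        if not rule:
            break
        rules.append(rule)
        remaining = [e for e in remaining if e not in covered]
    return rules

data = [{'outlook': 'sunny', 'wind': 'weak', 'label': 1},
        {'outlook': 'sunny', 'wind': 'strong', 'label': 1},
        {'outlook': 'rain', 'wind': 'weak', 'label': 0},
        {'outlook': 'overcast', 'wind': 'strong', 'label': 0}]
print(sequential_covering(data, ['outlook', 'wind']))   # [{'outlook': 'sunny'}]
```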
Propositional logic:
• Propositional logic (PL) is the simplest form of logic, where all the statements are made by
propositions. A proposition is a declarative statement which is either true or false.
Propositional logic is also called Boolean logic as it works on 0 and 1.
• In propositional logic, we use symbolic variables to represent the logic, and we can use any
symbol for representing a proposition, such as A, B, C, P, Q, R, etc.
• Propositions can be either true or false, but it cannot be both.
• Propositional logic consists of an object, relations or function, and logical connectives.
• These connectives are also called logical operators.
• The propositions and connectives are the basic elements of the propositional logic.
• Connectives can be said as a logical operator which connects two sentences.
• A proposition formula which is always true is called tautology, and it is also called a valid
sentence.
• A proposition formula which is always false is called Contradiction.
• A proposition formula which has both true and false values is called a contingency.
• Statements which are questions, commands, or opinions are not propositions such as "Where is
Rohini", "How are you", "What is your name", are not propositions.
• Propositional logic is a technique of knowledge representation in logical and mathematical form.
• Syntax of propositional logic:
• The syntax of propositional logic defines the allowable sentences for the knowledge
representation. There are two types of Propositions:
• Atomic Propositions
• Compound propositions
• Atomic Proposition: Atomic propositions are the simple propositions. It consists of a single
proposition symbol. These are the sentences which must be either true or false.
• Example:
• a) 2+2 is 4, it is an atomic proposition as it is a true fact.
• b) "The Sun is cold" is also a proposition as it is a false fact.
• Compound proposition: Compound propositions are constructed by combining simpler or
atomic propositions, using parenthesis and logical connectives.
• Example:
• a) "It is raining today, and street is wet."
• b) "Ankit is a doctor, and his clinic is in Mumbai."
• Limitations of Propositional logic:
• We cannot represent relations like ALL, some, or none with propositional logic. Example:
• All the girls are intelligent.
• Some apples are sweet.
• Propositional logic has limited expressive power.
• In propositional logic, we cannot describe statements in terms of their properties or logical
relationships.
First-Order Logic
A quantifier is a language element which generates quantification, and quantification specifies the
quantity of specimen in the universe of discourse.
These are the symbols that permit to determine or identify the range and scope of the variable in the
logical expression. There are two types of quantifier:
Universal Quantifier:
• Universal quantifier is a symbol of logical representation, which specifies that the statement within its
range is true for everything or every instance of a particular thing.
Existential Quantifier:
Existential quantifiers are the type of quantifiers, which express that the statement within its scope is true
for at least one instance of something.
It is denoted by the logical operator ∃, which resembles an inverted E. When it is used with a predicate
variable then it is called an existential quantifier.
If x is a variable, then the existential quantifier will be ∃x or ∃(x), and it will be read as
"There exists an x" or "For some values of x".
Example: ∃x: boy(x) ∧ intelligent(x)
It will be read as: There is some x where x is a boy who is intelligent.
A different approach to inductive logic programming is based on the simple observation that
induction is just the inverse of deduction.
INVERTING RESOLUTION :A general method for automated deduction is the resolution rule
introduced by Robinson (1965). The resolution rule is a sound and complete rule for deductive
inference in first-order logic.
Reinforcement learning addresses the question of how an autonomous agent that senses and acts in
its environment can learn to choose optimal actions to achieve its goals. In general, a reinforcement
learning agent is able to perceive and interpret its environment, take actions and learn through trial
and error.
Consider building a learning robot. The robot, or agent, has a set of sensors to observe the
state of its environment, and a set of actions it can perform to alter this state.
Its task is to learn a control strategy, or policy, for choosing actions that achieve its goals.
The goals of the agent can be defined by a reward function that assigns a numerical value to
each distinct action the agent may take from each distinct state.
This reward function may be built into the robot, or known only to an external teacher who
provides the reward value for each action performed by the robot.
The task of the robot is to perform sequences of actions, observe their consequences, and
learn a control policy.
The control policy is one that, from any initial state, chooses actions that maximize the reward
accumulated over time by the agent.
Example:
A mobile robot may have sensors such as a camera and sonars, and actions such as "move
forward" and "turn."
The robot may have a goal of docking onto its battery charger whenever its battery level is low.
The goal of docking to the battery charger can be captured by assigning a positive reward
(Eg., +100) to state-action transitions that immediately result in a connection to the charger
Reinforcement learning vs. supervised learning:
• RL works by interacting with the environment. Supervised learning works on an existing dataset.
• The RL algorithm works the way the human brain works when making some decisions. Supervised
learning works the way a human learns things under the supervision of a guide.
• No previous training is provided to the learning agent. Training is provided to the algorithm so
that it can predict the output.
• RL helps to take decisions sequentially. In supervised learning, decisions are made when the
input is given.
1. Positive –
Positive Reinforcement is defined as when an event, occurs due to a particular behavior, increases
the strength and the frequency of the behavior. In other words, it has a positive effect on behavior.
2. Negative –
Negative Reinforcement is defined as strengthening of behavior because a negative condition
is stopped or avoided.
From the above discussion, we can say that Reinforcement Learning is one of the most interesting and
useful parts of Machine learning. In RL, the agent explores the environment by exploring it without any
human intervention. It is the main learning algorithm that is used in Artificial Intelligence. But there are
some cases where it should not be used, such as if you have enough data to solve the problem, then other
ML algorithms can be used more efficiently. The main issue with the RL algorithm is that some of the
parameters may affect the speed of the learning, such as delayed feedback.
Q-learning is an off policy reinforcement learning algorithm that seeks to find the best action to take
given the current state. It’s considered off-policy because the q-learning function learns from actions
that are outside the current policy, like taking random actions, and therefore a policy isn’t needed.
More specifically, q-learning seeks to learn a policy that maximizes the total reward.
What’s ‘Q’?
The ‘q’ in q-learning stands for quality. Quality in this case represents how useful a given action is in
gaining some future reward.
Create a q-table
When q-learning is performed we create what’s called a q-table or matrix that follows the shape
of [state, action] and we initialize our values to zero. We then update and store our q-values after
an episode.
The next step is simply for the agent to interact with the environment and make updates to the state
action pairs in our q-table Q[state, action].
An agent interacts with the environment in 1 of 2 ways. The first is to use the q-table as a reference
and view all possible actions for a given state. The agent then selects the action based on the max
value of those actions. This is known as exploiting since we use the information we have available
to us to make a decision.
The second way to take action is to act randomly. This is called exploring. Instead of selecting
actions based on the max future reward we select an action at random. Acting randomly is
important because it allows the agent to explore and discover new states that otherwise may not be
selected during the exploitation process.
This update rule to estimate the value of Q is applied at every time step of the agent's interaction with
the environment:
Q(S, A) ← Q(S, A) + α [R + γ · max a' Q(S', a') − Q(S, A)]
The terms used are explained below:
A' : the next best action, picked using the current Q-value estimation, i.e. the action with the
maximum Q-value in the next state S'.
R : Current Reward observed from the environment in Response of current action.
GAMMA=(>0 and <=1) : Discounting Factor for Future Rewards. Future rewards are less valuable than
current rewards so they must be discounted. Since Q-value is an estimation of expected rewards from a
state, discounting rule applies here as well.
ALPHA : Step length taken to update the estimation of Q(S, A).
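A minimal tabular Q-learning sketch that applies this update rule with an epsilon-greedy mix of exploiting the q-table and exploring at random; the 5-state corridor environment and all parameter values are illustrative assumptions:

```python
import numpy as np

# Tabular Q-learning on a 5-state corridor: actions LEFT/RIGHT, reward +1 at the right end.
n_states, n_actions = 5, 2            # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))   # the q-table, initialized to zero
alpha, gamma, epsilon = 0.5, 0.9, 0.1

rng = np.random.default_rng(0)
for episode in range(200):
    s = 0
    while s != n_states - 1:                              # state 4 is terminal
        all_tied = np.all(Q[s] == Q[s].max())             # q-values still indistinguishable
        if rng.random() < epsilon or all_tied:
            a = int(rng.integers(n_actions))              # explore: act randomly
        else:
            a = int(np.argmax(Q[s]))                      # exploit: best known action
        s_next = max(s - 1, 0) if a == 0 else s + 1
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Q(S,A) <- Q(S,A) + alpha * (R + gamma * max_a' Q(S',a') - Q(S,A))
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(np.argmax(Q, axis=1))   # greedy policy: move right (action 1) in every non-terminal state
```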
It is a supervised learning process in which the training signal for a prediction is a future
prediction. TD algorithms are often used in reinforcement learning to predict a measure of the total
amount of reward expected over the future.
TD Learning focuses on predicting a variable's future value in a sequence of states. Temporal difference
learning was a major breakthrough in solving the problem of reward prediction. It
employs a mathematical trick that allows it to replace complicated reasoning with a simple learning
procedure that can be used to generate the very same results.
The trick is that rather than attempting to calculate the total future reward, temporal difference learning
just attempts to predict the combination of immediate reward and its own reward prediction at the next
moment in time. Now when the next moment comes and brings fresh information with it, the new
prediction is compared with the expected prediction. If these two predictions are different from each
other, the TD algorithm will calculate how different the predictions are from each other and make use of
this temporal difference to adjust the old prediction toward the new prediction.
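A small TD(0) prediction sketch in the same spirit: each state's prediction is adjusted toward the immediate reward plus the discounted prediction at the next moment; the corridor environment and the fixed move-right policy are illustrative assumptions:

```python
import numpy as np

# TD(0): learn V(s), the predicted future reward from each state, on a 5-state corridor,
# following a fixed policy that always moves right; reward +1 on reaching the right end.
n_states, alpha, gamma = 5, 0.1, 0.9
V = np.zeros(n_states)

for episode in range(500):
    s = 0
    while s != n_states - 1:
        s_next = s + 1
        r = 1.0 if s_next == n_states - 1 else 0.0
        # adjust the old prediction toward "immediate reward + discounted next prediction"
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next

print(np.round(V, 2))   # approximately [0.73, 0.81, 0.9, 1.0, 0.0]
```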
Dynamic Programming
Dynamic programming algorithms solve a category of problems called planning problems. Here, given
the complete model and specifications of the environment, we can successfully find an optimal policy
for the agent to follow. It contains two main steps:
• Break the complex problem into smaller sub-problems.
• Combine the solutions of the sub-problems to solve the overall problem.
Dynamic programming is thus a method for solving complex problems by breaking them down into
sub-problems whose solutions are then combined.
Analytical Learning-1
Inductive learning methods such as neural network and decision tree learning require a certain number of training
examples to achieve a given level of accuracy.
Analytical learning uses prior knowledge and deductive reasoning to augment the information provided by the
training examples. This chapter considers an analytical learning method called explanation-based learning (EBL)
In explanation-based learning, prior knowledge is used to analyze, or explain, how each observed training example
satisfies the target concept.
This explanation is then used to distinguish the relevant features of the training example from the irrelevant features
Explanation-based learning has been successfully applied to learning search control rules for a variety of planning
and scheduling tasks.
Inductive Learning methods: that is, methods that generalize from observed training examples by identifying
features that empirically distinguish positive from negative training examples.
Decision tree learning, neural network learning, inductive logic programming, and genetic algorithms are all
examples of inductive methods that operate in this fashion. The key practical limit on these inductive learners is that
they perform poorly when insufficient data is available.
Explanation-based learning is one such approach. It uses prior knowledge to analyze, or explain, each training
example in order to infer which example features are relevant to the target function and which are irrelevant. These
explanations enable it to generalize more accurately than inductive systems that rely on the data.
Explanation-based learning uses prior knowledge to reduce the complexity of the hypothesis space to be
searched, thereby reducing sample complexity and improving the generalization accuracy of the learner.
It is supported with evidence called a domain theory (prior knowledge that supports and justifies the
explanations of the training data).
Given this instance space, the task is to learn the target concept "pairs of physical objects, such that one can be
stacked safely on the other," denoted by the predicate SafeToStack(x,y). Learning this target concept might be
useful, for example, to a robot system that has the task of storing various physical objects within a limited
workspace. The full definition of this analytical learning task is given in
Table 11.1.
This section presents an algorithm called PROLOG-EBG (Kedar-Cabelli and McCarty 1987) that is representative
of several explanation-based learning algorithms.
PROLOG-EBG is a sequential covering algorithm. Prolog stands for "programming in logic"; it is a logic
programming language for artificial intelligence. An artificial intelligence program developed in Prolog
will examine the link between a fact (a true statement) and a rule (a conditional statement) in order to
answer a query, or reach an end objective.
PROLOG-EBG is guaranteed to output a hypothesis (set of rules) that is itself correct and that covers the observed
positive training examples. For any set of training examples, the hypothesis output by PROLOG-EBG constitutes a
set of logically sufficient conditions for the target concept, according to the domain theory.
A domain theory is said to be correct if each of its assertions is a truthful statement about the world.
A domain theory is said to be complete with respect to a given target concept and instance space, if the domain
theory covers every positive example in the instance space.
PROLOG-EBG computes the weakest preimage of the target concept with respect to the explanation, using a
general procedure called regression (Waldinger 1977). The regression procedure operates on a domain theory
represented by an arbitrary set of Horn clauses.
It works iteratively backward through the explanation, first computing the weakest preimage of the target concept
with respect to the final proof step in the explanation, then computing the weakest preimage of the resulting
expressions with respect to the preceding step, and so on.
The procedure terminates when it has iterated over all steps in the explanation, yielding the weakest precondition
of the target concept with respect to the literals at the leaf nodes of the explanation.
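As a rough Python sketch of a single regression step (assuming a simplified representation in which a literal is a tuple, a Horn clause is a (head, body) pair, and the unifying substitution for the proof step is supplied directly, so full unification is omitted):
def substitute(literal, theta):
    # Apply the substitution theta (a dict mapping variables to terms) to a literal.
    return tuple(theta.get(term, term) for term in literal)

def regress(frontier, rule, target, theta):
    # One regression step: replace the literal `target` in the frontier by the
    # body of `rule` under the unifying substitution `theta`. The result is the
    # weakest preimage of the frontier with respect to this proof step.
    head, body = rule
    assert substitute(head, theta) == substitute(target, theta)
    remaining = [lit for lit in frontier if lit != target]
    return remaining + [substitute(lit, theta) for lit in body]

# Example: regress SafeToStack(x, y) through the clause
#   SafeToStack(u, v) :- Lighter(u, v)
rule = (("SafeToStack", "?u", "?v"), [("Lighter", "?u", "?v")])
frontier = [("SafeToStack", "?x", "?y")]
theta = {"?u": "?x", "?v": "?y"}
print(regress(frontier, rule, ("SafeToStack", "?x", "?y"), theta))
# -> [('Lighter', '?x', '?y')], the weakest preimage with respect to this step
Iterating such steps backward through the whole explanation yields the weakest precondition described above.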
Example :
friends(raju, mahesh).
singer(sonu).
odd_number(5).
Explanation :
These facts can be interpreted as : raju and mahesh are friends, sonu is a singer, and 5 is an odd number.
Running queries :
Query : ?- odd_number(7).
Output : No.
This kind of knowledge reformulation is sometimes referred to as knowledge compilation, indicating that the
transformation is an efficiency-improving one.
PRODIGY and SOAR demonstrate that explanation-based learning methods can be successfully applied to acquire
search control knowledge in a variety of problem domains.
The PRODIGY architecture was initially conceived by Jaime Carbonell and Steven Minton, as an Artificial
Intelligence (AI) system to test and develop ideas on the role of machine learning in planning and problem solving.
In general, learning in problem solving seemed meaningless without measurable performance improvements.
Thus, PRODIGY was created to be a testbed for the systematic investigation of the loop between learning and
performance in planning systems.
As a result, PRODIGY consists of a core general-purpose planner and several learning modules that refine both the
planning domain knowledge and the control knowledge used to guide the search process effectively.
The Algorithm
Table 4 shows the basic procedure to learn quality-enhancing control knowledge, in the case that a
human expert provides a better plan. Steps 2, 3 and 4 correspond to the interactive plan checking
module, that asks the expert for a better solution and checks for its correctness. Step 6 constructs a
problem solving trace from the expert solution and obtains decision points where control knowledge is
needed, which in turn become learning opportunities. Step 8 corresponds to the actual learning phase. It
compares the plan trees obtained from the problem solving traces in Step 7, explains why one solution
was better than the other, and builds new control knowledge. These steps are described now in detail.
5. Compute the partial order identifying the goal dependencies between plan steps.
6. Construct a problem-solving trace corresponding to the expert's solution. This determines the set of
decision points in the problem-solving trace where control knowledge is missing.
7. Build the plan trees corresponding to the search trees for the expert's solution and the planner's
original solution.
8. Compare the two plan trees, explain why the expert's solution is better than the planner's, and build
control rules.
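The following Python skeleton is only a hedged sketch of this loop; every helper name used here (ask_expert, is_valid_plan, build_trace, plan_tree, explain_and_build_rule, and so on) is invented for illustration and is not part of PRODIGY's actual interface.
# Hypothetical skeleton of the quality-learning loop described above.
# All helper names are invented for illustration; none are PRODIGY's real API.

def learn_quality_control_knowledge(problem, planner_plan, ask_expert, domain):
    # Steps 2-4: interactive plan checking -- ask the expert for a better plan
    # and verify that it is a correct solution to the problem.
    expert_plan = ask_expert(problem, planner_plan)
    if expert_plan is None or not domain.is_valid_plan(problem, expert_plan):
        return []                                  # nothing to learn here

    # Step 5: partial order capturing goal dependencies between plan steps.
    order = domain.goal_dependencies(expert_plan)

    # Step 6: problem-solving traces; points where the planner would have
    # decided differently are the learning opportunities.
    expert_trace = domain.build_trace(problem, expert_plan, order)
    planner_trace = domain.build_trace(problem, planner_plan, order)
    decision_points = domain.divergences(expert_trace, planner_trace)

    # Step 7: plan trees (search trees) for the two solutions.
    expert_tree = domain.plan_tree(expert_trace)
    planner_tree = domain.plan_tree(planner_trace)

    # Step 8: compare the trees, explain why the expert plan is better at each
    # decision point, and turn each explanation into a control rule.
    return [domain.explain_and_build_rule(expert_tree, planner_tree, d)
            for d in decision_points]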
What is SOAR?
SOAR is a general cognitive architecture, developed by Laird, Newell and Rosenbloom, that supports a broad
variety of problem-solving strategies and has been applied in a great number of problem domains.
SOAR learns by explaining situations in which its current search strategy leads to inefficiencies or unexpected
outcomes, using a variant of explanation-based learning called chunking.
When SOAR solves a subgoal by searching, it summarizes the result of that search as a new rule, called a chunk,
together with the general conditions under which the same explanation applies. The chunk lets SOAR handle the
same kind of subgoal in a single step in the future, so the cost of the original search is not paid again.
In this way, like PRODIGY, SOAR uses explanation-based methods to acquire search control knowledge that
improves its problem-solving performance.
Combining Inductive and Analytical Learning
Some specific properties we would like a method that combines inductive and analytical learning to have:
1. Given no domain theory, it should learn at least as effectively as purely inductive methods.
2. Given a perfect domain theory, it should learn at least as effectively as purely analytical methods.
3. Given an imperfect domain theory and imperfect training data, it should combine the two to
outperform either purely inductive or purely analytical methods.
4. It should accommodate an unknown level of error in the training data.
5. It should accommodate an unknown level of error in the domain theory.
INDUCTIVE-ANALYTICAL APPROACHES TO LEARNING
Given:
• A set of training examples D, possibly containing errors
• A domain theory B, possibly containing errors
• A space of candidate hypotheses H
Determine:
• A hypothesis that best fits the training examples and the domain theory
To address this learning problem, we develop hypothesis space search methods that combine both
inductive and analytical approaches.
We explore three different ways of using prior knowledge to alter the search performed by purely
inductive methods:
1. USE PRIOR KNOWLEDGE TO DERIVE AN INITIAL HYPOTHESIS FROM WHICH TO BEGIN THE SEARCH.
2. USE PRIOR KNOWLEDGE TO ALTER THE OBJECTIVE OF THE HYPOTHESIS SPACE SEARCH.
3. USE PRIOR KNOWLEDGE TO ALTER THE AVAILABLE SEARCH STEPS (OPERATORS).
1. Use prior knowledge to derive an initial hypothesis from which to begin the search
In this approach the domain theory B is used to construct an initial hypothesis h0 that is consistent with
B. A standard inductive method is then applied, starting with this initial hypothesis h0.
This approach is used by the KBANN (Knowledge-Based Artificial Neural Network) algorithm to learn
artificial neural networks.
In KBANN an initial network is first constructed so that for every possible instance, the classification
assigned by the network is identical to that assigned by the domain theory.
However, if the initial hypothesis is found to imperfectly classify the training examples, then it will be
refined inductively to improve its fit to the training examples.
The KBANN algorithm exemplifies the initialize-the-hypothesis approach to using domain theories.
In doing so, it modifies the network weights using backpropagation as needed to overcome
inconsistencies between the domain theory and the observed data.
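A rough Python sketch of this initialize-the-hypothesis idea, using a made-up two-clause propositional domain theory (Cup :- Stable, Liftable and Liftable :- Graspable, Light); following the KBANN style, each clause becomes a sigmoid unit whose weights and bias are set so that it fires only when all of its antecedents are true:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

features = ["Stable", "Graspable", "Light"]      # network inputs
W = 8.0                                          # "large" weight for antecedent links

def clause_unit(antecedents):
    # One sigmoid unit per Horn clause: weight W on each antecedent feature,
    # 0 elsewhere, and a bias so the unit outputs ~1 only when all antecedents hold.
    w = np.array([W if f in antecedents else 0.0 for f in features])
    b = -W * (len(antecedents) - 0.5)
    return w, b

w_lift, b_lift = clause_unit(["Graspable", "Light"])     # hidden unit: Liftable
w_cup, b_cup = np.array([W, W]), -W * 1.5                # output unit: Cup(Stable, Liftable)

def classify(x):                                 # x = values for [Stable, Graspable, Light]
    liftable = sigmoid(np.dot(w_lift, x) + b_lift)
    return sigmoid(np.dot(w_cup, np.array([x[0], liftable])) + b_cup)

print(classify(np.array([1.0, 1.0, 1.0])))       # close to 1: domain theory says Cup
print(classify(np.array([1.0, 0.0, 1.0])))       # close to 0: Liftable clause fails
These initial weights simply reproduce the domain theory's classifications; backpropagation then adjusts them wherever the theory disagrees with the training examples.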
Limitations of KBANN include the fact that it can accommodate only propositional domain theories.
It is also possible for KBANN to be misled when given highly inaccurate domain theories, so that its
generalization accuracy can deteriorate below that of plain BACKPROPAGATION.
2. Use prior knowledge to alter the objective of the hypothesis space search
In this approach, the objective (error criterion) that the learner minimizes is changed so that the output
hypothesis must fit the domain theory as well as the training examples. For example, the EBNN
(Explanation-Based Neural Network) system described below learns neural networks in this way.
Whereas inductive learning of neural networks performs gradient descent search to minimize the
squared error of the network over the training data, EBNN performs gradient descent to optimize a
different criterion. This modified criterion includes an additional term that measures the error of the
learned network relative to the domain theory.
The TANGENTPROP Algorithm
TANGENTPROP (Simard et al. 1992) accommodates domain knowledge expressed as derivatives of the target
function with respect to transformations of its inputs. Consider a learning task involving an instance space X and
target function f.
The TANGENTPROP algorithm assumes various training derivatives of the target function are also
provided. For example, if each instance xi is described by a single real value, then each training example
may be of the form (xi, f(xi), ∂f/∂x|x=xi). Here ∂f/∂x|x=xi denotes the derivative of the target function f
with respect to x, evaluated at the point x = xi.
To develop an intuition for the benefits of providing training derivatives as well as training values during
learning, consider the simple learning task depicted in the figure discussed below.
The task is to learn the target function f shown in the leftmost plot of the figure, based on the three
training examples shown: (x1, f(x1)), (x2, f(x2)), and (x3, f(x3)).
Given these three training examples, the BACKPROPAGATION algorithm can be expected to hypothesize
a smooth function, such as the function g depicted in the middle plot of the figure. The rightmost plot
shows the effect of providing training derivatives, or slopes, as additional information for each training
example (e.g., (x1, f(x1), ∂f/∂x|x=x1)). By fitting both the training values f(xi) and these training
derivatives ∂f/∂x|x=xi, the learner has a better chance to generalize correctly from the sparse training
data.
To summarize, the impact of including the training derivatives is to override the usual syntactic inductive
bias of BACKPROPAGATION that favors a smooth interpolation between points, replacing it by explicit
input information about required derivatives. The resulting hypothesis h shown in the rightmost plot of
the figure provides a much more accurate estimate of the true target function f.
In the figure, f(x) is the target function and x1, x2, x3 are the training instances. The first plot shows the
instances and the true function; the middle plot shows the smooth hypothesis an inductive learner fits
from the values alone; and the rightmost plot shows the hypothesis obtained when the training
derivatives are also fitted, using TANGENTPROP.
TANGENTPROP considers the squared error between the specified training derivatives and the actual
derivatives of the learned neural network. For the single-input case above, the modified error function
has the form
E = Σi [ (f(xi) − f̂(xi))² + μ · (∂f/∂x|x=xi − ∂f̂/∂x|x=xi)² ]
where f̂ denotes the learned network and μ is a constant provided by the user to determine the relative
importance of fitting training values versus fitting training derivatives.
Notice the first term in this definition of E is the original squared error of the network versus training
values, and the second term is the squared error in the network versus training derivatives.
In the rightmost plot we can see that the resulting hypothesis fits the training instances properly while
maintaining accuracy on the underlying target function.
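A minimal numerical sketch of this modified objective in Python, assuming a single real-valued input, a toy network f_hat with fixed parameters, invented training data, and finite differences in place of the analytic network derivative that TANGENTPROP itself would use:
import numpy as np

def f_hat(x, w):
    # A toy one-hidden-unit network: w = (w1, b1, w2, b2).
    w1, b1, w2, b2 = w
    return w2 * np.tanh(w1 * x + b1) + b2

def tangentprop_error(w, xs, fs, dfs, mu=1.0, eps=1e-4):
    # Squared error on the training values plus mu times the squared error
    # on the training derivatives (network slope approximated numerically).
    value_err = sum((f - f_hat(x, w)) ** 2 for x, f in zip(xs, fs))
    deriv_err = sum((df - (f_hat(x + eps, w) - f_hat(x - eps, w)) / (2 * eps)) ** 2
                    for x, df in zip(xs, dfs))
    return value_err + mu * deriv_err

# Invented training data: values f(xi) and slopes df/dx at each xi.
xs, fs, dfs = [0.0, 1.0, 2.0], [0.0, 1.0, 0.0], [1.5, 0.0, -1.5]
print(tangentprop_error((1.0, 0.0, 1.0, 0.0), xs, fs, dfs, mu=1.0))
Gradient descent on this error, rather than on the value error alone, is what pulls the learned function toward the supplied slopes.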
An Illustrative Example
TANGENTPROP combines this prior knowledge with the observed training data by minimizing an objective
function that measures both the network's error with respect to the training example values (fitting the
data) and its error with respect to the desired derivatives (fitting the prior knowledge).
TANGENTPROP incorporates prior knowledge to influence the hypothesis search by altering the
objective function to be minimized by gradient descent. The set of hypotheses satisfying the
TANGENTPROP objective will be a subset of those satisfying the weaker BACKPROPAGATION objective.
The difference between these two sets of final hypotheses is the set of incorrect hypotheses that will be
considered by BACKPROPAGATION, but ruled out by TANGENTPROP due to its prior knowledge.
The EBNN (Explanation-Based Neural Network learning) algorithm (Mitchell and Thrun 1993a; Thrun
1996) builds on the TANGENTPROP algorithm in two significant ways.
First, instead of relying on the user to provide training derivatives, EBNN computes training derivatives
itself for each observed training example. These training derivatives are calculated by explaining each
training example in terms of a given domain theory, then extracting training derivatives from this
explanation. Second, EBNN accommodates imperfect domain theories: the extracted training derivatives
for an example are weighted according to how accurately the domain theory predicts that example's
training value, so that inaccurate portions of the theory have less influence on the learned network.
The inputs to EBNN include (1) a set of training examples of the form (xi, f (xi)) with no training
derivatives provided, and (2) a domain theory analogous to that used in explanation-based learning and
in KBANN, but represented instead by a set of previously trained neural networks. The output of EBNN is a new
neural network that approximates the target function f. This learned network is trained to fit both the
training examples (xi, f (xi)) and training derivatives of f extracted from the domain theory. Fitting the
training examples (xi, f (xi)) constitutes the inductive component of learning, whereas fitting the training
derivatives extracted from the domain theory provides the analytical component.
For each training example EBNN uses its domain theory to explain the example, then extracts training
derivatives from this explanation.
For each attribute of the instance, a training derivative is computed that describes how the target
function value is influenced by a small change to this attribute value, according to the domain theory.
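A rough Python sketch of this derivative-extraction step, with the domain theory standing in as a single hand-set network and finite differences used in place of the analytic derivatives EBNN actually extracts; the weights and the instance here are invented:
import numpy as np

def domain_theory_net(x):
    # Stand-in for a previously trained domain-theory network that maps an
    # instance (a feature vector) to a predicted target value.
    w = np.array([0.8, -0.3, 0.5])               # invented weights
    return float(np.tanh(np.dot(w, x)))

def training_derivatives(x, eps=1e-4):
    # For each attribute of instance x, estimate how the domain theory's
    # prediction changes with a small change in that attribute: these are
    # the training derivatives EBNN attaches to the example (xi, f(xi)).
    derivs = []
    for j in range(len(x)):
        x_plus, x_minus = x.copy(), x.copy()
        x_plus[j] += eps
        x_minus[j] -= eps
        derivs.append((domain_theory_net(x_plus) - domain_theory_net(x_minus)) / (2 * eps))
    return derivs

x_i = np.array([0.2, 1.0, -0.5])
print(training_derivatives(x_i))                 # one slope per attribute of x_i
The target network is then trained to fit both the observed values f(xi) and these extracted derivatives, for example with a TANGENTPROP-style objective.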
The two previous sections examined two different roles for prior knowledge in learning: initializing the
learner's hypothesis and altering the objective function that guides search through the hypothesis
space.
In this section we consider a third way of using prior knowledge to alter the hypothesis space search:
using it to alter the set of operators that define legal steps in the search through the hypothesis space.
This approach is followed by systems such as FOCL.
The First Order Combined Learner (FOCL) algorithm is an extension of the purely inductive FOIL
algorithm. It uses the domain theory to further improve the search for the best rule, which can greatly
improve accuracy.
First Order Inductive Learner (FOIL)
In machine learning, FOIL is a rule-based learning algorithm. It is a natural extension of the
SEQUENTIAL-COVERING and LEARN-ONE-RULE algorithms.
Like FOIL, FOCL performs an iterative process of learning a best rule to cover some of the
training examples and then removing all the training examples covered by that best rule (a
sequential covering algorithm).
However, what makes the FOCL algorithm more powerful is the approach it adopts while
searching for each best rule.
1. For each operational literal that is not part of h, create a specialization of h by adding this single literal
to the preconditions. This is also the method used by FOIL to generate candidate successors. The solid
arrows in Figure 12.8 denote this type of specialization.
2. Create an operational, logically sufficient condition for the target concept according to the domain
theory. Add this set of literals to the current preconditions of h.
Finally, prune the preconditions of h by removing any literals that are unnecessary according to the
training data.
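A simplified propositional sketch of these two kinds of candidate specializations in Python; the literal names, the domain-theory unfolding, and the pruning test below are all invented for illustration (real FOCL works over first-order Horn clauses and prunes using the training data):
# A rule is (head, frozenset_of_precondition_literals); literals are plain strings.

def foil_specializations(rule, operational_literals):
    # FOIL-style successors: add one operational literal not already present.
    head, pre = rule
    return [(head, pre | {lit}) for lit in operational_literals if lit not in pre]

def domain_theory_specialization(rule, sufficient_condition):
    # FOCL's extra operator: add an operational, logically sufficient condition
    # for the target concept, obtained by unfolding the domain theory.
    head, pre = rule
    return (head, pre | set(sufficient_condition))

def prune(rule, is_needed):
    # Drop precondition literals judged unnecessary (stands in for FOCL's
    # data-driven pruning step).
    head, pre = rule
    return (head, frozenset(lit for lit in pre if is_needed(lit)))

rule = ("Cup", frozenset())
operational = ["HasHandle", "Light", "FlatBottom"]
sufficient = ["Light", "FlatBottom", "UpwardConcavity"]       # invented unfolding

print(foil_specializations(rule, operational))
specialized = domain_theory_specialization(rule, sufficient)
print(prune(specialized, is_needed=lambda lit: lit != "UpwardConcavity"))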