Unit Wise Important Questions
AI Applications
Agents
An agent is anything that can be viewed as perceiving its environment through
sensors and acting upon that environment through actuators
Human agent: eyes, ears, and other organs for sensors; hands, legs, mouth, and other
body parts for actuators.
Robotic agent: cameras and infrared range finders are sensors; various motors,
hydraulic arms and legs are actuators.
Rational agents
• An agent should strive to "do the right thing", based on what it can perceive and the
actions it can perform. The right action is the one that will cause the agent to be most
successful
• Performance measure: An objective criterion for success of an agent's behavior
• E.g., performance measure of a vacuum-cleaner agent could be amount of dirt
cleaned up, amount of time taken, amount of electricity consumed, amount of noise
generated, etc.
• Rational Agent: For each possible percept sequence, a rational agent should select an
action that is expected to maximize its performance measure, given the evidence provided by
the percept sequence and whatever built-in knowledge the agent has.
• Rationality is distinct from omniscience (all-knowing with infinite knowledge)
PEAS
• PEAS: Performance measure, Environment, Actuators, Sensors
• To design an intelligent agent, we must first specify the setting (the task environment):
Performance measure
Environment
Actuators
Sensors
Agent: Self Driving Cars
•Consider, e.g., the task of designing an automated driverless car:
Performance measure: Safe, fast, legal, comfortable trip, maximize profits
Environment: Roads, other traffic vehicles, pedestrians, customers
Actuators: Steering wheel, accelerator, brake, signal, horn
Sensors: Cameras, sonar, speedometer, GPS, odometer, engine sensors, keyboard
Program
function REFLEX-VACUUM-AGENT([location, status]) returns an action
  if status = Dirty then return Suck
  else if location = A then return Right
  else if location = B then return Left
The vacuum agent program is very small. But some processing is done on the visual input to
establish the condition-action rule.
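The same condition-action logic can be written as a short runnable sketch. Below is a minimal Python version of the reflex vacuum agent (a hedged illustration; the two-location world with squares A and B and the action names follow the pseudocode above):

def reflex_vacuum_agent(percept):
    """Return an action for the two-location vacuum world.

    percept is a (location, status) pair, e.g. ('A', 'Dirty').
    """
    location, status = percept
    if status == 'Dirty':
        return 'Suck'
    elif location == 'A':
        return 'Right'
    elif location == 'B':
        return 'Left'

# Example percepts and the actions the agent chooses:
print(reflex_vacuum_agent(('A', 'Dirty')))   # Suck
print(reflex_vacuum_agent(('A', 'Clean')))   # Right
print(reflex_vacuum_agent(('B', 'Clean')))   # Left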
Program
function SIMPLE-REFLEX-AGENT(percept) returns an action
  static: rules, a set of condition-action rules
  state ← INTERPRET-INPUT(percept)
  rule ← RULE-MATCH(state, rules)
  action ← RULE-ACTION[rule]
  return action
State                  | Rule                    | Action
Vacuum_on(A, Clean)    | Turn_left(Vacuum on)    | Right
Vacuum_off(A, Dirty)   | Turn_Right(Vacuum on)   | Suck
Vacuum_on(B, Clean)    | Turn_left(Vacuum on)    | Left
Vacuum_off(B, Dirty)   | Turn_Right(Vacuum on)   |
                       | Turn_left(Vacuum off)   |
                       | Turn_Right(Vacuum off)  |
INTERPRET-INPUT: generates an abstracted description of the current state from the percept
RULE-MATCH: returns the first rule in the set of rules that matches the given state description.
This agent will work only if the correct decision can be made on the basis of only the current
percept. i.e. only if the environment is fully observable.
Model-based reflex agents
To handle partial observability, the agent should maintain some sort of internal state
that depends on the percept history and thereby reflects at least some of the unobserved
aspects of the current state.
Updating this internal state information requires two kinds of knowledge to be
encoded in the agent program.
o How the world evolves independently of the agent
o How the agent’s actions affect the world.
This knowledge can be implemented in simple Boolean circuits; such a representation is called a model of the
world. An agent that uses such a model is called a model-based agent.
The following figure shows the structure of the reflex agent with internal state,
showing how the current percept is combined with the old internal state to generate the
updated description of the current state.
The agent program is shown below:
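The figure itself is not reproduced here. The following is only a minimal Python sketch of the idea described above, assuming caller-supplied rules and an update_state function that plays the role of the model of the world:

class ModelBasedReflexAgent:
    """Keeps an internal state that is updated from the percept history."""

    def __init__(self, rules, update_state):
        self.rules = rules                 # condition-action rules
        self.update_state = update_state   # model of how the world evolves
        self.state = None                  # internal description of the world
        self.last_action = None

    def program(self, percept):
        # Combine the old internal state, the last action and the new percept
        # to obtain an updated description of the current state.
        self.state = self.update_state(self.state, self.last_action, percept)
        # Pick the first rule whose condition matches the current state.
        for condition, action in self.rules:
            if condition(self.state):
                self.last_action = action
                return action
        return None  # no rule matched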
Goal-based agents
Here, along with the current-state description, the agent needs some sort of goal
information that describes situations that are desirable – for example, being at the passenger's
destination. The goal-based agent structure is shown below:
Utility-based agents
Goals alone are not enough to generate high-quality behaviour in most environments.
A more general performance measure should allow a comparison of different world states
according to exactly how happy they would make the agent if they could be achieved.
A utility function maps a state onto a real number, which describes the associated
degree of happiness. The utility-based agent structure appears in the following figure.
Learning agents
It allows the agent to operate in initially unknown environments and to become more
competent than its initial knowledge alone might allow. A learning agent can be divided into
four conceptual components, as shown in figure:
Learning element: responsible for making improvements.
Performance element: responsible for selecting external actions.
The learning element uses feedback from the critic on how the agent is doing and determines
how the performance element should be modified to do better in the future.
The critic tells the learning element how well the agent is doing with respect to a fixed
performance standard. The critic is necessary because the percepts themselves provide no
indication of the agent’s success. The last component of the learning agent is the problem
generator. It is responsible for suggesting actions that will lead to new and informative
experiences.
PROBLEM-SOLVING AGENTS
A problem-solving agent is a goal-based agent that decides what to do by finding sequences
of actions that lead to desirable states.
As an example, consider a self-driving car agent in the city of Arad, Romania,
enjoying a touring holiday.
Goal formulation
based on the current situation and the agent’s performance measure, is the first step in
problem solving.
We will consider a goal to be a set of world states- exactly those states in which the
goal is satisfied.
Problem formulation
is the process of deciding what actions and states to consider, given a goal. Let us
assume that the agent will consider actions at the level of driving from one major
town to another.
Our agent has now adopted the goal of driving to Bucharest, and is considering where
to go from Arad. There are three roads out of Arad.
The agent will not know which of its possible actions is best, because it does not
know enough about the state that results from taking each action.
If the agent has a map, it provides the agent with information about the states it might
get itself into, and the actions it can take.
An agent with several immediate options of unknown value can decide what to do by
first examining different possible sequences of actions that lead to states of known
value, and then choosing the best sequence.
The process of looking for such a sequence is called a search.
A search algorithm takes a problem as input and returns a solution in the form of an
action sequence.
Once a solution is found, the actions it recommends can be carried out. This is called
the execution phase.
The design for such an agent is shown in the following function:
function SIMPLE-PROBLEM-SOLVING-AGENT(percept) returns an action
  static: seq, an action sequence, initially empty
          state, some description of the current world state
          goal, a goal, initially null
          problem, a problem formulation
  state ← UPDATE-STATE(state, percept)
  if seq is empty then do
    goal ← FORMULATE-GOAL(state)
    problem ← FORMULATE-PROBLEM(state, goal)
    seq ← SEARCH(problem)
  action ← FIRST(seq)
  seq ← REST(seq)
  return action
Example: Romania
A human lives in Romania and is currently in Arad.
His flight leaves tomorrow from Bucharest.
Goal formulation
Formulate goal: be in Bucharest
states: various cities
actions: drive between cities
Find solution:sequence of cities, e.g., Arad, Sibiu, Fagaras, Bucharest
Problem Formulation
A problem is defined by four items:
initial state e.g., "at Arad"
actions or successor function S(x) = set of action–state pairs
e.g., S(Arad) = {⟨Arad → Zerind, Zerind⟩, ⟨Arad → Sibiu, Sibiu⟩, ⟨Arad → Timisoara, Timisoara⟩}
goal test, can be
explicit, e.g., x = "at Bucharest"
implicit, e.g., Checkmate(x)
path cost (additive)
e.g., sum of distances, number of actions executed, etc.
c(x, a, y) is the step cost, assumed to be ≥ 0
solution is a sequence of actions leading from the initial state to a goal state
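As a hedged illustration, the four components above can be written down directly in Python for a fragment of the Romania map (the road distances used are the usual textbook values for these links):

# A fragment of the Romania road map: actions(state) and step costs.
roads = {
    'Arad':           {'Zerind': 75, 'Sibiu': 140, 'Timisoara': 118},
    'Sibiu':          {'Arad': 140, 'Fagaras': 99, 'Rimnicu Vilcea': 80, 'Oradea': 151},
    'Fagaras':        {'Sibiu': 99, 'Bucharest': 211},
    'Rimnicu Vilcea': {'Sibiu': 80, 'Pitesti': 97},
    'Pitesti':        {'Rimnicu Vilcea': 97, 'Bucharest': 101},
}

initial_state = 'Arad'

def actions(state):
    """Successor function S(x): the cities reachable by one drive."""
    return list(roads.get(state, {}))

def goal_test(state):
    return state == 'Bucharest'

def step_cost(state, action):
    """c(x, a, y): road distance in km, assumed to be >= 0."""
    return roads[state][action]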
Selecting a state space
Real world is absurdly complex
state space must be abstracted for problem solving
(Abstract) state = set of real states
(Abstract) action = complex combination of real actions
e.g., "Arad Zerind" represents a complex set of possible routes, detours,
rest stops, etc.
For guaranteed realizability, any real state "in Arad“ must get to some real state "in
Zerind"
(Abstract) solution = set of real paths that are solutions in the real world.
Each abstract action should be "easier" than the original problem.
Types of Problem Solving Agents Based on States
Single-state problem
Complete world-state knowledge; complete action knowledge. The agent always knows its world state.
Goal formulation: world states with certain properties.
Definition of the state space.
Definition of the actions that can change the world state.
Definition of the problem type, which depends on the knowledge of the world states and actions (states in the search space).
Specification of the search costs (offline costs) and the execution costs (path costs, online costs).
• States: The state is determined by both the agent location and the dirt locations. The
agent is in one of two locations, each of which might or might not contain dirt. Thus,
there are 2 × 2^2 = 8 possible world states. A larger environment with n locations has
n · 2^n states.
•Initial state: Any state can be designated as the initial state.
•Actions: In this simple environment, each state has just three actions: Left, Right, and
Suck. Larger environments might also include Up and Down.
•Transition model: The actions have their expected effects, except that moving Left in
the leftmost square, moving Right in the rightmost square, and Sucking in a clean square
have no effect.
•Goal test: This checks whether all the squares are clean.
•Path cost: Each step costs 1, so the path cost is the number of steps in the path.
If the environment is completely accessible, the vacuum cleaner always knows where it is
and where the dirt is. The solution then reduces to searching for a path from the initial state
to the goal state.
Multi State Problem
Search algorithms:
Solving the formulated problem can be done by a search through the state space.
One search technique builds an explicit search tree that is generated by the initial state
and the successor function that together define the state space.
From the initial state, successor states are produced step by step, forming the search tree.
Problem Formulation
State: state in the state space
Parent-Node: Predecessor nodes
Action: The operator that generated the node
Depth: number of steps along the path from the initial state
Path Cost: Cost of the path from the initial state to the node
Operations on a queue:
Make-Queue(Elements): Creates a queue
Empty?(Queue): Empty test
First(Queue): Returns the first element of the queue
Remove-First(Queue): Removes and returns the first element of the queue
Insert(Element, Queue): Inserts new elements into the queue
Insert-All(Elements, Queue): Inserts a set of elements into the queue
STATE vs NODE
A state is a (representation of) a physical configuration
A node is a data structure constituting part of a search tree includes state, parent node,
action, path cost g(x), depth
The node data structure is depicted in the following figure:
Breadth-first search
Breadth first search is a simple strategy in which the root node is expanded first, then
all the successors of the root node are expanded next, then their successors, and so on. All the
nodes are expanded at a given depth in the search tree before any nodes at the next level are
expanded.
Algorithm:
1. Place the starting node s on the queue.
2. If the queue is empty, return failure and stop.
3. If the first element on the queue is a goal node g, return success and
stop. Otherwise,
4. Remove and expand the first element from the queue and place all the children at the
end of the queue in any order.
5. Return to step 2.
By calling TREE-SEARCH with an empty fringe
fringe is a FIFO queue, i.e., new successors go at end
The following figure shows the progress of the search on a simple binary tree.
Figure: Breadth first search on a simple binary tree. At each state, the node to be
expanded next is indicated by a marker.
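A minimal Python sketch of breadth-first search, assuming the graph is given as a successors function (for instance the roads dictionary from the earlier Romania sketch). Unlike the numbered steps above, this version applies the goal test when a child is generated rather than when it is removed from the queue, which is a common optimization:

from collections import deque

def breadth_first_search(start, goal_test, successors):
    """Return a list of states from start to a goal, or None on failure."""
    if goal_test(start):
        return [start]
    frontier = deque([[start]])          # FIFO queue of paths
    explored = set()
    while frontier:
        path = frontier.popleft()         # remove the first element
        state = path[-1]
        if state in explored:
            continue
        explored.add(state)
        for child in successors(state):   # expand: children go to the end
            if goal_test(child):
                return path + [child]
            frontier.append(path + [child])
    return None                           # queue empty -> failure

# Example (using the Romania roads dictionary defined earlier):
# breadth_first_search('Arad', lambda s: s == 'Bucharest',
#                      lambda s: roads.get(s, {}))
# Replacing the FIFO queue with a LIFO stack gives depth-first search.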
Properties of breadth-first search
Complete? Yes (if b is finite)
If the shallowest goal node is at some finite depth d, BFS will eventually find it after
expanding all shallower nodes (b is a branching factor)
Time? 1 + b + b^2 + b^3 + … + b^d + b(b^d − 1) = O(b^(d+1))
Space? O(b^(d+1)) (keeps every node in memory)
We consider a hypothetical state space where every state has b successors. The root
of the search tree generates b nodes at the first level, each of which generates b more nodes,
for a total of b2 at the second level, and so on. Now suppose that the solution is at depth d.
Optimal? Yes (if cost = 1 per step)
BFS is optimal if the path cost is a nondecreasing function of the depth of the node.
Space is the bigger problem (more than time)
Depth-first search
Always expands an unexpanded node at the greatest depth
fringe = LIFO queue, i.e., put successors at front
Algorithm:
1. Place the starting node s on the queue.
2. If the queue is empty, return failure and stop.
3. If the first element on the queue is a goal node g, return success and
stop. Otherwise,
4. Remove and expand the first element, and place the children at the front of the queue
(in any order).
5. Return to step 2.
The progress of the search is illustrated in the following figure:
DFS on a binary tree. Nodes that have been expanded and have no descendants in the
fringe can be removed from memory; these are shown in black. Nodes at depth 3 are assumed
to have no successors and M is the only goal node.
Properties
Complete? No: fails in infinite-depth spaces and in spaces with loops. Modified to avoid
repeated states along the path, it is complete in finite spaces.
Time? O(b^m): terrible if m is much larger than d, but if solutions are dense, it may be
much faster than breadth-first search.
Space? O(bm), i.e., linear space!
Optimal? No
Uniform-cost search
BFS is optimal when all step costs are equal, because it always expands the
shallowest unexpanded node. Instead of expanding the shallowest node, Uniform-cost search
expands the node n with the lowest path cost.
fringe = queue ordered by path cost Equivalent to breadth-first if step costs all equal
Complete? Yes, if step cost ≥ ε
Time? # of nodes with g ≤ cost of optimal solution, O(b^⌈C*/ε⌉), where C* is the
cost of the optimal solution
Space? # of nodes with g ≤ cost of optimal solution, O(b^⌈C*/ε⌉)
Optimal? Yes – nodes expanded in increasing order of g(n)
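A hedged sketch of uniform-cost search, using Python's heapq module as the priority queue ordered by path cost g(n); the successors function is assumed to yield (child, step cost) pairs:

import heapq

def uniform_cost_search(start, goal_test, successors):
    """Expand the node with the lowest path cost g(n) first."""
    frontier = [(0, start, [start])]        # (g, state, path), ordered by g
    best_g = {start: 0}
    while frontier:
        g, state, path = heapq.heappop(frontier)
        if goal_test(state):
            return path, g                  # nodes are expanded in order of g
        for child, cost in successors(state):
            new_g = g + cost
            if new_g < best_g.get(child, float('inf')):
                best_g[child] = new_g
                heapq.heappush(frontier, (new_g, child, path + [child]))
    return None, float('inf')

# Example with the Romania roads dictionary:
# uniform_cost_search('Arad', lambda s: s == 'Bucharest',
#                     lambda s: roads.get(s, {}).items())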
Depth-limited search
The problem of unbounded trees can be alleviated by supplying DFS with a
predetermined depth limit.
= depth-first search with depth limit l,
i.e., nodes at depth l have no successors
Depth-limited search will also be nonoptimal if we choose l < d. Its time complexity is
O(b^l) and its space complexity is O(bl).
Depth-limited search can terminate with two kinds of failure: the standard failure
value indicates no solution; the cutoff value indicates no solution within the depth
limit.
Iterative deepening combines the benefits of DFS and BFS. Like DFS, its memory
requirements are very modest:O(bd).
Like BFS, it is complete when the branching factor is finite and optimal when the
path cost is a nondecreasing function of the depth of the node.
The following figure shows four iterations of ITERATIVE-DEEPENING SEARCH
on a binary search tree, where the solution is found on the fourth iteration.
Number of nodes generated in an iterative deepening search to depth d with branching factor b:
N_IDS = (d+1)b^0 + d·b^1 + (d−1)·b^2 + … + 3b^(d−2) + 2b^(d−1) + 1·b^d
Properties
Complete? Yes
Time? (d+1)b^0 + d·b^1 + (d−1)·b^2 + … + b^d = O(b^d)
Space? O(bd)
Optimal? Yes, if step cost = 1
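A minimal sketch of depth-limited search wrapped in iterative deepening; the cutoff value distinguishes "no solution within the limit" from plain failure, as described above:

CUTOFF, FAILURE = 'cutoff', 'failure'

def depth_limited_search(state, goal_test, successors, limit):
    if goal_test(state):
        return [state]
    if limit == 0:
        return CUTOFF                      # no solution within the depth limit
    cutoff_occurred = False
    for child in successors(state):
        result = depth_limited_search(child, goal_test, successors, limit - 1)
        if result == CUTOFF:
            cutoff_occurred = True
        elif result != FAILURE:
            return [state] + result
    return CUTOFF if cutoff_occurred else FAILURE

def iterative_deepening_search(start, goal_test, successors, max_depth=50):
    for limit in range(max_depth + 1):     # limits 0, 1, 2, ... like BFS levels
        result = depth_limited_search(start, goal_test, successors, limit)
        if result != CUTOFF:
            return result                  # either a path or FAILURE
    return FAILURE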
Bidirectional Search
The idea behind bi-directional search is to run two simultaneous searches – one
forward from the initial state and the other backward from the goal, stopping when the two
searches meet in the middle.
Bidirectional search is implemented by having one or both of the searches check each
node before it is expanded to see if it is in the fringe of the other search tree; if so, a solution
has been found. Checking a node for membership in the other search tree can be done in
constant time with a hash table,
so the time complexity of bidirectional search is O(b^(d/2)).
At least one of the search trees must be kept in memory so that the membership
check can be done; hence the space complexity is also O(b^(d/2)), which is the weakness of
the algorithm.
The algorithm is complete and optimal if both searches are breadth-first;
INFORMED SEARCH ALGORITHMS
Informed search strategy is the one that uses problem-specific knowledge beyond the
definition of the problem itself.
Best-first search
Greedy best-first search
A* search
Heuristics
Local search algorithms
Hill-climbing search
Simulated annealing search
Local beam search
Best-first search
Best first search is an instance of the general TREE-SEARCH or GRAPH-SEARCH
algorithm in which a node is selected for expansion based on an evaluation function f(n).
The node with the lowest evaluation is selected for expansion, because the evaluation
measures distance to the goal. It can be implemented using a priority queue, a
data structure that will maintain the fringe in ascending order of f – values.
Algorithm:
1. Place the starting node s on the queue.
2. If the queue is empty, return failure and stop.
3. If the first element on the queue is a goal node g, return success and stop. Otherwise,
4. Remove the first element from the queue, expand it and compute the estimated goal
distances for each child. Place the children on the queue(at either end) and arrange
all
queue elements in ascending order corresponding to goal distance from the front of the
queue.
5. Return to step 2.
Best-first search uses different evaluation functions. A key component of these algorithms
is a heuristic function, denoted h(n):
h(n) = estimated cost of the cheapest path from node n to a goal node.
For example, in Romania, one might estimate the cost of the cheapest path from
Arad to Bucharest via the straight-line distance from Arad to Bucharest which is shown
below:
Romania with step costs in km
Optimality of A* (proof)
Suppose some suboptimal goal G2 has been generated and is in the fringe.
Let n be an unexpanded node in the fringe such that n is on a shortest path to
an optimal goal G.
Optimality of A*
A* expands nodes in order of increasing f value
Gradually adds "f-contours" of nodes
Contour i has all nodes with f=fi, where fi < fi+1
Properties
• Complete? Yes (unless there are infinitely many nodes with f <= f(G) )
• Time? Exponential
• Space? Keeps all nodes in memory
• Optimal? Yes
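A hedged Python sketch of A*, which expands nodes in order of f(n) = g(n) + h(n); the successors and heuristic functions are supplied by the caller, and the straight-line distances shown in the usage comment are the usual textbook values:

import heapq

def a_star_search(start, goal_test, successors, h):
    """Expand nodes in order of f(n) = g(n) + h(n)."""
    frontier = [(h(start), 0, start, [start])]     # (f, g, state, path)
    best_g = {start: 0}
    while frontier:
        f, g, state, path = heapq.heappop(frontier)
        if goal_test(state):
            return path, g
        for child, cost in successors(state):
            new_g = g + cost
            if new_g < best_g.get(child, float('inf')):
                best_g[child] = new_g
                heapq.heappush(frontier,
                               (new_g + h(child), new_g, child, path + [child]))
    return None, float('inf')

# Example: straight-line distances to Bucharest (textbook values) as h(n).
# sld = {'Arad': 366, 'Sibiu': 253, 'Fagaras': 176, 'Rimnicu Vilcea': 193,
#        'Pitesti': 100, 'Bucharest': 0}
# a_star_search('Arad', lambda s: s == 'Bucharest',
#               lambda s: roads.get(s, {}).items(), lambda s: sld.get(s, 0))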
HEURISTIC FUNCTION
A good heuristic function is judged by how efficiently it guides the search: the more
information the heuristic encodes about the problem, the more time it takes to compute,
but the fewer states the search typically has to examine.
Some toy problems, such as 8-puzzle, 8-queen, tic-tac-toe, etc., can be solved
more efficiently with the help of a heuristic function.
Consider the following 8-puzzle problem where we have a start state and a goal
state.
Our task is to slide the tiles of the current/start state and place it in an order
followed in the goal state.
There can be four moves either left, right, up, or down. There can be several
ways to convert the current/start state to the goal state, but, we can use a
heuristic function h(n) to solve the problem more efficiently.
E.g., for the 8-puzzle, two common heuristics are h1(n) = the number of misplaced tiles,
and h2(n) = the total Manhattan distance, i.e., the sum of the horizontal and vertical
distances of the tiles from their goal positions.
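Both heuristics can be computed directly. The sketch below assumes a state is represented as a tuple of nine entries read row by row, with 0 standing for the blank, and an assumed goal layout:

GOAL = (1, 2, 3, 4, 5, 6, 7, 8, 0)   # assumed goal layout, blank last

def h_misplaced(state, goal=GOAL):
    """h1(n): number of tiles not in their goal position (blank excluded)."""
    return sum(1 for s, g in zip(state, goal) if s != 0 and s != g)

def h_manhattan(state, goal=GOAL):
    """h2(n): sum of horizontal + vertical distances of each tile from its goal."""
    total = 0
    for idx, tile in enumerate(state):
        if tile == 0:
            continue
        goal_idx = goal.index(tile)
        total += abs(idx // 3 - goal_idx // 3) + abs(idx % 3 - goal_idx % 3)
    return total

start = (7, 2, 4, 5, 0, 6, 8, 3, 1)
print(h_misplaced(start), h_manhattan(start))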
8-queens problem
The aim of this problem is to place eight queens on a chessboard so
that no queen may attack another. A queen can attack other queens either
diagonally or in the same row and column.
Figure: A chessboard showing queens Q1, Q2, Q3, Q4, Q5, … placed so that no two queens attack each other.
It is noticed from the above figure that each queen is placed on the chessboard in a position
where no other queen lies on the same diagonal, row, or column. Therefore, it is one correct
arrangement for the 8-queens problem.
For this problem, there are two main kinds of formulation:
Incremental formulation: It starts from an empty state where the operator augments a
queen at each step.
States: Arrangement of any 0 to 8 queens on the chessboard.
Initial State: An empty chessboard
Actions: Add a queen to any empty box.
Transition model: Returns the chessboard with the queen added in a box.
Goal test: Checks whether 8-queens are placed on the chessboard without any
attack.
Path cost: There is no need for path cost because only final states are counted.
Complete-state formulation: It starts with all the 8-queens on the chessboard and
moves them around, saving from the attacks.
States: Arrangement of all the 8 queens one per column with no queen
attacking the other queen.
Actions: Move the queen at the location where it is safe from the attacks.
This formulation is better than the incremental formulation as it reduces the state
space from 1.8 × 10^14 to 2057, and it is easy to find the solutions.
Some Real-world problems.
Properties
Let's understand the working of a local search algorithm with the help of an example.
Consider the below state-space landscape having both:
The local search algorithm (LSA) explores the above landscape by finding the following two points:
Global Minimum: If the elevation corresponds to the cost, then the task is
to find the lowest valley, which is known as Global Minimum.
Global Maxima: If the elevation corresponds to an objective function, then it
finds the highest peak, which is called the Global Maxima. It is the highest
point in the landscape.
We will understand the working of these points better in Hill-climbing search.
Below are some different types of local searches:
Hill-climbing Search
Simulated Annealing
Local Beam Search
Global Maximum: It is the highest point on the hill, which is the goal state.
Local Maximum: It is a peak higher than its neighbouring states but lower than the global
maximum.
Flat local maximum: It is a flat area on the hill from which no uphill or downhill move
exists. It is a saturated point of the hill.
Shoulder: It is also a flat area, but one from which an uphill move towards the summit is still possible.
Current state: It is the current position of the agent.
Types of Hill climbing search algorithm
There are following types of hill-climbing search:
Simple hill climbing
Steepest-ascent hill climbing
Stochastic hill climbing
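A minimal sketch of steepest-ascent hill climbing over an arbitrary objective function (the objective and neighbours functions are assumptions supplied by the caller); simple and stochastic variants differ only in how the next move is chosen:

def hill_climbing(start, objective, neighbours):
    """Steepest-ascent hill climbing: always move to the best neighbour."""
    current = start
    while True:
        best = max(neighbours(current), key=objective, default=None)
        if best is None or objective(best) <= objective(current):
            return current            # local maximum (or plateau) reached
        current = best

# Toy example: maximize f(x) = -(x - 3)**2 over integer states.
f = lambda x: -(x - 3) ** 2
print(hill_climbing(0, f, lambda x: [x - 1, x + 1]))   # climbs toward x = 3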
Advantages
Local search algorithms use a very little or constant amount of memory as
they operate only on a single path.
Most often, they find a reasonable solution in large or infinite state spaces
where the classical or systematic algorithms do not work.
Drawbacks
Local Maxima: It is a peak of the mountain that is higher than all its
neighbouring states but lower than the global maximum. It is not the goal
peak because there is another peak higher than it.
Plateau: It is a flat surface area where no uphill move exists. It becomes difficult for
the climber to decide in which direction to move to reach the goal point; sometimes
the agent gets lost in the flat area.
Ridges: It is a challenging situation in which the agent repeatedly finds two or more local
maxima of the same height. It becomes difficult to navigate to the right point, and
the search can get stuck there.
Simulated Annealing
Simulated annealing is similar to the hill-climbing algorithm. It works on the
current state, but it picks a random move instead of the best move. If the
move improves the current state, it is always accepted as a step
towards the solution state; otherwise the move is accepted with a probability less than 1.
It is also applied for factory scheduling and other large optimization tasks.
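A hedged sketch of the idea: a random move is always accepted if it improves the objective, and otherwise accepted with probability below 1 that shrinks as the temperature is lowered. The exponential acceptance rule and geometric cooling schedule used here are common choices, not the only ones:

import math
import random

def simulated_annealing(start, objective, random_neighbour,
                        temperature=1.0, cooling=0.995, steps=10_000):
    current = start
    for _ in range(steps):
        candidate = random_neighbour(current)        # pick a random move
        delta = objective(candidate) - objective(current)
        if delta > 0 or random.random() < math.exp(delta / temperature):
            current = candidate                      # accept (possibly downhill)
        temperature *= cooling                       # cool down gradually
    return current

# Toy example: maximize f(x) = -(x - 3)**2 with real-valued moves.
f = lambda x: -(x - 3) ** 2
result = simulated_annealing(0.0, f, lambda x: x + random.uniform(-1, 1))
print(round(result, 2))   # close to 3 with high probability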
OPTIMIZATION PROBLEMS
Genetic Algorithms
Genetic Algorithms(GAs) are adaptive heuristic search algorithms that belong
to the larger part of evolutionary algorithms. Genetic algorithms are based on the ideas
of natural selection and genetics. These are intelligent exploitation of random search
provided with historical data to direct the search into the region of better performance
in solution space. They are commonly used to generate high-quality solutions for
optimization problems and search problems.
Genetic algorithms simulate the process of natural selection which means
those species who can adapt to changes in their environment are able to survive and
reproduce and go to next generation. In simple words, they simulate “survival of the
fittest” among individual of consecutive generation for solving a problem. Each
generation consist of a population of individuals and each individual represents a
point in search space and possible solution. Each individual is represented as a string
of character/integer/float/bits. This string is analogous to the Chromosome.
Foundation of Genetic Algorithms
Genetic algorithms are based on an analogy with genetic structure and
behaviour of chromosomes of the population. Following is the foundation of GAs
based on this analogy –
1. Individuals in a population compete for resources and mates.
2. Those individuals who are most successful (fittest) mate and create more
offspring than others.
3. Genes from the "fittest" parents propagate through the generations; sometimes
parents create offspring that are better than either parent.
4. Thus each successive generation is better suited to its environment.
Search space
The population of individuals are maintained within search space. Each
individual represents a solution in search space for given problem. Each individual is
coded as a finite length vector (analogous to chromosome) of components. These
variable components are analogous to Genes. Thus a chromosome (individual) is
composed of several genes (variable components).
Fitness Score
A Fitness Score is given to each individual which shows the ability of an
individual to “compete”. The individual having optimal fitness score (or near
optimal) are sought.
The GA maintains a population of n individuals (chromosomes/solutions)
along with their fitness scores. Individuals with better fitness scores are given
more chances to reproduce than others: they are selected to mate and produce
better offspring by combining the chromosomes of the parents. The population size
is static, so room has to be created for new arrivals. Thus some individuals die and
get replaced by new arrivals, eventually creating a new generation once all the
mating opportunities of the old population are exhausted. It is hoped that over
successive generations better solutions will arrive while the least fit die out.
Each new generation has, on average, more "good genes" than the individuals
(solutions) of previous generations; thus each new generation has better "partial
solutions" than previous generations. Once the offspring produced show no
significant difference from the offspring produced by previous populations, the
population has converged, and the algorithm is said to have converged to a set of
solutions for the problem.
1) Selection Operator: The idea is to give preference to the individuals with good
fitness scores and allow them to pass their genes to successive generations.
Characters A–Z, a–z, 0–9, and other special symbols are considered genes.
A string generated from these characters is considered a
chromosome/solution/individual.
The fitness score is the number of characters that differ from the characters of the target
string at each index, so an individual with a lower fitness value is given more
preference.
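A compact sketch of this string-evolution formulation; the target string, population size, mutation rate, and selection scheme are illustrative assumptions:

import random
import string

TARGET = "HELLO WORLD"                         # illustrative target string
GENES = string.ascii_uppercase + " "

def fitness(individual):
    """Number of characters differing from the target (lower is better)."""
    return sum(a != b for a, b in zip(individual, TARGET))

def random_individual():
    return "".join(random.choice(GENES) for _ in range(len(TARGET)))

def crossover(p1, p2):
    cut = random.randrange(1, len(TARGET))
    return p1[:cut] + p2[cut:]

def mutate(ind, rate=0.05):
    return "".join(random.choice(GENES) if random.random() < rate else c
                   for c in ind)

population = [random_individual() for _ in range(200)]
for generation in range(500):
    population.sort(key=fitness)               # selection: fittest first
    if fitness(population[0]) == 0:
        break
    parents = population[:50]                  # fittest quarter reproduce
    population = [mutate(crossover(random.choice(parents),
                                   random.choice(parents)))
                  for _ in range(200)]
print(generation, min(population, key=fitness))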
Adversarial Search
Competitive environments, in which the agents’ goals are in conflict, give rise to
adversarial search problems- often known as games. In AI, “games” are usually of a rather
specialized kind – in which there are two agents whose actions must alternate and in which
the utility values at the end of the game are always equal and opposite.
A game can be formally defined as a kind of search problem with the following
components:
• The initial state, which includes the board position and identifies the player to move.
•A successor function, which returns a list of (move, state )pairs, each indicating a legal
move and the resulting state.
•A terminal test, which determines when the game is over. States where the game has
ended are called terminal states.
•A utility function, which gives a numeric value for the terminal states.
The initial state and the legal moves for each side define the game tree for the game.
The following figure shows part of the game tree for tic-tac-toe. From the initial state, MAX
has nine possible moves. Play alternates between MAX’s placing an X and MIN’s placing an
O until we reach leaf nodes corresponding to terminal states such that one player has three in
a row or all the squares are filled.
Game tree (2-player, deterministic, turns)
MINIMAX
Given a game tree, the optimal strategy can be determined by examining the minimax
value of each node, which we write as MINIMAX-VALUE(n). The minimax value of a node
is the utility of being in the corresponding state, assuming that both players play optimally
from there to the end of the game.
•Perfect play for deterministic games
•Idea: choose move to position with highest minimax value = best achievable payoff
against best play
E.g., 2-ply game:
Properties
• Complete? Yes (if tree is finite)
• Optimal? Yes (against an optimal opponent)
• Time complexity? O(b^m): the maximum depth of the tree is m, and there are b legal moves
at each point
• Space complexity? O(bm) (depth-first exploration)
• With "perfect ordering" of moves, alpha-beta pruning (below) reduces the time complexity to O(b^(m/2))
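A minimal recursive sketch of minimax over a game supplied by the caller as successors, utility, and terminal-test functions (MAX is assumed to move first):

def minimax_value(state, successors, utility, is_terminal, maximizing=True):
    """Return the minimax value of a state, assuming optimal play by both sides."""
    if is_terminal(state):
        return utility(state)
    values = [minimax_value(s, successors, utility, is_terminal, not maximizing)
              for s in successors(state)]
    return max(values) if maximizing else min(values)

def minimax_decision(state, successors, utility, is_terminal):
    """Choose the move leading to the successor with the highest minimax value."""
    return max(successors(state),
               key=lambda s: minimax_value(s, successors, utility, is_terminal,
                                           maximizing=False))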
ALPHA-BETA PRUNING
The problem with minimax procedure is that the number of game states it has to
examine is exponential in the number of moves. We can cut it in half using the technique
called alpha-beta pruning. When applied to a standard minimax tree, it returns the same move
as minimax would, but prunes away branches that cannot possibly influence the final
decision.
Consider again the two-ply game tree. The steps are explained in the following figure. The
outcome is that we can identify the minimax decision without ever evaluating two of the leaf
nodes.
α-β pruning example
The value of the root node is given by
MINIMAX-VALUE(root) = max(min(3,12,8), min(2,x,y), min(14,5,2))
= max(3,min(2,x,y),2)
=max(3,z,2) where z<=2
=3
x and y: two unevaluated successors
z: the minimum of x and y
Properties
•Pruning does not affect final result
•Good move ordering improves effectiveness of pruning
•A simple example of the value of reasoning about which computations are relevant (a form of
metareasoning)
Why is it called α-β?
•α is the value of the best (i.e., highest-value) choice found so far at any choice point along
the path for max
•β is the value of the best (i.e., lowest-value) choice found so far at any choice point along the
path for min
•If v is worse than α, MAX will avoid it, so we can prune that branch.
•In general, we can prune the remaining branches at a node as soon as the value of the
current node is known to be worse than the current α or β value for MAX or MIN
respectively.
The effectiveness of alpha-beta pruning is highly dependent on the order in which the
successors are examined.
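A sketch of the same recursion with alpha-beta pruning added; it returns the same value as minimax but skips branches that cannot influence the decision:

import math

def alphabeta(state, successors, utility, is_terminal,
              alpha=-math.inf, beta=math.inf, maximizing=True):
    if is_terminal(state):
        return utility(state)
    if maximizing:
        value = -math.inf
        for s in successors(state):
            value = max(value, alphabeta(s, successors, utility, is_terminal,
                                         alpha, beta, maximizing=False))
            alpha = max(alpha, value)
            if alpha >= beta:
                break                 # beta cutoff: MIN will avoid this branch
        return value
    else:
        value = math.inf
        for s in successors(state):
            value = min(value, alphabeta(s, successors, utility, is_terminal,
                                         alpha, beta, maximizing=True))
            beta = min(beta, value)
            if alpha >= beta:
                break                 # alpha cutoff: MAX will avoid this branch
        return value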
UNIT II
PROBABILISTIC REASONING
ACTING UNDER UNCERTAINTY
Uncertainty
Let action A_t = leave for the airport t minutes before the flight.
Will A_t get me there on time?
Problems:
1) partial observability (road state, other drivers' plans, etc.)
2) noisy sensors (KCBS traffic reports)
3) uncertainty in action outcomes (flat tire, etc.)
4) immense complexity of modelling and predicting traffic
Hence a purely logical approach either
1) risks falsehood: "A_25 will get me there on time"
or 2) leads to conclusions that are too weak for decision making:
"A_25 will get me there on time if there's no accident on the bridge and it doesn't rain
and my tires remain intact, etc."
Methods for handling uncertainty
Default or nonmonotonic logic:
Assume my car does not have a flat tire
Assume A_25 works unless contradicted by evidence
Issues: What assumptions are reasonable? How to handle contradiction?
Rules with fudge factors:
Propositions
Think of a proposition as the event (set of sample points) where the proposition is true
Given Boolean random variables A and B:
Often in AI applications, the sample points are defined by the values of a set of random
variables, i.e., the sample space is the Cartesian product of the ranges of the variables With
Boolean variables, sample point = propositional logic model
The definitions imply that certain logically related events must have related
probabilities
De Finetti (1931): an agent who bets according to probabilities that violate these
axioms can be forced to bet so as to lose money regardless of outcome.
Syntax for propositions
Propositional or Boolean random variables
e.g., Cavity (do I have a cavity?)
Cavity =true is a proposition, also written cavity
Discrete random variables (finite or infinite)
e.g., Weather is one of ⟨sunny, rain, cloudy, snow⟩
Weather =rain is a proposition
Values must be exhaustive and mutually exclusive
Continuous random variables (bounded or unbounded)
e.g., Temp = 21.6; also allow, e.g., Temp < 22.0.
Arbitrary Boolean combinations of basic propositions
Prior probability
Prior or unconditional probabilities of propositions
e.g., P(Cavity = true) = 0.1 and P(Weather = sunny) = 0.72
correspond to belief prior to arrival of any (new) evidence
Probability distribution gives values for all possible assignments:
P(Weather) = ⟨0.72, 0.1, 0.08, 0.1⟩ (normalized, i.e., sums to 1)
Joint probability distribution for a set of r.v.s gives the
probability of every atomic event on those r.v.s (i.e., every sample point)
P(Weather,Cavity) = a 4 X 2 matrix of values:
Every question about a domain can be answered by the joint distribution because every event
is a sum of sample points
Probability for continuous variables
Express distribution as a parameterized function of value:
P(X = x) = U[18, 26](x) = uniform density between 18 and 26
Gaussian density
Conditional probability
Conditional or posterior probabilities
e.g., P(cavity | toothache) = 0.8
i.e., given that toothache is all I know
NOT “if toothache then 80% chance of cavity"
P(Cavity | Toothache) = (2-element vector of 2-element vectors)
If we know more, e.g., cavity is also given, then we have
P(cavity | toothache, cavity) = 1
Note: the less specific belief remains valid after more evidence arrives,
but is not always useful
New evidence may be irrelevant, allowing simplification,
e.g., P(cavity | toothache, 49ersWin) = P(cavity | toothache) = 0.8
This kind of inference, sanctioned by domain knowledge, is crucial
Let X be all the variables. Typically, we want the posterior joint distribution of the query
variables Y given specific values e for the evidence variables E
Let the hidden variables be H = X - Y - E
Then the required summation of joint entries is done by summing out the hidden
variables:
P(Y | E = e) = α P(Y, E = e) = α Σ_h P(Y, E = e, H = h)
The terms in the summation are joint entries because Y, E, and H together
exhaust the set of random variables
Obvious problems:
1) Worst-case time complexity O(d^n), where d is the largest arity and n the number of variables
2) Space complexity O(d^n) to store the joint distribution
3) How to find the numbers for the O(d^n) entries?
Independence
Bayes' rule
P(Y | X) = P(X | Y) P(Y) / P(X) = α P(X | Y) P(Y)
P(Cause | Effect) = P(Effect | Cause) P(Cause) / P(Effect)
E.g., let M be meningitis, S be stiff neck:
P(m | s) = P(s | m) P(m) / P(s) = (0.8 × 0.0001) / 0.1 = 0.0008
Bayes' Theorem finds the probability of an event occurring given the probability of
another event that has already occurred. Bayes' theorem is stated mathematically as
the following equation:
P(a | b) = P(b | a) P(a) / P(b)
where y is the class variable and X is a dependent feature vector (of size n):
X = (x1, x2, …, xn)
Just to clear, an example of a feature vector and corresponding class variable
can be: (refer 1st row of dataset)
Naive assumption
Now, it is time to add a naive assumption to Bayes' theorem: independence among the
features. So we split the evidence into independent parts.
If any two events A and B are independent, then
P(A, B) = P(A) P(B)
Hence we reach the result:
P(y | x1, …, xn) = P(y) · Π_i P(xi | y) / [P(x1) P(x2) … P(xn)]
Since the denominator remains constant for a given input, we can remove that term:
P(y | x1, …, xn) ∝ P(y) · Π_i P(xi | y)
Now, we need to create a classifier model. For this, we find the probability of the
given set of inputs for all possible values of the class variable y and pick the
output with maximum probability. This can be expressed mathematically as:
y = argmax_y P(y) · Π_i P(xi | y)
So, finally, we are left with the task of calculating P(y) and P(xi | y).
Please note that P(y) is also called class probability and P(xi | y) is
called conditional probability.
The different naive Bayes classifiers differ mainly by the assumptions they
make regarding the distribution of P(xi | y).
Let us try to apply the above formula manually on our weather dataset. For this,
we need to do some precomputations on our dataset.
We need to find P(xi | yj) for each xi in X and yj in y. All these calculations have
been demonstrated in the tables below:
So, in the figure above, we have calculated P(xi | yj) for each xi in X and yj in
y manually in the tables 1-4. For example, probability of playing golf given that the
temperature is cool, i.e P(temp. = cool | play golf = Yes) = 3/9.
Also, we need to find class probabilities (P(y)) which has been calculated in the
table 5. For example, P(play golf = Yes) = 9/14.
So now, we are done with our pre-computations and the classifier is ready!
Let us test it on a new set of features (let us call it today). Since P(today) is common to
both class probabilities, we can ignore it and compare the proportional (unnormalized)
probabilities P(Yes | today) and P(No | today). These numbers can then be converted into
probabilities by normalizing them so that they sum to 1, and the class with the larger
probability is predicted.
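A compact sketch of the classifier described above, estimating P(y) and P(xi | y) by counting. The few (outlook, temperature) records used here are illustrative stand-ins, not the full play-golf table referred to in the text:

from collections import Counter, defaultdict

# Illustrative (outlook, temperature) -> play records; not the full dataset.
data = [(('Sunny', 'Hot'), 'No'), (('Sunny', 'Cool'), 'Yes'),
        (('Rainy', 'Cool'), 'Yes'), (('Overcast', 'Hot'), 'Yes'),
        (('Rainy', 'Hot'), 'No'), (('Overcast', 'Cool'), 'Yes')]

class_counts = Counter(y for _, y in data)
feature_counts = defaultdict(Counter)      # (feature index, y) -> value counts
for x, y in data:
    for i, value in enumerate(x):
        feature_counts[(i, y)][value] += 1

def predict(x):
    """argmax_y P(y) * prod_i P(x_i | y), with the constant P(x) dropped."""
    best_class, best_score = None, 0.0
    for y, n_y in class_counts.items():
        score = n_y / len(data)            # class probability P(y)
        for i, value in enumerate(x):
            score *= feature_counts[(i, y)][value] / n_y   # P(x_i | y)
        if score > best_score:
            best_class, best_score = y, score
    return best_class

print(predict(('Sunny', 'Cool')))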
Problem:
Solution:
o The Bayesian network for the above problem is given below. The
network structure shows that burglary and earthquake are the
parent nodes of the alarm and directly affect the probability of the
alarm going off, whereas John's and Mary's calls depend only on the alarm.
o The network thereby represents our assumptions that they do not directly
perceive the burglary, do not notice minor earthquakes,
and do not confer before calling.
o The conditional distribution for each node is given as a conditional
probability table, or CPT.
o Each row in the CPT must sum to 1 because the entries in the row
represent an exhaustive set of cases for the variable.
o In a CPT, a Boolean variable with k Boolean parents contains 2^k
probabilities. Hence, if there are two parents, the CPT will contain 4
probability values.
o Burglary (B)
o Earthquake(E)
o Alarm(A)
o John Calls(J)
o Mary calls(M)
From the formula of the joint distribution, we can write the problem statement
in the form of a probability distribution:
P(J, M, A, ¬B, ¬E) = P(J | A) · P(M | A) · P(A | ¬B, ¬E) · P(¬B) · P(¬E)
= 0.00068045.
Hence, a Bayesian network can answer any query about the domain by using the joint
distribution of the network.
Slightly intelligent way to sum out variables from the joint without actually constructing
its explicit representation
Enumeration algorithm
Evaluation tree
Variable elimination: carry out summations right-to-left, storing intermediate results (factors)
to avoid recomputation
Variable elimination: Basic operations
Summing out a variable from a product of factors: move any constant factors outside
the summation, then add up the submatrices in the pointwise product of the remaining factors.
Rejection sampling
Used to compute conditional probabilities P(X|e)
Generate samples as before
Reject samples that do not match evidence
Estimate by counting how often event X occurs in the remaining samples
Analysis of rejection sampling
Likelihood weighting
• Fix the values of the evidence variables E.
• Sample only the remaining (nonevidence) variables.
• This guarantees that each event generated is consistent with the evidence.
• Before tallying the count in the distribution for the query variable, each event is
weighted by the likelihood that the event accords with the evidence,
as measured by the product of the conditional probabilities for each evidence variable.
Likelihood weighting example
Likelihood weighting analysis
Markov chain Monte Carlo (MCMC):
Direct Sampling
The simplest kind of random sampling process for Bayesian networks generates
events from a network that has no evidence associated with it.
The idea is to sample each variable in turn, in topological order.
The probability distribution from which the value is sampled is conditioned on the
values already assigned to the variable’s parents.
Example
1. Sample from P(Cloudy) = 0.5, 0.5, value is true.
2. Sample from P(Sprinkler |Cloudy =true) = 0.1, 0.5, value is false.
3. Sample from P(Rain |Cloudy =true) = 0.8, 0.2, value is true.
4. Sample from P(WetGrass | Sprinkler =false, Rain =true) = 0.9, 0.1, value is
true.
The sample generated is [Cloudy = true, Sprinkler = false, Rain = true, WetGrass = true].
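A small sketch of this direct (prior) sampling process. The CPT entries used in the worked example above are taken from the text; the remaining entries are the usual textbook values and are assumed here:

import random

# CPTs for the Cloudy -> {Sprinkler, Rain} -> WetGrass network.
# Entries used in the worked example above come from the text; the rest are
# the usual textbook values, assumed here for completeness.
P_cloudy = 0.5
P_sprinkler = {True: 0.1, False: 0.5}                 # conditioned on Cloudy
P_rain = {True: 0.8, False: 0.2}                      # conditioned on Cloudy
P_wet = {(True, True): 0.99, (True, False): 0.90,     # conditioned on
         (False, True): 0.90, (False, False): 0.0}    # (Sprinkler, Rain)

def bernoulli(p):
    return random.random() < p

def prior_sample():
    """Sample each variable in topological order, conditioned on its parents."""
    cloudy = bernoulli(P_cloudy)
    sprinkler = bernoulli(P_sprinkler[cloudy])
    rain = bernoulli(P_rain[cloudy])
    wet = bernoulli(P_wet[(sprinkler, rain)])
    return {'Cloudy': cloudy, 'Sprinkler': sprinkler,
            'Rain': rain, 'WetGrass': wet}

# Estimate P(Rain = true) from 10,000 samples (should be near 0.5 here).
samples = [prior_sample() for _ in range(10_000)]
print(sum(s['Rain'] for s in samples) / len(samples))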
Causal inference
The majority opinion is that there is nothing special about a causal interpretation, that is,
one which asserts that corresponding to each (non-redundant) direct arc in the network
not only is there a probabilistic dependency but also a causal dependency. after all, by
reordering the variables and applying the network construction algorithm we can get the
arcs turned around! Yet, clearly, both networks cannot be causal.
That causal structure is what underlies all useful Bayesian networks. Certainly not all
Bayesian networks are causal, but if they represent a real-world probability distribution,
then some causal model is their source. Regardless of how that debate falls out, however,
it is important to consider how to do inferences with Bayesian networks that are causal.
If we have a causal model, then we can perform inferences which are not available with a
non-causal BN. This ability is important, for there is a large range of potential
applications for particularly causal inferences, such as process control, manufacturing and
decision support for medical intervention.
For Example, Consider again Pearl’s earthquake network. That network is intended to
represent a causal structure: each link makes a specific causal claim. Since it is a
Bayesian network (causal or not), if we observe that JohnCalls is true, then this will raise
the probability of MaryCalls being true, as we know. However, if we intervene, somehow
forcing John to call, this probability raising inference will no longer be valid. Why?
Because the reason an observation raises the probability of Mary calling is that there is a
common cause for both, the Alarm; so one provides evidence for the other.
However, under intervention we have effectively cut off the connection between the
Alarm and John's calling. The belief propagation (message passing) from JohnCalls to
Alarm and then down to MaryCalls is all wrong under intervention. This suggests that we
take "effectively cut off" quite literally, and model causal intervention in a variable
simply by (temporarily) cutting all arcs from its parents into that variable.
If you do that with the earthquake example then, of course, you will find that forcing John
to call will tell us nothing about earthquakes, burglaries, the Alarm or Mary — which is
quite correct. This is the simplest way to model causal interventions and often will do the
job.
UNIT III
INTRODUCTION TO MACHINE LEARNING
Every time we buy a product, every time we rent a movie, visit a web page, write a
blog, or post on social media, even when we just walk or drive around, a great deal of data is
generated.
Each of us is not only a generator but also a consumer of data. We want products
and services specialized for us; our needs have to be understood and our interests predicted.
Consider, for example, a supermarket chain that is selling thousands of goods to millions of
customers either at hundreds of brick-and-mortar stores all over a country or through a virtual
store over the web. The details of each transaction are stored: date, customer id, goods bought
and their amount, total money spent, and so forth. This typically amounts to a lot of data
every day. What the supermarket chain wants is to be able to predict which customer is likely
to buy which product, to maximize sales and profit. Similarly each customer wants to find the
set of products best matching his/her needs.
Applications of ML
Mine databases to obtain important information.
Reduce the feature space of large databases.
Optimize the learning principle by repetitive learning to improve performance.
Predict and classify normal (type one) data versus abnormal (type two) data in a dataset
by applying learning models to the dataset.
The application of machine learning methods to large databases is called data mining.
Applications of ML in Real world App
1. In retail firms – to analyse the buying patterns of goods purchased together.
2. In credit card applications – to detect fraudulent credit card use by analysing
transaction credentials.
3. In finance, banks – to analyse past data and build models that predict loan
approvals, based on loan payment details.
4. In the stock market – to recommend which firms' shares to purchase, and when,
based on previous upward and downward price movements.
5. In manufacturing- learning models are used for optimization, control, and
troubleshooting.
6. In medicine, learning programs are used for medical diagnosis, for example to
distinguish healthy patients from abnormal ones.
7. In telecommunications, call patterns are analyzed for network optimization and
maximizing the quality of service.
8. In science, large amounts of data in physics, astronomy, and biology can only be
analyzed fast enough by computers.
9. The World Wide Web is huge; it is constantly growing, and searching for relevant
information cannot be done manually.
Data Mining- The analogy is that a large volume of earth and raw material is extracted from
a mine, which when processed leads to a small amount of very precious material; similarly, in
data mining, a large volume of data is processed to extract simple valuable information.
Artificial Intelligence in ML
Machine learning is also a part of artificial intelligence: an intelligent system should have the ability to learn in a
changing environment. If the system can learn and adapt to such changes, the system designer
need not foresee and provide solutions for all possible situations. Machine learning also helps
us to find solutions for many problems in computer vision, speech recognition, and robotics.
Figure Example of a training dataset where each circle corresponds to one data instance with
input values in the corresponding axes and its sign indicates the class. For simplicity, only
two customer attributes, income and savings, are taken as input and the two classes are low-
risk (‘+’) and high-risk (‘−’). An example discriminant that separates the two types of
examples is also shown. and high-risk. From this perspective, we can see classification as
learning an association from X to Y. Then for a given X = x, if we have P(Y = 1|X = x) = 0.8,
we say that the customer has an 80 percent probability of being high-risk, or equivalently a
20 percent probability of being low-risk. We then decide whether to accept or refuse the loan
depending on the possible gain and loss.
Pattern recognition is one of the fields of machine learning; it identifies similar
repetitive patterns in the data space.
An example is optical character recognition, a problem in which character codes must be
recognized from their images. In this example there are multiple classes, as
many as there are characters we would like to recognize. Especially interesting is the case
when the characters are handwritten:
we take samples from writers and learn a definition of A-ness from these examples.
But though we do not know what it is that makes an image an ‘A’, we are certain that all those
distinct ‘A’s have some key properties in common, which is what we want to extract from the
examples. We know that a character image is a collection of strokes and has a regularity that
we can capture by a learning program. If we are reading a text, one factor we can make use of
is the redundancy in human languages. A word is a sequence of characters and successive
characters are not independent but are constrained by the words of the language. This has the
advantage that even if we cannot recognize a character, we can still read the word. Such
contextual dependencies may also occur in higher levels, between words and sentences,
through the syntax and semantics of the language. There are machine learning algorithms to
learn sequences and model such dependencies.
In face recognition, the input is an image, the classes are people to be recognized, and
the learning program should learn to associate the face images with identities. This problem is
more difficult than optical character recognition because there are more classes, input image
is larger, and a face is three-dimensional and differences in pose and lighting cause
significant changes in the image. There may also be occlusion of certain inputs; for example,
glasses may hide the eyes and eyebrows, and a beard may hide the chin.
In medical diagnosis, the inputs are the relevant information we have about the patient
and the classes are the illnesses. The inputs contain the patient’s age, gender, past medical
history, and current symptoms. Some tests may not have been applied to the patient, and thus
these inputs would be missing. Tests take time, may be costly, and may inconvenience the
patient so we do not want to apply them unless we believe that they will give us valuable
information. In the case of a medical diagnosis, a wrong decision may lead to a wrong or no
treatment, and in cases of doubt it is preferable that the classifier reject and defer decision to a
human expert.
In speech recognition, the input is acoustic and the classes are words that can be
uttered. This time the association to be learned is from an acoustic signal to a word of some
language. Different people, because of differences in age, gender, or accent, pronounce the
same word differently, which makes this task rather difficult. Another difference of speech is
that the input is temporal; words are uttered in time as a sequence of speech phonemes and
some words are longer than others. Acoustic information only helps up to a certain point, and
as in optical character recognition, the integration of a “language model” is critical in speech
recognition, and the best way to come up with a language model is again by learning it from
some large corpus of example data.
In natural language processing, spam filtering is one application, where spam generators on one
side and filters on the other side keep finding more and more ingenious ways to outdo each
other. Summarizing large documents is another interesting example; yet another is analyzing
blogs or posts on social networking sites to extract “trending” topics or to determine what to
advertise. Perhaps the most impressive would be machine translation. After decades of
research on hand-coded translation rules, it has become apparent that the most promising way
is to provide a very large number of example pairs of texts in both languages and have a
program figure out automatically the rules to map one to the other.
Biometrics is recognition or authentication of people using their physiological and/or
behavioral characteristics that requires an integration of inputs from different modalities.
Examples of physiological characteristics are images of the face, fingerprint, iris, and palm;
examples of behavioral characteristics are dynamics of signature, voice, gait, and key stroke.
As opposed to the usual identification procedures—photo, printed signature, or password—
when there are many different (uncorrelated) inputs, forgeries (spoofing) would be more
difficult and the system would be more accurate, hopefully without too much inconvenience
to the users. Machine learning is used both in the separate recognizers for these different
modalities and in the combination of their decisions to get an overall accept/reject decision,
taking into account how reliable these different sources are.
Knowledge Extraction
Learning a rule from data also allows knowledge extraction. The rule is a simple
model that explains the data, and looking at this model we have an explanation about the
process underlying the data. For example, once we learn the discriminant separating low-risk
and high- risk customers, we have the knowledge of the properties of low-risk customers. We
can then use this information to target potential low-risk customers more efficiently, for
example, through advertising. In learning such a model, the program optimizes the parameters, θ,
such that the approximation error is minimized, that is, our estimates are as close as possible
to the correct values given in the training set.
For example, in the figure, the model is linear, and w and w0 are the parameters optimized
for the best fit to the training data. In cases where the linear model is too restrictive, one
can use, for example, a quadratic model:
y = w2·x^2 + w1·x + w0
Figure: A training dataset of used cars and the function fitted. For simplicity, mileage is
taken as the only input attribute and a linear model is used.
Another example of regression is navigation of a mobile robot, for example, an
autonomous car, where the output is the angle by which the steering wheel should be turned
at each time, to advance without hitting obstacles and deviating from the route. Inputs in such
a case are provided by sensors on the car—for example, a video camera, GPS, and so forth.
Training data can be collected by monitoring and recording the actions of a human driver.
We can envisage other applications of regression where we are trying to optimize a
function. Let us say we want to build a machine that roasts coffee. The machine has many
inputs that affect the quality: various settings of temperatures, times, coffee bean type, and so
forth. We make a number of experiments and for different settings of these inputs, we
measure the quality of the coffee, for example, as consumer satisfaction. To find the optimal
setting, we fit a regression model linking these inputs to coffee quality and choose new points
to sample near the optimum of the current model to look for a better configuration. We
sample these points, check quality, and add these to the data and fit a new model. This is
generally called response surface design.
Sometimes instead of estimating an absolute numeric value, we want to be able to
learn relative positions. For example, in a recommendation system for movies, we want to
generate a list ordered by how much we believe the user is likely to enjoy each. Depending
on the movie attributes such as genre, actors, and so on, and using the ratings of the user
he/she has already seen, we would like to be able to learn a ranking function that we can then
use to choose among new movies.
SUPERVISED LEARNING
The task of supervised learning is this:
Given a training set of N example input–output pairs
(x1, y1), (x2, y2), . . . (xN, yN) ,
where each yj was generated by an unknown function y = f(x), discover a function h that
approximates the true function f.
Here x and y can be any value; they need not be numbers. The function h is a hypothesis.
Learning is a search through the space of possible hypotheses for one that will perform well,
even on new examples beyond the training set. To measure the accuracy of a hypothesis we
give it a test set of examples that are distinct from the training set.
a. CLASSIFICATION
When the output y is one of a finite set of values (such as sunny, cloudy or rainy), the
learning problem is called classification, and is called Boolean or binary classification if
there are only two values.
b. REGRESSION
When y is a number (such as tomorrow’s temperature), the learning problem is called
regression. (Technically, solving a regression problem is finding a conditional expectation or
average value of y, because the probability that we have found exactly the right real-valued
number for y is 0.)
1. CLASSIFICATION
Learning a Class from Examples
Let us say we want to learn the class, C, of a “family car.” We have a set of examples
of cars, and we have a group of people that we survey to whom we show these cars. The
people look at the cars and label them; the cars that they believe are family cars are positive
examples, and the other cars are negative examples. Class learning is finding a description
that is shared by all the positive examples and none of the negative examples.
Figure Training set for the class of a “family car.” Each data point corresponds to one
example car, and the coordinates of the point indicate the price and engine power of that car.
‘+’ denotes a positive example of the class (a family car), and ‘−’ denotes a negative example
(not a family car); it is another type of car.
Let us denote price as the first input attribute x1 and engine power as the second attribute x2
(e.g., engine volume in cubic centimeters). Thus we represent each car using two numeric
values, x = [x1, x2]^T, and its label denotes its type: r = 1 if the car is a positive example and
r = 0 if it is a negative example. Each car is then represented by such an ordered pair (x, r),
and the training set contains N such examples, X = {x^t, r^t}, t = 1, . . . , N,
where t indexes different examples in the set; it does not represent time or any such order.
Figure Example of a hypothesis class. The class of family car is a rectangle in the price-
engine power space.
Our training data can now be plotted in the two-dimensional (x1, x2) space, where each
instance t is a data point at coordinates (x1^t, x2^t) and its type, namely positive versus negative, is
given by r^t. After further discussions with the expert and the analysis of the data, we may
have reason to believe that for a car to be a family car, its price and engine power should be
in a certain range,
(p1 ≤ price ≤ p2) AND (e1 ≤ engine power ≤ e2)
for suitable values of p1, p2, e1, and e2. This equation assumes C to be a rectangle in
the price-engine power space, and it fixes H, the hypothesis class from which we believe C
is drawn, namely, the set of rectangles. The learning algorithm then finds the particular
hypothesis h ∈ H that best approximates C.
In real life we do not know C(x), so we cannot evaluate how well h(x) matches C(x).
What we have is the training set X, which is a small subset of the set of all possible x. The
empirical error is the proportion of training instances where the predictions of h do not match the
required values given in X. The error of hypothesis h given the training set X is
E(h | X) = (1/N) Σ_{t=1}^{N} 1(h(x^t) ≠ r^t)
where 1(·) equals 1 if its argument is true and 0 otherwise.
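As a minimal sketch of this empirical error, the snippet below evaluates a rectangle hypothesis h on a tiny (price, engine power) training set; the data points and the range parameters p1, p2, e1, e2 are hypothetical values chosen only for illustration.

import numpy as np

# Hypothetical training set: columns are (price, engine power); r = 1 marks a family car.
X = np.array([[16000, 110], [22000, 150], [30000, 250], [9000, 60], [18000, 130], [13000, 90]])
r = np.array([1, 1, 0, 0, 1, 0])

def h(x, p1=12000, p2=25000, e1=80, e2=180):
    # Rectangle hypothesis: positive iff price and engine power lie in the assumed ranges.
    return int(p1 <= x[0] <= p2 and e1 <= x[1] <= e2)

# Empirical error E(h|X): fraction of training instances where h(x^t) != r^t.
E = np.mean([h(x) != r_t for x, r_t in zip(X, r)])
print("empirical error:", E)   # 1/6 here, because the last point is misclassified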
2. REGRESSION
Linear Regression Models
In classification, given an input, the output that is generated is Boolean; it is a
yes/no answer. In regression the output is a numeric value, so what we would like to learn is
not a class but a numeric function.
Here we would like to write the numeric output, called the dependent variable, as a
function of the input, called the independent variable. We assume that the numeric output is
the sum of a deterministic function of the input and random noise:
r = f(x) + ϵ
where f(x) is the unknown function, which we would like to approximate by our
estimator, g(x|θ), defined up to a set of parameters θ. If we assume that ϵ is zero-mean
Gaussian with constant variance σ2, namely ϵ ∼ N(0, σ2), then maximizing the likelihood of
the sample is equivalent to minimizing the sum of squared errors between r and g(x|θ).
The simplest linear regression models are linear functions of the input
variables. However, we can obtain a much more useful class of functions by taking linear
combinations of a fixed set of nonlinear functions of the input variables, known as basis
functions. Such models are linear functions of the parameters, which gives them simple
analytical properties, and yet can be nonlinear with respect to the input variables.
y(x,w) = w0 + w1x1 + . . . + wDxD
where x = (x1, . . . , xD)T.
This is a linear function of the parameters w0, . . . , wD. We can extend the class of
models by considering linear combinations of fixed nonlinear functions of the input variables,
of the form
y(x,w) = w0 + Σ_{j=1}^{M−1} wj φj(x)
where φj(x) are known as basis functions. By denoting the maximum value of the index j by
M − 1, the total number of parameters in this model will be M.
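As a minimal sketch of such a basis-function model, the snippet below fits y(x,w) with polynomial basis functions φj(x) = x^j by ordinary least squares; the synthetic sine data and the choice M = 4 are assumptions made only for illustration.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(50)   # noisy targets

M = 4                                    # total number of parameters w0 ... w3
Phi = np.vander(x, M, increasing=True)   # design matrix with phi_j(x) = x**j, phi_0(x) = 1

# Maximum-likelihood (least-squares) solution for the weights.
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print("fitted weights:", w_ml)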
Least Squares
One can maximize the likelihood by minimizing the sum-of-squares error function.
Here we show that this error function can be motivated as the maximum likelihood solution
under an assumed Gaussian noise model. So consider the least squares approach, and
its relation to maximum likelihood, in more detail. As before, we assume that the target
variable t is given by a deterministic function y(x,w) with additive Gaussian noise, so that
t = y(x,w) + ϵ
where ϵ is a zero mean Gaussian random variable with precision (inverse variance) β.
Thus we can write
p(t | x, w, β) = N(t | y(x,w), β^{-1}).
Recall that, if we assume a squared loss function, then the optimal prediction, for a new value
of x, will be given by the conditional mean of the target variable. In the case of a Gaussian
conditional distribution of the form above, the conditional mean will be simply
E[t | x] = ∫ t p(t | x) dt = y(x,w).
Note that the Gaussian noise assumption implies that the conditional distribution of t given x
is unimodal, which may be inappropriate for some applications. An extension to mixtures of
conditional Gaussian distributions permits multimodal conditional distributions. Now
consider a data set of inputs X = {x1, . . . , xN} with corresponding target values t1, . . . , tN.
We group the target variables {tn} into a column vector that we denote by t where the
typeface is chosen to distinguish it from a single observation of a multivariate target, which
would be denoted t. Making the assumption that these data points are drawn independently
from the distribution, we obtain the following expression for the likelihood function, which is
a function of the adjustable parameters w and β, in the form
p(t | X, w, β) = Π_{n=1}^{N} N(tn | w^T φ(xn), β^{−1}).
Thus x will always appear in the set of conditioning variables, and so from now on we will
drop the explicit x from expressions such as p(t|x,w, β) in order to keep the notation
uncluttered. Taking the logarithm of the likelihood function, and making use of the standard
form for the univariate Gaussian, we obtain
ln p(t | w, β) = (N/2) ln β − (N/2) ln(2π) − β E_D(w)
where the sum-of-squares error function is defined by
E_D(w) = (1/2) Σ_{n=1}^{N} {t_n − w^T φ(x_n)}².
Drawback
Batch techniques, such as the maximum likelihood solution, which involve processing the
entire training set in one go, can be computationally costly for large data sets. In such cases it
may be worthwhile to use a sequential (on-line) algorithm in which the data points are
considered one at a time and the model parameters updated after each presentation, for
example stochastic gradient descent:
w^(τ+1) = w^(τ) − η ∇E_n
where τ denotes the iteration number, and η is a learning rate parameter. We shall discuss the
choice of value for η shortly. The value of w is initialized to some starting vector w^(0). For
the case of the sum-of-squares error function, this gives
w^(τ+1) = w^(τ) + η (t_n − w^(τ)T φ_n) φ_n
where φn = φ(xn). This is known as least-mean-squares or the LMS algorithm. The value of
η needs to be chosen with care to ensure that the algorithm converges.
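A minimal sketch of the LMS update is given below: the weights are updated one data point at a time using w ← w + η (t_n − wᵀφ_n) φ_n. The straight-line data, the learning rate, and the number of passes are assumptions chosen only so the example converges quickly.

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 100)
t = 2.0 * x + 0.5 + 0.05 * rng.standard_normal(100)   # noisy targets from a line

Phi = np.column_stack([np.ones_like(x), x])   # basis functions: phi_0 = 1, phi_1 = x
w = np.zeros(2)                               # starting vector w^(0)
eta = 0.1                                     # learning rate (must be small enough to converge)

for _ in range(20):                           # several sequential passes over the data
    for phi_n, t_n in zip(Phi, t):
        w = w + eta * (t_n - w @ phi_n) * phi_n   # LMS update for one data point
print("LMS estimate of (w0, w1):", w)         # should approach (0.5, 2.0)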
Single & multiple variables
So far, we have considered the case of a single target variable t. In some applications,
we may wish to predict K > 1 target variables, which we denote collectively by the target
vector
t. This could be done by introducing a different set of basis functions for each component of
t, leading to multiple, independent regression problems. However, a more interesting, and
more common, approach is to use the same set of
basis functions to model all of the components of the target vector so that
y(x,W) = W^T φ(x)
where y is a K-dimensional column vector, W is an M × K matrix of parameters, and φ(x) is
an M-dimensional column vector with elements φj(x).
If we have a set of observations t1, . . . , tN, we can combine these into a matrix T of
size N × K such that the nth row is given by t_n^T. Similarly, we can combine the input
vectors x1, . . . , xN into a matrix X. The log likelihood function is then given by
ln p(T | X, W, β) = (NK/2) ln(β / 2π) − (β/2) Σ_{n=1}^{N} ‖t_n − W^T φ(x_n)‖².
We can also treat the parameters w from a Bayesian viewpoint by introducing a prior
distribution over w, given by a Gaussian p(w) = N(w | m0, S0) having mean m0 and
covariance S0. The posterior distribution is proportional to the product of the likelihood
function and the prior. Due to the choice of a conjugate Gaussian prior distribution, the
posterior will also be Gaussian. We can evaluate this distribution by the usual procedure of
completing the square in the exponential, and then finding the normalization coefficient using
the standard result for a normalized Gaussian, which allows us to write down the posterior
distribution directly in the form
p(w | t) = N(w | mN, SN).
Note that because the posterior distribution is Gaussian, its mode coincides with its
mean. Thus the maximum posterior weight vector is simply given by wMAP = mN. If we
consider an infinitely broad prior S0 = α−1I with α → 0, the mean mN of the posterior
distribution reduces to the maximum likelihood value wML. Similarly, if N = 0, then the
posterior distribution reverts to the prior. Furthermore, if data points arrive sequentially, then
the posterior distribution at any stage acts as the prior distribution for the subsequent data
point, such that the new posterior distribution is again of this Gaussian form. Specifically, we
consider a zero-mean isotropic Gaussian prior governed by a single precision parameter α, so that
p(w | α) = N(w | 0, α^{−1} I).
The log of the posterior distribution is given by the sum of the log likelihood and the log of
the prior and, as a function of w, takes the form
ln p(w | t) = −(β/2) Σ_{n=1}^{N} {t_n − w^T φ(x_n)}² − (α/2) w^T w + const.
a. Discriminant Functions
A discriminant is a function that takes an input vector x and assigns it to one of K
classes, denoted Ck. Here, we shall restrict attention to linear discriminants, namely those for
which the decision surfaces are hyperplanes. To simplify the discussion, we consider first the
case of two classes and then investigate the extension to K >2 classes.
The simplest representation of a linear discriminant function is obtained by taking a
linear function of the input vector so that
y(x) = w^T x + w0
where w is called a weight vector, and w0 is a bias (not to be confused with bias in the
statistical sense). The negative of the bias is sometimes called a threshold. An input vector x
is assigned to class C1 if y(x) ≥ 0 and to class C2 otherwise. The corresponding decision
boundary is therefore defined by the relation y(x) = 0, which corresponds to a (D − 1)-
dimensional hyperplane within the D-dimensional input space. Consider two points xA and xB
both of which lie on the decision surface. Because y(xA) = y(xB) = 0, we have wT(xA−xB) = 0
and hence the
vector w is orthogonal to every vector lying within the decision surface, and so w determines
the orientation of the decision surface. Similarly, if x is a point on the decision surface, then
y(x) = 0, and so the normal distance from the origin to the decision surface is given by
w^T x / ‖w‖ = −w0 / ‖w‖.
We therefore see that the bias parameter w0 determines the location of the decision surface.
These properties are illustrated for the case of D = 2 in the figure. Furthermore, we note that the
value of y(x) gives a signed measure of the perpendicular distance r of the point x from the
decision surface. To see this, consider an arbitrary point x and let x⊥ be its orthogonal
projection onto the decision surface, so that
x = x⊥ + r w / ‖w‖.
Multiplying both sides of this result by w^T and adding w0, and making use of y(x) =
w^T x + w0 and y(x⊥) = w^T x⊥ + w0 = 0, we have
r = y(x) / ‖w‖.
This result is illustrated in the figure.
Setting the derivative of the sum-of-squares error with respect to the parameter matrix W̃ to
zero, and rearranging, we obtain the solution for W̃ in the form
W̃ = (X̃^T X̃)^{−1} X̃^T T = X̃^† T
where X̃^† is the pseudo-inverse of the matrix X̃, so the discriminant function takes the form
y(x) = W̃^T x̃ = T^T (X̃^†)^T x̃.
An interesting property of least-squares solutions with multiple target variables is that if
every target vector in the training set satisfies some linear constraint
a^T t_n + b = 0
for some constants a and b, then the model prediction for any value of x will satisfy
the same constraint so that
a^T y(x) + b = 0.
Thus if we use a 1-of-K coding scheme for K classes, then the predictions made by the
model will have the property that the elements of y(x) will sum to 1 for any value of x.
However, this summation constraint alone is not sufficient to allow the model outputs to be
interpreted as probabilities because they are not constrained to lie within the interval (0, 1).
The least-squares approach gives an exact closed-form solution for the discriminant function
parameters. However, even as a discriminant function it suffers from some severe problems.
We have already seen that least-squares solutions lack robustness to outliers, and this applies
equally to the classification application, as illustrated in the figure below.
The left plot shows data from two classes, denoted by red crosses and blue circles,
together with the decision boundary found by least squares (magenta curve)
b. Probabilistic Discriminative Models
For the two-class classification problem, we have seen that the posterior probability of
class C1 can be written as a logistic sigmoid acting on a linear function of x, for a wide
choice of class-conditional distributions p(x|Ck). Similarly, for the multiclass case, the
posterior probability of class Ck is given by a softmax transformation of a linear function of
x. For specific choices of the class-conditional densities p(x|Ck), we have used maximum
likelihood to determine the parameters of the densities as well as the class priors p(Ck) and
then used Bayes’ theorem to find the posterior class probabilities.
However, an alternative approach is to use the functional form of the generalized linear
model explicitly and to determine its parameters directly by using maximum likelihood. We
shall see that there is an efficient algorithm for finding such solutions known as iterative
reweighted least squares, or IRLS.
The indirect approach to finding the parameters of a generalized linear model, by
fitting class-conditional densities and class priors separately and then applying Bayes’
theorem, represents an example of generative modelling, because we could take such a model
and generate synthetic data by drawing values of x from the marginal distribution p(x). In the
direct approach, we are maximizing a likelihood function defined through the conditional
distribution p(Ck|x), which represents a form of discriminative training. One advantage of the
discriminative approach is that there will typically be fewer adaptive parameters to be
determined, as we shall see shortly. It may also lead to improved predictive performance,
particularly when the class-conditional density assumptions give a poor approximation to the
true distributions.
Figure: Illustration of the role of nonlinear basis functions in linear classification models. The left plot
shows the original input space (x1, x2) together with data points from two classes labelled red
and blue. Two ‘Gaussian’ basis functions φ1(x) and φ2(x) are defined in this space with
centres shown by the green crosses and with contours shown by the green circles.
Logistic regression
We begin our treatment of generalized linear models by considering the problem of
two-class classification. In generative approaches, we saw that under rather general
assumptions, the posterior probability of class C1 can be written as a logistic sigmoid acting
on a linear function of the feature vector φ, so that
p(C1 | φ) = y(φ) = σ(w^T φ)
with p(C2|φ) = 1 − p(C1|φ). Here σ(·) is the logistic sigmoid function. In the terminology of
statistics, this model is known as logistic regression, although it should be emphasized that
this is a model for classification rather than regression. For an M-dimensional feature space φ,
this
model has M adjustable parameters. By contrast, if we had fitted Gaussian class conditional
densities using maximum likelihood, we would have used 2M parameters for the means and
M(M + 1)/2 parameters for the (shared) covariance matrix. Together with the class prior
p(C1), this gives a total of M(M+5)/2+1 parameters, which grows quadratically with M, in
contrast to the linear dependence on M of the number of parameters in logistic regression. For
large values of M, there is a clear advantage in working with the logistic regression model
directly.
We now use maximum likelihood to determine the parameters of the logistic regression
model. To do this, we shall make use of the derivative of the logistic sigmoid function, which
can conveniently be expressed in terms of the sigmoid function itself:
dσ/da = σ(1 − σ).
For a data set {φn, tn}, where tn ∈ {0, 1} and φn = φ(xn), with n = 1, . . . , N, the
likelihood function can be written
p(t | w) = Π_{n=1}^{N} yn^{tn} (1 − yn)^{1−tn}
where t = (t1, . . . , tN)^T and yn = p(C1|φn). As usual, we can define an error function by
taking the negative logarithm of the likelihood, which gives the cross-entropy error function in
the form
E(w) = −ln p(t | w) = −Σ_{n=1}^{N} { tn ln yn + (1 − tn) ln(1 − yn) }
where yn = σ(an) and an = w^T φn. Taking the gradient of the error function with respect to w,
we obtain
∇E(w) = Σ_{n=1}^{N} (yn − tn) φn.
We see that the factor involving the derivative of the logistic sigmoid has cancelled, leading
to a simplified form for the gradient of the log likelihood. In particular, the contribution to the
gradient from data point n is given by the ‘error’ yn − tn between the target value and the
prediction of the model, times the basis function vector φn. Furthermore, this takes precisely
the same form as the gradient of the sum-of-squares error function for the linear regression
model.
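A minimal sketch of maximum-likelihood logistic regression is shown below: it applies batch gradient descent using the gradient Σ_n (y_n − t_n) φ_n derived above. The synthetic two-class data, the step size, and the iteration count are assumptions for illustration only.

import numpy as np

rng = np.random.default_rng(0)
N = 100
X = np.vstack([rng.normal([-1.0, -1.0], 1.0, size=(N // 2, 2)),
               rng.normal([+1.0, +1.0], 1.0, size=(N // 2, 2))])
t = np.r_[np.zeros(N // 2), np.ones(N // 2)]

Phi = np.column_stack([np.ones(N), X])        # add a bias basis function phi_0 = 1
w = np.zeros(Phi.shape[1])
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

for _ in range(500):
    y = sigmoid(Phi @ w)                      # y_n = sigma(w^T phi_n)
    grad = Phi.T @ (y - t)                    # sum_n (y_n - t_n) phi_n
    w -= 0.01 * grad                          # gradient descent step

print("weights:", w)
print("training accuracy:", np.mean((sigmoid(Phi @ w) > 0.5) == t))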
The right-hand plot of the figure shows the corresponding feature space (φ1, φ2) together with the
linear decision boundary given by a logistic regression model of this form. This
corresponds to a nonlinear decision boundary in the original input space, shown by the black
curve in the left-hand plot.
c. Probabilistic Generative Models
In this approach, we model the class-conditional densities p(x|Ck), as well as the class
priors p(Ck), and then use these to compute posterior probabilities p(Ck|x) through Bayes’
theorem. Consider first of all the case of two classes. The posterior probability for class C1
can be written as
p(C1 | x) = p(x|C1) p(C1) / [ p(x|C1) p(C1) + p(x|C2) p(C2) ] = σ(a)
where we have defined
a = ln [ p(x|C1) p(C1) / ( p(x|C2) p(C2) ) ]
Figure Plot of the logistic sigmoid function σ(a), shown in red, together with the scaled
probit function Φ(λa), for λ2 = π/8, shown in dashed blue, The scaling factor π/8 is chosen so
that the derivatives of the two curves are equal for a = 0.
and σ(a) is the logistic sigmoid function defined by
σ(a) = 1 / (1 + exp(−a))
which is plotted in the figure. The term 'sigmoid' means S-shaped. This type of function is
sometimes also called a 'squashing function' because it maps the whole real axis into a finite
interval, and it plays an important role in many classification algorithms. It satisfies the
following symmetry property
σ(−a) = 1 − σ(a)
as is easily verified. The inverse of the logistic sigmoid is given by
a = ln( σ / (1 − σ) )
and is known as the logit function. It represents the log of the ratio of probabilities
ln [p(C1|x)/p(C2|x)] for the two classes, also known as the log odds.
We have simply rewritten the posterior probabilities in an equivalent form, and so the
appearance of the logistic sigmoid may seem rather vacuous. However, it will have
significance provided a(x) takes a simple functional form.
Figure The margin is defined as the perpendicular distance between the decision boundary
and the closest of the data points, as shown on the left figure. Maximizing the margin leads to
a particular choice of decision boundary, as shown on the right. The location of this boundary
is determined by a subset of the data points, known as support vectors, which are indicated by
the circles.
Consider class-conditional densities modelled for each class using Gaussian kernels having a
common parameter σ2. Together with the class priors, this defines an
optimal misclassification-rate decision boundary. However, instead of using this optimal
boundary, they determine the best hyperplane by minimizing the probability of error relative
to the learned density model. In the limit σ2 → 0, the optimal hyperplane is shown to be the
one having maximum margin. The intuition behind this result is that as σ2 is reduced, the
hyperplane is increasingly dominated by nearby data points relative to more distant ones. In
the limit, the hyperplane becomes independent of data points that are not support vectors.
From the above figure that marginalization with respect to the prior distribution of the
parameters in a Bayesian approach for a simple linearly separable dataset leads to a decision
boundary that lies in the middle of the region separating the data points. The large margin
solution has similar behaviour.
The perpendicular distance of a point x from a hyperplane defined by y(x) = 0, where
y(x) = w^T φ(x) + b, is given by |y(x)| / ‖w‖. Furthermore, we are only interested in solutions for which all data
points are correctly classified, so that tny(xn) > 0 for all n. Thus the distance of a point xn to
the decision surface is given by
tn y(xn) / ‖w‖ = tn (w^T φ(xn) + b) / ‖w‖.
The margin is given by the perpendicular distance to the closest point xn from the data
set, and we wish to optimize the parameters w and b in order to maximize this distance. Thus
the maximum margin solution is found by solving
arg max_{w,b} { (1/‖w‖) min_n [ tn (w^T φ(xn) + b) ] }
where we have taken the factor 1/ ‖w‖ outside the optimization over n because w does
not depend on n. Direct solution of this optimization problem would be very complex, and so
we shall convert it into an equivalent problem that is much easier to solve. To do this we note
that if we make the rescaling w → κw and b → κb, then the distance from any point xn to the
decision surface, given by tny(xn)/ ‖w‖, is unchanged. We can use this freedom to set
tn (w^T φ(xn) + b) = 1
for the point that is closest to the surface. In this case, all data points will satisfy the constraints
tn (w^T φ(xn) + b) ≥ 1,  n = 1, . . . , N.
This is known as the canonical representation of the decision hyperplane. In the case
of data points for which the equality holds, the constraints are said to be active, whereas for
the remainder they are said to be inactive. By definition, there will always be at least one
active constraint, because there will always be a closest point, and once the margin has been
maximized there will be at least two active constraints. The optimization problem then simply
requires that we maximize ‖w‖^{−1}, which is equivalent to minimizing ‖w‖², and so we have to
solve the optimization problem
arg min_{w,b} (1/2) ‖w‖²
subject to the constraints given above.
SUPPORT VECTOR MACHINE (SVM)
SVM became popular some years ago for solving problems in classification. An
important property of support vector machines is that the determination of the model
parameters corresponds to a convex optimization problem, and so any local solution is also a
global optimum. The discussion of support vector machines makes extensive use of
Lagrange multipliers. The SVM, later generalized under the name kernel machine, has been
popular in recent years for a number of reasons:
1. It is a discriminant-based method and uses Vapnik’s principle to never solve a more
complex problem as a first step before the actual problem. For example, in
classification, when the task is to learn the discriminant, it is not necessary to
estimate the class densities p(x|Ci) or the exact posterior probability values
P(Ci|x); we only need to estimate where the class boundaries lie, that is, the x where
P(Ci|x) = P(Cj|x). Similarly, for outlier detection, we do not need to estimate the full
density p(x); we only need to find the boundary separating those x that have low p(x),
that is, the x where p(x) < θ, for some threshold θ ∈ (0, 1).
2. After training, the parameter of the linear model, the weight vector, can be written
down in terms of a subset of the training set, which are the so-called support vectors.
In classification, these are the cases that are close to the boundary and as such,
knowing them allows knowledge extraction: Those are the uncertain or erroneous
cases that lie in the vicinity of the boundary between two classes. Their number gives
us an estimate of the generalization error.
3. the output is written as a sum of the influences of support vectors and these are given
by kernel functions that are application-specific measures of similarity between data
instances.
4. Data instances need not be represented as vectors; kernel functions can be defined
directly between objects. For example, G1 and G2 may be two graphs and K(G1,G2)
may correspond to the number of shared paths, which we can calculate without
needing to represent G1 or G2 explicitly as vectors.
5. Kernel-based algorithms are formulated as convex optimization problems, and there
is a single optimum that we can solve for analytically. Therefore we are no longer
bothered with heuristics for learning rates, initializations, or checking for convergence.
We first discuss the case of classification, and then generalize to ranking, outlier (novelty)
detection, and dimensionality reduction. We see that in all cases we basically have the
same quadratic program template: maximize the separability, or margin, of instances
subject to a constraint on the smoothness of the solution. Solving it, we get the support
vectors. The kernel function defines the space according to its notion of similarity, and a
kernel function is good if we have better separation in its corresponding space.
Not only do we want the instances to be on the right side of the hyperplane, but we
also want them some distance away, for better generalization. The distance from the
hyperplane to the instances closest to it on either side is called the margin, which we want to
maximize for best generalization.
It is better to take a rectangle halfway between S and G, to get a breathing space. This
is so that in case noise shifts a test instance slightly, it will still be on the right side of the
boundary.
Similarly, now that we are using the hypothesis class of lines, the optimal separating
hyperplane is the one that maximizes the margin. The distance of x^t to the discriminant is
|w^T x^t + w0| / ‖w‖.
In finding the optimal hyperplane, we can convert the optimization problem to a form
whose complexity depends on N, the number of training instances, and not on d. Another
advantage of this new formulation is that it allows us to rewrite the basis functions in
terms of kernel functions.
Figure: For a two-class problem where the instances of the classes are shown by plus signs
and dots, the thick line is the boundary (the separating hyperplane) and the dashed lines define
the margins on either side. Circled instances are the support vectors.
For numerical stability, it is advised that this be done for all support vectors and an
average be taken. The discriminant thus found is called the support vector machine (SVM)
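A short sketch of finding such a maximum-margin discriminant with scikit-learn's SVC (linear kernel) is given below; the synthetic two-class data and the value of C are assumptions, and availability of scikit-learn is assumed.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(20, 2)), rng.normal(+2.0, 1.0, size=(20, 2))])
y = np.r_[np.zeros(20), np.ones(20)]

clf = SVC(kernel="linear", C=1.0)           # convex problem: a single global optimum
clf.fit(X, y)

print("support vectors per class:", clf.n_support_)
print("support vectors:\n", clf.support_vectors_)    # the instances closest to the boundary
print("prediction for a new point:", clf.predict([[0.5, 0.5]]))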
The figure shows an illustration of a recursive binary partitioning of the input space, along
with the corresponding tree structure. In this example, the first step divides the whole of the
input space into two regions according to whether x1 ≤ θ1 or x1 > θ1, where θ1 is a parameter
of the model. This creates two subregions, each of which can then be subdivided
independently. For instance, the region x1 ≤ θ1 is further subdivided according to whether
x2 ≤ θ2 or x2 > θ2, giving rise to the regions denoted A and B. The recursive subdivision can be
described by the traversal of the binary tree shown in the figure.
For any new input x, we determine which region it falls into by starting at the top of
the tree at the root node and following a path down to a specific leaf node according to the
decision criteria at each node. Note that such decision trees are not probabilistic graphical
models. Within each region, there is a separate model to predict the target variable. For
instance, or in classification we might assign each region to a specific class. A key property
of tree based models, which makes them popular in fields such as medical diagnosis, for
example, is that they are readily interpretable by humans because they correspond to a
sequence of binary decisions applied to the individual input variables. For instance, to predict
a patient's disease, we might first ask "Should the patient be treated with a blood-pressure-lowering
drug?". If the answer is yes, then we might next ask "Are there any side effects with the drug?".
Each leaf of the tree is then associated with a specific diagnosis. In order to learn such a
model from a training set, we have to determine the structure of the tree, including which
input variable is chosen at each node to form the split criterion as well as the value of the
threshold parameter θi for the split. We also have to determine the values of the predictive
variable within each region. In a classification problem the goal is to predict a target
variable t from a D-dimensional vector x = (x1, . . . , xD)^T of input variables. The training
data consists of input vectors {x1, . . . , xN} along with the corresponding labels {t1, . . . , tN}. If the
partitioning of the input space is given, the pruning criterion is then given by
C(T) = Σ_{τ=1}^{|T|} Q_τ(T) + λ |T|
where |T| denotes the number of leaf nodes, Q_τ(T) is the impurity measure of leaf τ, and λ is a
regularization parameter governing the trade-off between residual impurity and model complexity.
For classification problems, the tree is grown during training, and during testing a search
(traversal) of the tree is performed to classify each instance. If we define pτk to be the proportion
of data points in region Rτ assigned to class k, where k = 1, . . . , K, then two commonly used
impurity measures are the cross-entropy
Q_τ(T) = −Σ_{k=1}^{K} p_τk ln p_τk
and the Gini index
Q_τ(T) = Σ_{k=1}^{K} p_τk (1 − p_τk).
These both vanish for pτk = 0 and pτk = 1 and have a maximum at pτk = 0.5. They
encourage the formation of regions in which a high proportion of the data points are assigned
to one class. The cross entropy and the Gini index are better measures than the
misclassification rate for growing the tree because they are more sensitive to the node
probabilities. Also, unlike misclassification rate, they are differentiable and hence better
suited to gradient based optimization methods. For subsequent pruning of the tree, the
misclassification rate is generally used.
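A small sketch of the two impurity measures is given below, computed from the class proportions pτk of a leaf; the example proportions are chosen only to show that both measures vanish for a pure node and peak at 0.5.

import numpy as np

def cross_entropy(p):
    # Cross-entropy impurity: -sum_k p_k ln p_k (0 ln 0 is taken as 0).
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def gini(p):
    # Gini index: sum_k p_k (1 - p_k).
    p = np.asarray(p, dtype=float)
    return np.sum(p * (1.0 - p))

print(cross_entropy([0.5, 0.5]), gini([0.5, 0.5]))   # maximal impurity for two classes
print(cross_entropy([1.0, 0.0]), gini([1.0, 0.0]))   # pure node: both measures vanish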
However, in practice it is found that the particular tree structure that is learned is very
sensitive to the details of the data set, so that a small change to the training data can result in
a very different set of splits.
Random Forest
Our model extends existing forest-based techniques as it unifies classification,
regression, density estimation, manifold learning, semi-supervised learning and active
learning under the same decision forest framework. This means that the core implementation
needs to be written and optimized only once, and can then be applied to many diverse tasks. The
proposed model may be used both in a generative or discriminative way and may be applied
to discrete or continuous, labelled or unlabelled data.
If we train not one but many decision trees, each on a random subset of the training set or a
random subset of the input features, and combine their predictions, overall accuracy can be
significantly increased. This is the idea behind the random forest method.
The random forest algorithm works by completing the following steps:
Step 1: The algorithm selects random samples from the dataset provided.
Step 2: The algorithm will create a decision tree for each sample selected. Then it will get a
prediction result from each decision tree created.
Step 3: Voting will then be performed for every predicted result. For a classification
problem, it will use mode, and for a regression problem, it will use mean.
Step 4: And finally, the algorithm will select the most voted prediction result as the final
prediction.
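A minimal sketch of these steps with scikit-learn's RandomForestClassifier is shown below; the Iris data set and the number of trees are stand-in assumptions for illustration.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Each tree is trained on a bootstrap sample (steps 1-2); the forest prediction is
# the majority vote (mode) of the individual trees (steps 3-4).
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))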
Testing (Prediction)
• The unknown variables of testing dataset are predicted using the trained model
formed at training stage and the error rate is calculated.
UNIT IV
Suppose the true regression function that we are trying to predict is given by h(x), so that the
output of each of the M models can be written as the true value plus an error in the form
y_m(x) = h(x) + ε_m(x).
The average error made by the models acting individually is therefore
E_AV = (1/M) Σ_{m=1}^{M} E_x[ ε_m(x)² ]
where E_x[·] denotes the (frequentist) expectation with respect to the distribution of the input
vector x. If the errors have zero mean and are uncorrelated, the expected error of the committee
(the average of the M predictions) is
E_COM = (1/M) E_AV.
• The average error of a model can thus apparently be reduced by a factor of M simply by
averaging M versions of the model.
• Unfortunately, it depends on the key assumption that the errors due to the individual
models are uncorrelated.
• In practice, the errors are typically highly correlated, and the reduction in overall error
is generally small.
• It can, however, be shown that the expected committee error will not exceed the
expected error of the constituent models, so that ECOM ≤EAV.
Figure: Base models d1, d2, and d3 with associated quantities 1−Y1, 1−Y2, and 1−Y3
(combination diagram).
1. Max Voting
This voting method is generally used for classification problems. In this technique,
multiple models are used to make predictions for each data point, and the prediction made by
each model is considered as a vote. The predictions from the majority of models are used as
the final prediction. For example, suppose you asked 5 of your friends to rate your painting
(out of 5); three of them rated it as 4 while two of them gave it a 5. Since the majority gave a
rating of 4, the final rating is taken as 4. You can consider this as taking the mode of all the
predictions.
2. Averaging
Similar to the max voting technique, multiple predictions are made for each point in
averaging. In this method, we take an average of predictions from all the models and use it to
make the final prediction. Averaging can be used for making predictions in regression
problems or while calculating probabilities for classification problems. For example, in the
below case, the averaging method would take the average of all the values.
3. Weighted Average
This is an extension of the averaging method. All models are assigned different
weights, which define the importance of each model for the prediction. For instance, if two of
your friends are critics, while the others have no prior experience in this field, the answers
given by these two friends are given more importance compared to the others. The result is
calculated as
           Student 1   Student 2   Student 3   Student 4   Student 5   Final Rating
Weights    0.23        0.23        0.18        0.18        0.18
Rating     5           4           5           4           4           4.41

Final rating = 5×0.23 + 4×0.23 + 5×0.18 + 4×0.18 + 4×0.18 = 4.41
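The following sketch reproduces the max voting, simple averaging, and weighted averaging combinations on the ratings above (NumPy and the standard library only).

import numpy as np
from statistics import mode

ratings = np.array([5, 4, 5, 4, 4])                   # predictions from five "models" (friends)
weights = np.array([0.23, 0.23, 0.18, 0.18, 0.18])    # importance of each model

print("max voting (mode):", mode(ratings))            # -> 4
print("simple average:", ratings.mean())              # -> 4.4
print("weighted average:", float(np.dot(weights, ratings)))   # -> 4.41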
STACKING
1. Stacking is an ensemble learning technique that uses predictions from multiple models
(decision tree, SVM, etc.) to build a new model. This model is used for predictions on the
test set.
2. The base model (a decision tree) is fitted on 9 parts of the dataset, and predictions are
made for the 10th part.
Figure: The train set is split into 10 parts; the decision tree (DT) base model is fitted on 9
parts and used to predict the held-out 10th part as well as the test set.
3. The base model (decision tree) is then fitted on the whole train dataset, and predictions
are made on the test set.
4. Steps 2 and 3 are repeated for another base model, say SVM, resulting in another set of
predictions for the train and test data.
5. The predictions on the train set (Y1 & Y2) are used as features to build the new model.
6. This model is used to make the final predictions on the test prediction set.
Figure: The predictions y1 and y2 (Y1 & Y2) from the DT and SVM base models are used as
training data for the second-level model, which produces the final predictions.
Blending
This method follows the same approach as stacking
but it uses only a holdout validation set from the train set to make predictions.
The predictions are made on the holdout set only.
The holdout set and the predictions are used to build a model which will run on
the test dataset.
Figure: The base model (DT) makes predictions Y1 on the validation (holdout) set; the
holdout set together with Y1 is used to build a second model whose output Z1 gives the final
predictions on the test set.
Bagging (Bootstrap Aggregating)
The Bagging classifier is a general-purpose ensemble method that can be used with a
variety of different base models, such as decision trees, neural networks, and linear
models.
It uses bootstrap resampling to generate multiple different subsets of the training data,
and then trains a separate model on each subset.
The final predictions are made by combining the predictions of all the models, for
example by voting; the base-learners are made different by training them over
slightly different training sets.
As Bagging resamples the original training dataset with replacement, some instance
(or data) may be present multiple times while others are left out.
The Bagging classifier can be used to improve the performance of any base
classifier that has high variance, for example, decision tree classifiers.
The Bagging classifier can be used in the same way as the base classifier with the
only difference being the number of estimators and the bootstrap parameter.
It reduces the variance of the model and can help to reduce overfitting.
Bootstrap Resampling
Original training dataset (samples): 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Resampled training set 1: 2, 3, 3, 5, 6, 1, 8, 10, 9, 1
Resampled training set 2: 1, 1, 5, 6, 3, 8, 9, 10, 2, 7
Resampled training set 3: 1, 5, 8, 9, 2, 10, 9, 7, 5, 4
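A minimal bagging sketch along these lines is shown below: bootstrap-resample the training set (as in the index lists above), fit one decision tree per resample, and combine the predictions by majority vote. The Iris data set and the number of models are assumptions; scikit-learn's BaggingClassifier wraps the same idea in one call.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
n_models, n = 10, len(X)

preds = []
for _ in range(n_models):
    idx = rng.integers(0, n, size=n)               # bootstrap: sample n indices with replacement
    tree = DecisionTreeClassifier().fit(X[idx], y[idx])
    preds.append(tree.predict(X))

preds = np.array(preds)                            # shape (n_models, n)
vote = np.array([np.bincount(col).argmax() for col in preds.T])   # majority vote per instance
print("training accuracy of the bagged vote:", np.mean(vote == y))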
Boosting
• In boosting, a multistage combination of base-learners is formed by training the
next learner on the mistakes of the previous learners.
• The original boosting algorithm combines three weak learners to generate a strong
learner.
• A weak learner has error probability less than 1/2, which makes it better than random
guessing on a two-class problem, and a strong learner has arbitrarily small error
probability.
• Given a large training set, we randomly divide it into three parts: X1, X2, X3.
• We use three models: d1, d2, d3.
• We use X1 to train d1.
• We then feed X2 to d1 and use the instances misclassified by d1 (together with correctly
classified ones) to form the training set of d2.
• We then take X3 and feed it to d1 and d2.
• The instances on which d1 and d2 disagree form the training set of d3.
• During testing, given an instance, we give it to d1 and d2; if they agree, that is the
response, otherwise the response of d3 is taken as the output.
• This overall system has reduced error rate
Figure: Boosting pipeline. Training set X1 trains d1 (a decision tree); the samples
misclassified by d1 contribute to the training set of d2 (an SVM); the samples misclassified by
both d1 and d2 form the training set of d3 (logistic regression).
Algorithm(AdaBoost)
1. Initialise the dataset and assign equal weight to each of the data points.
2. Provide this as input to the model and identify the wrongly classified data points.
3. Increase the weight of the wrongly classified data points.
4. if (got required results)
Goto step 5
else
Goto step 2
5.End
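A compact sketch of this reweighting loop with decision stumps is shown below. The data set is synthetic, and the particular weight-update and combination rule shown (the standard AdaBoost formulas with labels mapped to ±1) goes beyond the simplified pseudocode above, so treat it as one possible instantiation.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
w = np.full(len(X), 1.0 / len(X))                 # step 1: equal weight for every data point

stumps, alphas = [], []
for _ in range(10):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = np.sum(w[pred != y]) / np.sum(w)        # step 2: weighted error of this learner
    alpha = 0.5 * np.log((1 - err) / (err + 1e-10))
    y_pm, pred_pm = 2 * y - 1, 2 * pred - 1       # map labels to {-1, +1}
    w *= np.exp(-alpha * y_pm * pred_pm)          # step 3: misclassified points get larger weights
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: weighted vote of the stumps.
votes = sum(a * (2 * s.predict(X) - 1) for s, a in zip(stumps, alphas))
print("training accuracy:", np.mean((votes > 0).astype(int) == y))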
X1 consists of 10 data points which consist of two types namely plus(+) and minus(-)
and 5 of which are plus(+) and the other 5 are minus(-) and each one has been assigned equal
weight initially. The first model tries to classify the data points and generates a vertical
separator line but it wrongly classifies 3 plus(+) as minus(-).
X2 consists of the 10 data points from the previous model in which the 3 wrongly
classified plus(+) are weighted more so that the current model tries more to classify these
pluses(+) correctly. This model generates a vertical separator line that correctly classifies the
previously wrongly classified pluses(+) but in this attempt, it wrongly classifies three
minuses(−).
X3 consists of the 10 data points from the previous model in which the 3 wrongly
classified minus(-) are weighted more so that the current model tries harder to classify these
minuses(-) correctly. This model generates a horizontal separator line that correctly classifies
the previously wrongly classified minuses(-).
X4 combines the three previous models in order to build a strong prediction model which is
much better than any individual model used.
Stacked Generalization
• The point of stacking is to explore a space of different models for the same problem.
• The idea is that you can attack a learning problem with different types of models
which are capable of learning some part of the problem, but not the whole space of
the problem.
• You can build multiple different learners and use them to build an
intermediate prediction, one prediction for each learned model.
• Then you add a new model which learns from the intermediate predictions the same
target.
• This final model is said to be stacked on the top of the others, hence the name.
• Thus, it improves the overall performance.
Algorithm
1. We split the training data into K subsets.
2. A base model is fitted on the K−1 parts and predictions are made for the Kth part.
3. We do this for each part of the training data.
4. The base model is then fitted on the whole train data set to calculate its
performance on the test set.
5. We repeat the last 3 steps for other base models.
6. Predictions from the train set are used as features for the second level model.
Second level model is used to make a prediction on the test set.
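A compact sketch of this algorithm with two base models and a second-level model is given below; cross_val_predict is used as a shortcut for steps 1-3, and the breast-cancer data set is a stand-in assumption.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base = [DecisionTreeClassifier(random_state=0), SVC(random_state=0)]

# Steps 1-3: out-of-fold predictions on the train set for each base model.
Z_tr = np.column_stack([cross_val_predict(m, X_tr, y_tr, cv=5) for m in base])
# Step 4: refit each base model on the whole train set and predict the test set.
Z_te = np.column_stack([m.fit(X_tr, y_tr).predict(X_te) for m in base])

# Steps 5-6: the second-level model is trained on the base-model predictions.
meta = LogisticRegression().fit(Z_tr, y_tr)
print("stacked test accuracy:", meta.score(Z_te, y_te))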
Figure: The training data is split into parts (X1 = first 3 parts, X2 = next 3 parts, X3 = last 3
parts), with a 10th part held out as test data. Base models such as a decision tree and an SVM
produce predictions Y1, Y2, Y3, which are fed to a second-level model (e.g., logistic
regression or naïve Bayes) to make the final prediction.
Supervised learning:
• discover patterns in the data that relate data attributes with a target (class) attribute.
• These patterns are then utilized to predict the values of the target attribute in future
data instances.
• Eg. KNN
Unsupervised learning:
The data have no target attribute.
We want to explore the data to find some intrinsic structures in them.
Eg. Clustering (K-Means)
UNSUPERVISED LEARNING
In unsupervised learning, there is no such supervisor and we only have input data. The
aim is to find the regularities in the input. There is a structure to the input space such that
certain patterns occur more often than others, and we want to see what generally happens and
what does not. In statistics, this is called density estimation.
a. Clustering
One method for density estimation is clustering where the aim is to find clusters or
groupings of input. In the case of a company with a data of past customers, the customer data
contains the demographic information as well as the past transactions with the company, and
the company may want to see the distribution of the profile of its customers, to see what type
of customers frequently occur. In such a case, a clustering model allocates customers similar
in their attributes to the same group, providing the company with natural groupings of its
customers; this is called customer segmentation. Once such groups are found, the company
may decide strategies, for example, services and products, specific to different groups; this is
known as customer relationship management. Such a grouping also allows identifying those
who are outliers, namely, those who are different from other customers, which may imply a
niche in the market that can be further exploited by the company.
An interesting application of clustering is in image compression. In this case, the
input instances are image pixels represented as RGB values. A clustering program groups
pixels with similar colors in the same group, and such groups correspond to the colors
occurring frequently in the image. If in an image, there are only shades of a small number of
colors, and if we code those belonging to the same group with one color, for example, their
average, then the image is quantized. Let us say the pixels are 24 bits to represent 16 million
colors, but if there are shades of only 64 main colors, for each pixel we need 6 bits instead of
24. For example, if the scene has various shades of blue in different parts of the image, and if
we use the same average blue for all of them, we lose the details in the image but gain space
in storage and transmission. Ideally, we would like to identify higher-level regularities by
analyzing repeated image patterns, for example, texture, objects, and so forth. This allows a
higher-level, simpler, and more useful description of the scene, and for example, achieves
better compression than compressing at the pixel level. If we have scanned document pages,
we do not have random on/off pixels but bitmap images of characters. There is structure in
the data, and
we make use of this redundancy by finding a shorter description of the data: 16 × 16 bitmap of
‘A’ takes 32 bytes; its ASCII code is only 1 byte.
In document clustering, the aim is to group similar documents. For example, news
reports can be subdivided as those related to politics, sports, fashion, arts, and so on.
Commonly, a document is represented as a bag of words—that is, we predefine a lexicon of
N words, and each document is an N-dimensional binary vector whose element i is 1 if word i
appears in the document; suffixes “–s” and “–ing” are removed to avoid duplicates and words
such as “of,” “and,” and so forth, which are not informative, are not used. Documents are
then grouped depending on the number of shared words. It is of course critical how the
lexicon is chosen.
Machine learning methods are also used in bioinformatics. DNA in our genome is the
“blueprint of life” and is a sequence of bases, namely, A, G, C, and T. RNA is transcribed
from DNA, and proteins are translated from the RNA. Proteins are what the living body is
and does. Just as a DNA is a sequence of bases, a protein is a sequence of amino acids (as
defined by bases). One application area of computer science in molecular biology is
alignment, which is matching one sequence to another. This is a difficult string matching
problem because strings may be quite long, there are many template strings to match against,
and there may be deletions, insertions, and substitutions. Clustering is used in learning motifs,
which are sequences of amino acids that occur repeatedly in proteins. Motifs are of interest because
they may correspond to structural or functional elements within the sequences they
characterize. The analogy is that if the amino acids are letters and proteins are sentences,
motifs are like words, namely, a string of letters with a particular meaning occurring
frequently in different sentences.
b. Reinforcement Learning
In some applications, the output of the system is a sequence of actions. In such a case,
a single action is not important; what is important is the policy that is the sequence of correct
actions to reach the goal. There is no such thing as the best action in any intermediate state;
an action is good if it is part of a good policy. In such a case, the machine learning program
should be able to assess the goodness of policies and learn from past good action sequences
to be able to generate a policy. Such learning methods are called reinforcement learning
algorithms. A good example is game playing where a single move by itself is not
that important; it is the sequence of right moves that is good. A move is good if it is part
of a good game playing policy. Game playing is an important research area in both artificial
intelligence and machine learning. This is because games are easy to describe and at the same
time, they are quite difficult to play well. A game like chess has a small number of rules but
it is very complex because of the large number of possible moves at each state and the large
number of moves
that a game contains. Once we have good algorithms that can learn to play games well, we
can also apply them to applications with more evident economic utility.
A robot navigating in an environment in search of a goal location is another
application area of reinforcement learning. At any time, the robot can move in one of a
number of directions. After a number of trial runs, it should learn the correct sequence of
actions to reach the goal state from an initial state, doing this as quickly as possible and
without hitting any of the obstacles. One factor that makes reinforcement learning harder is
when the system has unreliable and partial sensory information. For example, a robot
equipped with a video camera has incomplete information and thus at any time is in a
partially observable state and should decide on its action taking into account this uncertainty;
for example, it may not know its exact location in a room but only that there is a wall to its
left. A task may also require a concurrent operation of multiple agents that should interact
and cooperate to accomplish a common goal. An example is a team of robots playing soccer.
Clustering
• Clustering is a technique for finding similar groups in data, called clusters. I.e.,
• it groups data samples that are similar (near) to each other into one cluster and data
instances that are very different (far away) from each other into different clusters.
• Clustering is often called an unsupervised learning task as no class values denoting
an a priori grouping of the data instances are given, which is the case in supervised
learning.
• Uses a distance function (for measuring similarity)
• Clustering quality
• Inter-cluster distance maximized
• Intra-cluster distance minimized
• The quality of a clustering result depends on the algorithm, the distance function, and
the application.
K-means Clustering
• K-means is a clustering algorithm
• Let the set of data points (or instances) X be {x1, x2, …, xn},
• where xi = (xi1, xi2, …, xir) is a vector in a real-valued space X ⊆ R^r,
• r is the number of attributes (features) in the data.
• The k-means algorithm partitions the given data into k clusters.
• Each cluster has a cluster center, called centroid.
• k is specified by the user
Algorithm
1) Randomly choose k data points (seeds) to be the initial cluster centers (centroids),
2) Assign each data point to the closest cluster centers by finding the distance of
Sample and cluster center
3) Re-compute the cluster centers (centroids) using the current cluster memberships.
4) If a convergence criterion is not met, go to (2).
The centroid of a cluster Cj is the mean of the data points assigned to that cluster, and it is
used for determining cluster membership:
μj = (1 / |Cj|) Σ_{xi ∈ Cj} xi
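A minimal NumPy implementation of these steps is given below; the synthetic two-blob data, the value of k, and the fixed iteration budget are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(3.0, 0.5, (50, 2))])
k = 2

centroids = X[rng.choice(len(X), k, replace=False)]       # step 1: random seeds
for _ in range(10):                                        # fixed iteration budget
    # step 2: assign each point to the closest centroid
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # step 3: recompute each centroid as the mean of its assigned points
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

print("final centroids:\n", centroids)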
Figure: Alternating cluster-center assignment and membership updates over iterations
i = 0, 1, 2, 3, 4.
Illustration of the K-means algorithm using the re-scaled Old Faithful data set.
(a) Green points denote the data set in a two-dimensional Euclidean space. The initial choices
for centres μ1 and μ2 are shown by the red and blue crosses, respectively.
(b) In the initial E step, each data point is assigned either to the red cluster or to the blue
cluster, according to which cluster centre is nearer. This is equivalent to classifying the points
according to which side of the perpendicular bisector of the two cluster centres, shown by the
magenta line, they lie on.
(c) In the subsequent M step, each cluster centre is re-computed to be the mean of the points
assigned to the corresponding cluster.
(d)–(i) show successive E and M steps through to final convergence of the algorithm.
Instance Based Learning
In machine learning literature, nonparametric methods are also called instance-based
or memory-based learning algorithms, since what they do is store the training instances in a
lookup table and interpolate from these. This implies that all of the training instances should
be stored, and storing all requires memory of O(N). Furthermore, given an input, similar ones
should be found, and finding them requires computation of O(N).
• Let us define a distance between a and b, for example, |a − b|, and for each x,
• we define d1(x) ≤ d2(x) ≤ · · · ≤ dN(x) to be the distances arranged in ascending order, from x
to the points in the sample:
• d1(x) is the distance to the nearest sample,
• d2(x) is the distance to the next nearest, and so on.
• If x^t are the data points, then we define
• d1(x) = min_t |x − x^t|,
• and if i is the index of the closest sample, namely, i = arg min_t |x − x^t|, then
• d2(x) = min_{j≠i} |x − x^j|, and so forth.
• The k-nearest neighbor (k-nn) density estimate is
p̂(x) = k / (2 N d_k(x))
where d_k(x) is the distance from x to its kth nearest sample.
KNN Classifier
• The nearest neighbor algorithm is an instance-based Lazy learning algorithm.
• It defers the computation for classifying a sample until a test sample is ready to be
classified.
• It meets the criteria by storing the entire training set in memory and calculating the
distance from a test sample to every training sample at classification time.
• the predicted class of the test sample is the class of the closest training sample.
• The nearest neighbor algorithm is a specific instance of the k-nearest neighbor
algorithm where k = 1.
• When a test sample arrives, in order to classify it we tabulate the classes of the k
closest training samples and predict the class of the test sample as the mode of
those training samples' classes.
• In binary classification tasks, k is normally chosen to be an odd number in order to
avoid ties.
By calculating the Euclidean distance we find the nearest neighbors; for example, with K = 5
we might get three nearest neighbors in category A and two nearest neighbors in category B,
so the test input would be assigned to category A. The prediction can depend on K: with a
different choice, say K = 3 or K = 7, the majority class among the neighbors considered may
differ, and hence the predicted class may differ.
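A small from-scratch sketch of the k-nn classifier follows: the training set is stored, Euclidean distances to the test sample are computed at classification time, and the mode of the k nearest labels is returned. The toy points and the value of k are assumptions.

import numpy as np

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [3.0, 3.0], [3.2, 2.9]])
y_train = np.array([0, 0, 0, 1, 1])            # class A = 0, class B = 1

def knn_predict(x, k=3):
    d = np.linalg.norm(X_train - x, axis=1)    # Euclidean distance to every training sample
    nearest = y_train[np.argsort(d)[:k]]       # labels of the k closest samples
    return np.bincount(nearest).argmax()       # mode of those labels

print(knn_predict(np.array([1.1, 1.0]), k=3))  # -> 0 (class A)
print(knn_predict(np.array([2.9, 3.1]), k=3))  # -> 1 (class B)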
Gaussian Mixture Models
• Clusters modeled as Gaussians and Not just by their mean.
• EM algorithm: assign data to cluster with some probability
• It models each cluster using one of these Gaussian bells
• The Gaussian mixture model uses a simple linear superposition of Gaussian components.
• Aimed at providing a richer class of density models than the single Gaussian.
• Now turn to a formulation of Gaussian mixtures in terms of discrete latent variables.
• Recall from the Gaussian mixture distribution can be written as a linear superposition of
Gaussians in the form.
M-step ("Maximization")
• Start with the assignment probabilities (responsibilities) rnc.
• Update the parameters of each cluster (Gaussian) c: its mean μc, covariance Σc, and "size"
(mixing weight) πc.
• For each cluster, update its parameters using the (weighted) data points:
Nc = Σn rnc   (total responsibility allocated to cluster c)
πc = Nc / N   (fraction of the total assigned to cluster c)
μc = (1 / Nc) Σn rnc xn   (weighted mean of the assigned data)
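A minimal sketch of these M-step formulas, given a responsibility matrix r of shape (N, K), is shown below; the responsibilities are synthetic stand-ins for what an E-step would produce.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                 # data points
r = rng.random((100, 3))                      # stand-in responsibilities for K = 3 clusters
r /= r.sum(axis=1, keepdims=True)             # each row sums to 1, as an E-step would ensure

N_c = r.sum(axis=0)                           # total responsibility per cluster
pi = N_c / len(X)                             # mixing weights ("sizes")
mu = (r.T @ X) / N_c[:, None]                 # weighted means, shape (3, 2)
# weighted covariance of each cluster
cov = [(r[:, c, None] * (X - mu[c])).T @ (X - mu[c]) / N_c[c] for c in range(3)]

print("pi:", pi)
print("mu:\n", mu)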
Perceptron
First neural network learning model in the 1960’s
Simple and limited (single layer model)
Still used in some current applications (large business problems, where intelligibility
is needed, etc.)
• w0 is the intercept value to make the model more general; it is generally modeled as
the weight coming from an extra bias unit, x0, which is always 1.
• We can write the output of the perceptron as a dot product y = wT x
Threshold
Treat the threshold like any other weight. No special case. Call it a bias since it biases
the output up or down.
The perceptron can separate two classes by checking the sign of the output, using s(·) as the
threshold function:
y = s(w^T x), where s(a) = 1 if a > 0 and s(a) = 0 otherwise.
Remember that using a linear discriminant assumes that classes are linearly separable.
It is assumed that a hyperplane wT x = 0 can be found that separates xt ∈ C1 and xt ∈
C2.
Δwi = c(t – z) xi
Where,
o wi is the weight from input i to perceptron node,
o c is the learning rate,
o t is the target for the current instance,
o z is the current output, and xi is the ith input
Create a perceptron node with n inputs
Iteratively apply a pattern from the training set and apply the perceptron rule
Each iteration through the training set is an epoch
Continue training until total training set error ceases to improve
Perceptron Convergence Theorem: Guaranteed to find a solution in finite time if a
solution exists
Example (N = 2)
Training set (inputs x1, x2 and target t):
x1    x2    t
.8    .3    1
.4    .1    0
Figure: a two-input perceptron with weighted inputs x1 and x2, threshold θ = .2, and output y.
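A minimal sketch of training on this two-pattern set with the rule Δwi = c(t − z)xi follows; the learning rate, the zero initial weights, and the epoch count are assumptions.

import numpy as np

X = np.array([[0.8, 0.3], [0.4, 0.1]])        # training patterns from the example
t = np.array([1, 0])                          # targets

X_b = np.column_stack([X, np.ones(len(X))])   # append the bias input x0 = 1
w = np.zeros(3)                               # weights, including the bias weight
c = 0.1                                       # learning rate

for epoch in range(20):                       # each pass over the training set is an epoch
    for x_i, t_i in zip(X_b, t):
        z = int(w @ x_i > 0)                  # threshold the weighted sum
        w += c * (t_i - z) * x_i              # perceptron learning rule

print("learned weights (w1, w2, bias):", w)
print("outputs:", [int(w @ x_i > 0) for x_i in X_b])   # should match the targets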
Multilayer perceptron
A perceptron that has a single layer of weights can only approximate linear
functions of the input and cannot solve problems like the XOR, where the
discriminant to be estimated is nonlinear.
These compute a series of transformations, the first layer is the input and the last layer
is the output.
It is used for classification, MLP can implement nonlinear discriminants.
Input x is fed to the input layer (including the bias), the “activation” propagates in the
forward direction, and the values of the hidden units zh are calculated.
Each hidden unit is a perceptron by itself and applies the nonlinear sigmoid function
to its weighted sum:
z_h = sigmoid(w_h^T x) = 1 / [1 + exp(−(Σ_{j=1}^{d} w_hj x_j + w_h0))],  h = 1, . . . , H.
The outputs y_i are perceptrons in the second layer taking the hidden units as their
inputs:
y_i = v_i^T z = Σ_{h=1}^{H} v_ih z_h + v_i0
where there is also a bias unit in the hidden layer, which we denote by z0, and vi0 are
the bias weights. In a two-class discrimination task, there is one sigmoid output unit
and when there are K > 2 classes, there are K outputs with softmax as the output
nonlinearity.
If the hidden units’ outputs were linear, the hidden layer would be of no use: Linear
combination of linear combinations is another linear combination.
Sigmoid is the continuous, differentiable version of thresholding.
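A minimal sketch of the forward pass of such a network, with sigmoid hidden units z_h and a linear output layer, is given below; the layer sizes and the random weights are assumptions, and no training is shown.

import numpy as np

rng = np.random.default_rng(0)
d, H, K = 3, 4, 2                          # number of inputs, hidden units, outputs

W = rng.normal(size=(H, d + 1))            # hidden-layer weights w_hj, including a bias column
V = rng.normal(size=(K, H + 1))            # output-layer weights v_ih, including a bias column
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def forward(x):
    x = np.append(x, 1.0)                  # bias unit x0 = 1
    z = sigmoid(W @ x)                     # hidden activations z_h = sigmoid(w_h^T x)
    z = np.append(z, 1.0)                  # bias unit z0 = 1
    return V @ z                           # outputs y_i = v_i^T z (a sigmoid/softmax could follow)

print(forward(np.array([0.5, -1.0, 2.0])))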
Activation Functions
Neural networks are composed of nodes or units connected by directed links. A link
from unit i to unit j serves to propagate the activation ai from i to j. Each link also has a
numeric weight wi,j associated with it, which determines the strength and sign of the
connection.
The activation function g is typically either a hard threshold, in which case the unit is
called a perceptron, or a logistic (sigmoid) function, in which case the unit is sometimes
called a sigmoid perceptron.
Types of Networks
Feedforward networks
These compute a series of transformations
Typically, the first layer is the input and the last layer is the output.
In feed-forward networks with intermediate or hidden layers between the input
and the output layers.
Recurrent networks
These have directed cycles in their connection graph. They can have complicated
dynamics.
More biologically realistic.
Feed-Forward Networks
Single layer feed-forward networks
Input layer projecting into the output layer
Perceptron Model
Y = I( Σ_i wi xi − t > 0 )
where I(·) is the indicator (step) function and t is the threshold.
Algorithm for learning ANN
Initialize the weights (w0, w1, …, wk)
Adjust the weights in such a way that the output of ANN is consistent with
class labels of training examples
Error function: E = Σ_i [Yi − f(w, Xi)]², the squared difference between the class label
and the network output over the training examples.
Find the weights wi's that minimize the above error function
e.g., gradient descent, backpropagation algorithm
Type of Network
A neural network has many layers and each layer performs a specific function; as
the complexity of the model increases, the number of layers also increases, which is why it is
known as the multi-layer perceptron.
Hidden Layer
At first, information is fed into the input layer, which then transfers it to the hidden
layers; the interconnections between these two layers assign weights to each input
randomly at the initial point.
A bias is then added to each input neuron, and after this the weighted sum, which is a
combination of weights and bias, is passed through the activation function.
The activation function decides which nodes to fire for feature extraction, and finally the
output is calculated.
This whole process is known as forward propagation. After getting the output, the model
compares it with the original (target) output; the error is computed and the weights are
updated in backward propagation to reduce the error, and this process continues for a
certain number of epochs (iterations).
Finally, model weights get updated and prediction is done.
This update is repeated by cycling through the data either in sequence or by selecting
points at random with replacement.
There are of course intermediate scenarios in which the updates are based on batches
of data points.
One advantage of on-line methods compared to batch methods is that the former
handle redundancy in the data much more efficiently.
For example, consider an extreme case, in which we take a data set and double its size
by duplicating every data point.
Note that this simply multiplies the error function by a factor of 2 and so is equivalent
to using the original error function.
Batch methods will require double the computational effort to evaluate the batch error
function gradient, whereas online methods will be unaffected.
Another property of on-line gradient descent is the possibility of escaping from local
minima, since a stationary point with respect to the error function for the whole data
set will generally not be a stationary point for each data point individually.
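A rough sketch of the difference for a linear model with squared error (the function names are mine, not from the text): the batch update sums the gradient over every point before moving, while the on-line update moves after each randomly selected point, so duplicating the data set doubles the cost of the batch step but leaves the on-line step unchanged.

import numpy as np

def grad_point(w, x, y):
    # gradient of the squared error for one point of a linear model y_hat = w . x
    return (w @ x - y) * x

def batch_update(w, X, Y, eta):
    # one batch step: gradient summed over the whole data set
    return w - eta * sum(grad_point(w, x, y) for x, y in zip(X, Y))

def online_update(w, X, Y, eta, rng):
    # one on-line (stochastic) step: a single point chosen at random with replacement
    i = rng.integers(len(X))
    return w - eta * grad_point(w, X[i], Y[i])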
Network Training Procedures
Improving Convergence
Gradient descent has various advantages. It is simple. It is local; namely, the change in a weight
uses only the values of the presynaptic and postsynaptic units and the error (suitably
backpropagated). When online training is used, it does not need to store the training set and
can adapt as the task to be learned changes. Because of these reasons, it can be (and is)
implemented in hardware. But by itself, gradient descent converges slowly. When learning
time is important, one can use more sophisticated optimization methods. Bishop discusses in
detail the application of conjugate gradient and second-order methods to the training of
multilayer perceptrons. However, there are two frequently used simple techniques that
improve the performance of the gradient descent considerably, making gradient-based
methods feasible in real applications.
Momentum
Let us say wi is any weight in a multilayer perceptron in any layer, including the
biases. At each parameter update, successive Δw_i^t values may be so different that large
oscillations may occur and slow convergence; t is the time index, that is, the epoch number in
batch learning and the iteration number in online learning. The idea is to take a running
average by incorporating the previous update in the current change, as if there is a momentum
due to previous updates:
Δw_i^t = −η ∂E^t/∂w_i + α Δw_i^{t−1}
α is generally taken between 0.5 and 1.0. This approach is especially useful when online learning
is used, where as a result we get an averaging effect that smooths the trajectory during
convergence. The disadvantage is that the past Δw_i^{t−1} values have to be stored in extra
memory.
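A minimal sketch of this update (the η and α values are illustrative; grad_fn stands for whatever computes the gradient of the error):

import numpy as np

def momentum_step(w, grad, prev_dw, eta=0.1, alpha=0.9):
    # current change = plain gradient step plus alpha times the previous change
    dw = -eta * grad + alpha * prev_dw
    return w + dw, dw                 # dw must be kept in extra memory for the next update

# usage inside a training loop:
# dw = np.zeros_like(w)
# for t in range(epochs):
#     w, dw = momentum_step(w, grad_fn(w), dw)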
By gradient descent,
Iteratively process a set of training tuples & compare the network's prediction
with the actual known target value
For each training tuple, the weights are modified to minimize the mean
squared error between the network's prediction and the actual target value
Modifications are made in the “backwards” direction: from the output layer,
through each hidden layer down to the first hidden layer, hence
“backpropagation”
Steps
Initialize weights (to small random #s) and biases in the network
Propagate the inputs forward (by applying activation function)
Backpropagate the error (by updating weights and biases)
Terminating condition (when error is very small, etc.)
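A minimal NumPy sketch of these four steps for a network with one sigmoid hidden layer and a single sigmoid output, trained on the squared error (the sizes, the random data and the learning rate of 0.5 are all illustrative, and bias terms are omitted to keep the sketch short):

import numpy as np
rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

X = rng.normal(size=(8, 4)); Y = rng.integers(0, 2, size=(8, 1)).astype(float)
W1 = rng.normal(scale=0.1, size=(4, 3))    # step 1: small random initial weights
W2 = rng.normal(scale=0.1, size=(3, 1))
for epoch in range(500):
    Z = sigmoid(X @ W1)                     # step 2: propagate the inputs forward
    Y_hat = sigmoid(Z @ W2)
    d2 = (Y_hat - Y) * Y_hat * (1 - Y_hat)  # step 3: backpropagate the error
    d1 = (d2 @ W2.T) * Z * (1 - Z)
    W2 -= 0.5 * Z.T @ d2                    # update from the output layer backwards
    W1 -= 0.5 * X.T @ d1
    if np.mean((Y_hat - Y) ** 2) < 1e-3:    # step 4: terminating condition
        break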
Error Backpropagation
The goal is to find an efficient technique for evaluating the gradient of an error function
E(w) for a feed-forward neural network.
It can be achieved using a local message passing scheme in which information is sent
alternately forwards and backwards through the network and is known as error
backpropagation, or sometimes simply as backprop.
In the first stage, the derivatives of the error function with respect to the weights are
evaluated. The important contribution of the backpropagation technique is in providing a
computationally efficient method for evaluating such derivatives. Because it is at this stage
that errors are propagated backwards through the network, this evaluation of derivatives is
what is termed backpropagation. The idea of propagating errors backwards through the
network in order to evaluate derivatives can be applied to many other kinds of network, not
just the multilayer perceptron. It can also be applied to error functions other than the simple
sum-of-squares, and to the evaluation of other derivatives such as the Jacobian and Hessian
matrices.
In the second stage, the derivatives are then used to compute the adjustments to be
made to the weights. The simplest such technique, and the one originally considered, involves
gradient descent. The weight adjustment using the calculated derivatives can be tackled using
a variety of optimization schemes, many of which are substantially more powerful than
simple gradient descent. It is important to recognize that the two stages are distinct.
In the above derivation we have implicitly assumed that each hidden or output unit in the
network has the same activation function h(・). The derivation is easily generalized, however,
to allow different units to have individual activation functions, simply by keeping track of
which form of h(・) goes with which unit.
On the graph below you can see a comparison between the sigmoid function itself and its
derivative. The first derivative of the sigmoid is a bell-shaped curve with values ranging from
0 to 0.25.
This knowledge of how neural networks perform forward and backpropagation is essential to
understanding the vanishing gradient problem.
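A quick numerical check of the 0.25 figure: the derivative of the sigmoid σ(x) is σ(x)(1 − σ(x)), which peaks at x = 0.

import numpy as np
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
x = np.linspace(-6, 6, 121)
d = sigmoid(x) * (1 - sigmoid(x))     # the bell-shaped derivative curve
print(d.max())                        # 0.25, reached at x = 0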
Back Propagation
As the network generates an output, the loss function(C) indicates how well it
predicted the output. The network performs back propagation to minimize the loss. A back
propagation method minimizes the loss function by adjusting the weights and biases of the
neural network. In this method, the gradient of the loss function is calculated with respect to
each weight in the network.
In back propagation, the new weight (w_new) of a node is calculated from the old
weight (w_old) minus the product of the learning rate (η) and the gradient of the loss
function: w_new = w_old − η · ∂C/∂w.
With the chain rule of partial derivatives, we can represent the gradient of the loss
function as a product of the gradients of all the activation functions of the nodes with respect
to their weights. Therefore, the updated weights of nodes in the network depend on the
gradients of the activation functions of each node.
For the nodes with sigmoid activation functions, we know that the partial derivative of
the sigmoid function reaches a maximum value of 0.25. When there are more layers in the
network, the value of the product of derivatives decreases until at some point the partial
derivative of the loss function approaches a value close to zero, and the partial derivative
vanishes. We call this the vanishing gradient problem.
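A rough illustration of why the product vanishes: each sigmoid layer contributes a factor of at most 0.25, so the product shrinks geometrically with depth (a simplified view that ignores the weights).

max_sigmoid_derivative = 0.25
for depth in (2, 5, 10, 20):
    # upper bound on the product of sigmoid derivatives across `depth` layers
    print(depth, max_sigmoid_derivative ** depth)
# 2 -> 0.0625, 5 -> ~0.00098, 10 -> ~9.5e-07, 20 -> ~9.1e-13: the early-layer gradient vanishes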
With shallow networks, the sigmoid function can be used, since the small value of the
gradient does not become an issue. When it comes to deep networks, however, the vanishing
gradient can have a significant impact on performance. The weights of the network remain unchanged as
the derivative vanishes. During back propagation, a neural network learns by updating its
weights and biases to reduce the loss function. In a network with vanishing gradient, the
weights cannot be updated, so the network cannot learn. The performance of the network will
decrease as a result.
Method to overcome the problem
The vanishing gradient problem is caused by the derivative of the activation function
used to create the neural network. The simplest solution to the problem is to replace the
activation function of the network. Instead of sigmoid, use an activation function such as
ReLU. Rectified Linear Units (ReLU) are activation functions that generate a positive linear
output when they are applied to positive input values. If the input is negative, the function
will return zero.
The derivative of a ReLU function is defined as 1 for inputs that are greater than zero and 0
for inputs that are negative. The graph shared below indicates the derivative of a ReLU
function
Shallow Network & Deep Network
Shallow Network
A shallow neural network has only one hidden layer between the input and output layers,
while a deep neural network has multiple hidden layers.
One way to think about it is like a hierarchy of decision-making. Just like how a human
brain makes decisions by processing information in layers, a deep neural network learns to
make decisions by processing information through multiple hidden layers. On the other hand,
a shallow network is like having just one layer of decision-making, which might not be
enough to capture the complexity of the problem at hand.
A shallow network might be used for simple tasks like image classification, while a deep
network might be used for more complex tasks like image segmentation or natural language
processing.
The main advantage of a shallow network is that it is computationally less expensive to
train and can be sufficient for simple tasks. However, it may not be powerful enough to
capture complex patterns in the data.
Deep Network
A deep network, on the other hand, can capture more complex patterns in the data and
potentially achieve higher accuracy, but it is more computationally expensive to train and
may require more data to avoid overfitting.
Additionally, deep networks can be more challenging to design and optimize than shallow
networks.
When it comes to deep networks, the vanishing gradient can have a significant impact on
performance. The weights of the network remain unchanged as the derivative vanishes.
During back propagation, a neural network learns by updating its weights and biases to
reduce the loss function. In a network with a vanishing gradient, the weights cannot be
updated, so the network cannot learn. The performance of the network will decrease as a
result.
The Rectified Linear Unit is the most commonly used activation function in deep
learning models. The function returns 0 if it receives any negative input, but for any positive
value x it returns that value back. So it can be written as f(x)=max(0,x).
Graphically it looks like this
It's surprising that such a simple function (and one composed of two linear pieces) can
allow your model to account for non-linearities and interactions so well. But the ReLU
function works great in most applications, and it is very widely used as a result.
An activation function serves two main purposes:
1) Help a model account for interaction effects. An interaction effect is when one variable
affects a prediction differently depending on the value of another variable.
2) Help a model account for non-linear effects. This just means that if I graph a variable on
the horizontal axis, and my predictions on the vertical axis, it isn't a straight line. Or said
another way, the effect of increasing the predictor by one is different at different values of
that predictor.
Interactions: Imagine a single node in a neural network model. For simplicity, assume it has
two inputs, called A and B. The weights from A and B into our node are 2 and 3 respectively.
So the node output is f(2A+3B). We'll use the ReLU function for our f. So, if 2A+3B is
positive, the output value of our node is also 2A+3B. If 2A+3B is negative, the output value
of our node is 0.
For concreteness, consider a case where A=1 and B=1. The output is 2A+3B, and if A
increases, then the output increases too. On the other hand, if B=-100 then the output is 0, and
if A increases moderately, the output remains 0. So A might increase our output, or it might
not. It just depends what the value of B is.
This is a simple case where the node captured an interaction. As you add more nodes
and more layers, the potential complexity of interactions only increases. But you should now
see how the activation function helped capture an interaction.
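The same toy node written out in code (the weights 2 and 3 and the values of A and B are exactly those of the example above):

def relu(x):
    return max(0.0, x)

def node(a, b):
    return relu(2 * a + 3 * b)        # node output f(2A + 3B)

print(node(1, 1), node(2, 1))         # 5.0 -> 7.0: increasing A raises the output
print(node(1, -100), node(2, -100))   # 0.0 -> 0.0: with B = -100 the same increase does nothing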
Non-linearities: A function is non-linear if the slope isn't constant. So, the ReLU function is
non-linear around 0, but the slope is always either 0 (for negative values) or 1 (for positive
values). That's a very limited type of non-linearity.
But two facts about deep learning models allow us to create many different types of
non-linearities from how we combine ReLU nodes.
First, most models include a bias term for each node. The bias term is just a constant
number that is determined during model training. For simplicity, consider a node with a
single input called A, and a bias. If the bias term takes a value of 7, then the node output is
f(7+A). In this case, if A is less than -7, the output is 0 and the slope is 0. If A is greater than -
7, then the node's output is 7+A, and the slope is 1.
So the bias term allows us to move where the slope changes. So far, it still appears we
can have only two different slopes.
However, real models have many nodes. Each node (even within a single layer) can
have a different value for its bias, so each node can change slope at different values of
our input.
When we add the resulting functions back up, we get a combined function that
changes slopes in many places.
These models have the flexibility to produce non-linear functions and account for
interactions well (if that will give better predictions). As we add more nodes in each layer (or
more convolutions if we are using a convolutional model), the model gains even greater
ability to represent these interactions and non-linearities.
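A small sketch of this idea (the biases and output weights below are arbitrary): summing several ReLU nodes, each with its own bias, yields a piecewise-linear function whose slope changes at several input values.

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

biases = np.array([7.0, 0.0, -3.0])   # each node changes slope at A = -bias
out_weights = np.array([1.0, -2.0, 0.5])

def combined(a):
    return np.sum(out_weights * relu(a + biases))   # sum of the node outputs

for a in (-10, -5, 0, 2, 5):
    print(a, combined(a))             # the slope of the result differs between these regions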
However researchers had great difficulty building models with many layers when
using the tanh function. It is relatively flat except for a very narrow range (that range being
about -2 to 2). The derivative of the function is very small unless the input is in this narrow
range, and this flat derivative makes it difficult to improve the weights through gradient
descent. This problem gets worse as the model has more layers. This was called the
vanishing gradient problem.
The ReLU function has a derivative of 0 over half its range (the negative numbers).
For positive inputs, the derivative is 1.
When training on a reasonable sized batch, there will usually be some data points
giving positive values to any given node. So the average derivative is rarely close to 0, which
allows gradient descent to keep progressing.
Leaky ReLU
There are many similar alternatives which also work well. The Leaky ReLU is one of
the most well known. It is the same as ReLU for positive numbers. But instead of being 0
for all negative values, it has a constant slope (less than 1.).
That slope is a parameter the user sets when building the model, and it is frequently
called α. For example, if the user sets α = 0.3, the activation function is f(x) = max(0.3*x, x).
This has the theoretical advantage that, by being influenced by x at all values, it may make
more complete use of the information contained in x.
There are other alternatives, but both practitioners and researchers have generally
found insufficient benefit to justify using anything other than ReLU.
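The two functions side by side, with α = 0.3 as in the example above (the sample inputs are arbitrary):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.3):
    return np.maximum(alpha * x, x)   # f(x) = max(alpha*x, x)

for x in (-2.0, -0.5, 0.0, 1.5):
    print(x, relu(x), leaky_relu(x))  # for negatives, ReLU gives 0 while Leaky ReLU gives 0.3*x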
Hyperparameter Tuning
A Machine Learning model is defined as a mathematical model with a number of
parameters that need to be learned from the data. By training a model with existing data, we
are able to fit the model parameters. However, there is another kind of parameter, known
as Hyperparameters, that cannot be directly learned from the regular training process. They
are usually fixed before the actual training process begins. These parameters express
important properties of the model such as its complexity or how fast it should learn.
Some examples of model hyperparameters include: the learning rate for training a neural
network, the penalty term C of a logistic regression classifier or support vector machine, and
the k in k-nearest neighbours.
Models can have many hyperparameters and finding the best combination of parameters can be
treated as a search problem. The two best strategies for Hyperparameter tuning are:
GridSearchCV
In GridSearchCV approach, the machine learning model is evaluated for a range of
hyperparameter values. This approach is called GridSearchCV, because it searches for the
best set of hyperparameters from a grid of hyperparameters values.
For example, suppose we want to set the two hyperparameters C and Alpha of the Logistic
Regression Classifier model, each with a different set of candidate values. The grid search
technique will construct many versions of the model with all possible combinations of
hyperparameters and will return the best one.
As in the image, for C = [0.1, 0.2, 0.3, 0.4, 0.5] and Alpha = [0.1, 0.2, 0.3, 0.4], the
combination C = 0.3 and Alpha = 0.2 gives the highest performance score of 0.726, and is
therefore selected.
RandomizedSearchCV
RandomizedSearchCV solves the drawbacks of GridSearchCV, as it goes through only a fixed
number of hyperparameter settings. It moves within the grid in a random fashion to find
the best set of hyperparameters. This approach reduces unnecessary computation.
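A minimal scikit-learn sketch of both strategies (the iris data and the hyperparameter values are placeholders; Alpha is not a standard LogisticRegression argument, so only C is tuned here):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 0.2, 0.3, 0.4, 0.5]}        # grid of hyperparameter values

grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid.fit(X, y)                                       # tries every combination in the grid
print(grid.best_params_, grid.best_score_)

rand = RandomizedSearchCV(LogisticRegression(max_iter=1000), param_grid,
                          n_iter=3, cv=5, random_state=0)
rand.fit(X, y)                                       # tries only a fixed number of settings
print(rand.best_params_, rand.best_score_)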
Batch Normalization
One of the most common problems for data science professionals is over-fitting.
Have you come across a situation where your model performed very well on the training
data but was unable to predict the test data accurately? The reason is that your model is
overfitting. The solution to such a problem is regularization.
Regularization techniques help to improve a model and allow it to converge faster. We
have several regularization tools at hand, some of them being early stopping, dropout, weight
initialization techniques, and batch normalization. Regularization helps prevent over-fitting
of the model, and the learning process becomes more efficient.
Here, in this article, we are going to explore one such technique, batch normalization
in detail.
Normalization
Normalization is a data pre-processing tool used to bring the numerical data to a
common scale without distorting its shape.
Generally, when we input the data to a machine or deep learning algorithm we tend to
change the values to a balanced scale. The reason we normalize is partly to ensure that our
model can generalize appropriately.
But what is the reason behind the term “Batch” in batch normalization? A typical
neural network is trained using a collected set of input data called batch. Similarly, the
normalizing process in batch normalization takes place in batches, not as a single input.
Let’s understand this through an example, we have a deep neural network as shown in
the following image.
Initially, our inputs X1, X2, X3, X4 are in normalized form as they are coming from
the pre-processing stage. When the input passes through the first layer, it transforms, as a
sigmoid function applied over the dot product of input X and the weight matrix W.
Similarly, this transformation will take place for the second layer and go till the last layer L
as shown in the following image.
Although our input X was normalized, with time the output will no longer be on the same
scale. As the data go through multiple layers of the neural network and L activation functions
are applied, this leads to an internal covariate shift in the data.
Normalization working
Since by now we have a clear idea of why we need Batch normalization, let’s
understand how it works. It is a two-step process. First, the input is normalized, and later
rescaling and offsetting is performed.
Normalization is the process of transforming the data to have mean zero and
standard deviation one. In this step we take our batch input from layer h and first calculate
the mean of the hidden activations over the batch.
Once we have the mean, the next step is to calculate the standard deviation of the
hidden activations.
Now that we have the mean and the standard deviation ready, we normalize the
hidden activations using these values. For this, we subtract the mean from each input
and divide the result by the sum of the standard deviation and the smoothing term (ε).
The smoothing term (ε) ensures numerical stability within the operation by preventing
division by zero.
Rescaling and Offsetting
In the final operation, the re-scaling and offsetting of the input take place. Here two
components of the BN algorithm come into the picture, γ(gamma) and β (beta). These
parameters are used for re-scaling (γ) and shifting(β) of the vector containing values from the
previous operations.
These two are learnable parameters; during training, the neural network finds the
optimal values of γ and β to use, which enables accurate normalization of each batch.
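A sketch of the two-step computation for one batch of hidden activations (the batch values, γ and β below are made up; note that this sketch divides by the square root of the variance plus ε, which is the usual formulation):

import numpy as np

def batch_norm(h, gamma, beta, eps=1e-5):
    mu = h.mean(axis=0)                      # step 1a: mean of the hidden activations over the batch
    var = h.var(axis=0)                      # step 1b: their variance (the std is sqrt(var))
    h_norm = (h - mu) / np.sqrt(var + eps)   # step 1c: normalize; eps prevents division by zero
    return gamma * h_norm + beta             # step 2: re-scale by gamma and shift by beta

h = np.array([[1.0, 200.0], [2.0, 260.0], [3.0, 230.0]])    # batch of 3 examples, two units
print(batch_norm(h, gamma=np.ones(2), beta=np.zeros(2)))    # roughly zero mean, unit variance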
Regularization
Overfitting is a phenomenon that occurs when a Machine Learning model is
constrained to the training set and is not able to perform well on unseen data.
1. L1 regularization
2. L2 regularization
3. Dropout regularization
This article focuses on L1 and L2 regularization.
L1 Regularization
A regression model which uses the L1 regularization technique is called LASSO (Least
Absolute Shrinkage and Selection Operator) regression.
Lasso regression adds the "absolute value of magnitude" of the coefficients as a penalty term
to the loss function (L).
L2 Regularization
Ridge regression adds the "squared magnitude" of the coefficients as a penalty term to the
loss function (L).
During regularization, the output function (y_hat) does not change. The change is only in the
loss function.
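A small sketch of how the two penalties change only the loss while y_hat stays the same (the data and the λ value are illustrative):

import numpy as np

def mse(w, X, y):
    return np.mean((X @ w - y) ** 2)                 # y_hat = X @ w is unchanged by regularization

def lasso_loss(w, X, y, lam):
    return mse(w, X, y) + lam * np.sum(np.abs(w))    # L1: absolute value of magnitude of coefficients

def ridge_loss(w, X, y, lam):
    return mse(w, X, y) + lam * np.sum(w ** 2)       # L2: squared magnitude of coefficients

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]])
y = np.array([5.0, 4.0, 9.0]); w = np.array([1.0, 2.0])
print(lasso_loss(w, X, y, lam=0.1), ridge_loss(w, X, y, lam=0.1))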