Learning Intro Notes
S Luz
[email protected]
October 22, 2013
[Figure: architecture of a learning agent. Perception feeds an internal representation; a critic uses rewards/instruction to assess performance and inform the learner; the learner makes changes to the planner's action policy in line with the agent's goals; the planner selects actions, which the actuators carry out in interaction with the environment.]
Some questions
Could you give us some examples?
YES:
Draughts (checkers)
Noughts & crosses (tic-tac-toe)
Defining learning
ML has been studied from various perspectives (AI, control theory, statistics, information theory, ...) (Mitchell, 1997, p. 2)
Statistics, model-fitting, ...
Some examples
Data Mining
if Name = Corners & Energy < 25
then
    turn(91 - (Bearing - const))
    fire(3)
Recommendation services,
Bayes spam filtering
JIT information retrieval
Training experience: How will the system access and use data?
Target function: What exactly should be learned?
Hypothesis representation: How will we represent the concepts to be learnt?
Inductive inference: What specific algorithm should be used to learn the target concepts?
Determining the target function
Given a function $f : X \rightarrow C$ trained on a set of instances $D_c$ describing a concept $c$, we say that the inductive bias of $f$ is a minimal set of assertions $B$ such that, for every instance $x$ in $X$:
$$\forall x \in X \; (B \wedge D_c \wedge x \;\vdash\; f(x))$$
This should not be confused with estimation bias, which is a quantity (rather than a set of assumptions) measuring the average loss due to misclassification. Together with variance, this quantity determines the level of generalisation error of a learner (Domingos, 2012).
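As a point of reference (the standard result for squared loss, stated here rather than taken from the cited sources), expected generalisation error decomposes into these two quantities plus an irreducible noise term:
$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] \;=\; \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2} \;+\; \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}} \;+\; \underbrace{\sigma^2}_{\text{noise}}$$
where $y = f(x) + \epsilon$ with $\mathrm{Var}(\epsilon) = \sigma^2$, and the expectations are taken over training sets.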
Choosing an algorithm
Induction task as search, in a large space of hypotheses, for a hypothesis (or model) that fits the data (the sample of the target function available to the learner).
The choice of learning algorithm is conditioned on the choice of representation.
Since the target function is not completely accessible to the learner, the algorithm needs to operate under the inductive learning assumption that:
an approximation that performs well over a sufficiently large set of instances will perform well on unseen data.
Reinforcement learning: noughts and crosses (Sutton and Barto, 1998)
[Figure: an example noughts-and-crosses board position.]
Task? (target function, data representation) Training experience? Performance measure?
A target for a draughts learner
$$\hat{f}(b) = w_0 + w_1\,bp(b) + w_2\,rp(b) + w_3\,bk(b) + w_4\,rk(b) + w_5\,bt(b) + w_6\,rt(b)$$
where:
bp(b): number of black pieces on board b
rp(b): number of red pieces on b
bk(b): number of black kings on b
rk(b): number of red kings on b
bt(b): number of red pieces threatened by black
rt(b): number of black pieces threatened by red
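A minimal sketch of this representation in Python (the Board container and its feature counts are assumptions made here purely for illustration, not part of the notes):

    # Linear evaluation function for draughts: a weighted sum of the six
    # board features listed above, plus a constant term w_0.
    from dataclasses import dataclass

    @dataclass
    class Board:
        bp: int  # black pieces on the board
        rp: int  # red pieces
        bk: int  # black kings
        rk: int  # red kings
        bt: int  # red pieces threatened by black
        rt: int  # black pieces threatened by red

    def features(b: Board) -> list[float]:
        # t_0 = 1 accounts for the constant term w_0
        return [1.0, b.bp, b.rp, b.bk, b.rk, b.bt, b.rt]

    def f_hat(weights: list[float], b: Board) -> float:
        # f_hat(b) = w_0 + w_1*bp(b) + ... + w_6*rt(b)
        return sum(w * t for w, t in zip(weights, features(b)))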
Training Experience
A simple rule for obtaining (estimating) training values is to use the current approximation at the successor position:
$$f_{train}(b) \leftarrow \hat{f}(Successor(b))$$
The LMS rule then updates each weight in proportion to the error and the corresponding feature value,
$$w_i \leftarrow w_i + \eta \; error(b) \; t_i, \qquad error(b) = f_{train}(b) - \hat{f}(b),$$
where $\eta$ is a small positive constant (the learning rate). LMS minimises the squared error between the training data and the current approximation:
$$E = \sum_{\langle b, f_{train}(b) \rangle \in D} (f_{train}(b) - \hat{f}(b))^2$$
Notice that if $error(b) = 0$ (i.e. target and approximation match) no weights change. Similarly, if $t_i = 0$ (i.e. feature $t_i$ does not occur) the corresponding weight does not get updated. This weight update rule can be shown to perform a gradient descent search for the minimal squared error (i.e. weight updates are proportional to $-\nabla E$, where $\nabla E = [\frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots]$).
That the LMS weight update rule implements gradient descent can be seen by differentiating $E$:
$$\begin{aligned}
\frac{\partial E}{\partial w_i} &= \frac{\partial}{\partial w_i} \sum_{\langle b, f_{train}(b) \rangle \in D} [f_{train}(b) - \hat{f}(b)]^2 \\
&= \sum_{b \in D} \frac{\partial}{\partial w_i} [f_{train}(b) - \hat{f}(b)]^2 \\
&= \sum_{b \in D} 2\,[f_{train}(b) - \hat{f}(b)]\; \frac{\partial}{\partial w_i} [f_{train}(b) - \hat{f}(b)] \\
&= \sum_{b \in D} 2\,[f_{train}(b) - \hat{f}(b)]\; \frac{\partial}{\partial w_i} \Big[ f_{train}(b) - \sum_j w_j t_j \Big] \\
&= -\sum_{b \in D} 2\; error(b)\; t_i
\end{aligned}$$
so updating $w_i$ by $\eta\, error(b)\, t_i$ moves the weights in the direction of $-\frac{\partial E}{\partial w_i}$, as required.
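Continuing the illustrative Python sketch above, one LMS training step might look as follows (the learning rate value and the successor helper are assumptions, not part of the notes):

    # One LMS update: move each weight in proportion to the prediction error
    # and the corresponding feature value (a gradient descent step).
    def lms_update(weights: list[float], b: Board, f_train: float,
                   eta: float = 0.01) -> list[float]:
        error = f_train - f_hat(weights, b)            # error(b)
        return [w + eta * error * t                    # w_i <- w_i + eta * error(b) * t_i
                for w, t in zip(weights, features(b))]

    # Estimating the training value from a successor position and updating:
    # (successor(b) is a hypothetical function returning the next board from
    # which the learner moves again)
    # f_train_b = f_hat(weights, successor(b))
    # weights = lms_update(weights, b, f_train_b)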
[Figure (from Mitchell, 1997): the sequence of design decisions for a learning system. Determine the type of training experience; determine the target function; determine the representation of the learned function (e.g. a linear function of six features, an artificial neural network, a polynomial, ...); determine the learning algorithm (e.g. gradient descent, linear programming, ...); completed design.]
These are some of the decisions involved in ML design. A number of other practical factors, such as evaluation, avoidance of overfitting and feature engineering, also come into play. See (Domingos, 2012) for a useful introduction and some machine learning folk wisdom.
[Figure: the learning agent architecture instantiated for the draughts learner. The critic turns rewards/instruction into training examples $(b, f_{train}(b)), \ldots$; the learner makes changes to $\hat{f}$ in line with the goals; the planner's action policy selects the move leading to the board $s$ that maximises $\hat{f}(s)$ ($\arg\max_s \hat{f}(s)$), starting from the initial board, and the actuators carry out the action.]
Reinforcement Learning
Basic concepts of Reinforcement Learning
The policy: defines the learning agent's way of behaving at a given time:
$$\pi : S \rightarrow A$$
The (long term) value function: the total amount of reward an agent can expect to accumulate over the future:
$$V : S \rightarrow \mathbb{R}$$
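A small tabular illustration in Python, with states and actions represented as strings and the policy and value function as dictionaries (all names and values here are hypothetical):

    # Tabular policy (pi : S -> A) and value function (V : S -> R).
    from typing import Dict

    State = str
    Action = str

    policy: Dict[State, Action] = {"s0": "a1", "s1": "a0"}
    value: Dict[State, float] = {"s0": 0.5, "s1": 0.9}

    def act(s: State) -> Action:
        # Behave according to the current policy.
        return policy[s]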
Theoretical background
Engineering: optimal control (dating back to the 50s)
Markov Decision Processes (MDPs)
Dynamic programming
Law of effect: "Of several responses made to the same situation, those which are accompanied or closely followed by satisfaction to the animal will, other things being equal, be more firmly connected with the situation, so that, when it recurs, they will be more likely to recur; those which are accompanied or closely followed by discomfort to the animal will, other things being equal, have their connections with that situation weakened, so that, when it recurs, they will be less likely to occur. The greater the satisfaction or discomfort, the greater the strengthening or weakening of the bond." (Thorndike, 1911, p. 244, quoted by Sutton and Barto, 1998)
The selectional aspect means that the learner selects from a large pool of complete policies, overlooking individual states (e.g. the contribution of a particular move to winning the game).
The associative aspect means that the learner associates each action (move) with a value, but does not select (or compare) policies as a whole.
Example: Noughts and crosses
Assign values to each possible game state (e.g. the probability of winning
from that state):
[Figure: a sequence of noughts-and-crosses states $s_0, s_1, \ldots, s_k$, alternating the opponent's moves with our (greedy) moves and the occasional exploratory move. For greedy moves, the value of the earlier state is backed up towards the value of the later state using the TD learning update rule $V(s) \leftarrow V(s) + \alpha\,[V(s') - V(s)]$, where $\alpha$ is the step-size parameter (learning rate).]
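A minimal sketch of this update in Python; the string encoding of board states, the initial value of 0.5 and the step size are illustrative assumptions:

    # TD(0)-style value update for noughts and crosses:
    #   V(s) <- V(s) + alpha * (V(s_next) - V(s))
    from collections import defaultdict

    ALPHA = 0.1                                  # step-size parameter (learning rate)
    V = defaultdict(lambda: 0.5)                 # state -> estimated probability of winning

    def td_backup(s: str, s_next: str) -> None:
        # Back up the earlier state's value towards the later state's value
        # (applied after greedy moves).
        V[s] += ALPHA * (V[s_next] - V[s])

    # Example with hypothetical 9-character board encodings:
    td_backup("X..OX....", "X..OXO..X")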
References
Domingos, P. (2012). A few useful things to know about machine learning. Communications of the ACM, 55(10):78–87.
Mitchell, T. M. (1997). Machine Learning. McGraw-Hill.
Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.
Thorndike, E. L. (1911). Animal Intelligence. Macmillan.