Lec 12
Artificial Neural Networks:
supervised and unsupervised
• A neural network is a massively parallel distributed computing system
that has a natural tendency for storing experiential knowledge and
making it available for use. It resembles the brain in two respects:
• Knowledge is acquired by the network through a learning process (called training)
• Interneuron connection strengths, known as synaptic weights, are used to store the knowledge
• Knowledge in artificial neural networks is implicit and distributed.
• Advantages
• Excellent for pattern recognition
• Excellent classifiers
• Handles noisy data well
• Good for generalization
Drawbacks
• The power of ANNs lies in their parallel architecture
– Unfortunately, most machines we have are serial (Von Neumann
architecture)
• Lack of defined rules to build a neural network for a specific
problem
– Too many variables, for instance, the learning algorithm, number of
neurons per layer, number of layers, data representation etc.
• Knowledge is implicit
• Data dependency
But these drawbacks do not mean that neural networks are useless artifacts. They are still arguably very powerful general-purpose problem solvers.
Learning methodology
o Supervised
Given a set of example input/output pairs, find a rule
that does a good job of predicting the output associated with
a new input.
o Unsupervised
Given a set of examples with no labeling, group them
into sets called clusters
Knowledge is not explicitly represented in ANNs. Knowledge
is primarily encoded in the weights of the neurons within the
network
Design phases of ANNs
• Feature Representation
– The number of features is determined by the number of inputs for the problem.
• Training
– Training is either supervised or unsupervised.
• Similarity Measurement
– A measure to tell the difference between the actual output of
the network while training and the desired labeled output
• Validation
– During training, the training data is divided into k subsets; k-1 subsets are used for training, and the remaining subset is used for cross-validation. This gives a more reliable estimate of performance and helps avoid over-fitting.
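A minimal Python sketch of this k-fold split, assuming index-based folds and an arbitrary choice of k = 5 (neither detail is specified on the slide):

import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    # Shuffle the sample indices, split them into k folds, and let each fold
    # serve once as the validation set while the other k-1 folds are used for training.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]

# Example: 100 training samples split into 5 folds of 20 validation samples each
for train_idx, val_idx in k_fold_indices(100, k=5):
    print(len(train_idx), "training /", len(val_idx), "validation")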
Supervised
• Given a set of example input/output pairs, find a rule that does a good
job of predicting the output associated with a new input.
Back propagation algorithm
• 1. Randomize the weights {w_s} to small random values (both positive and negative)
• 2. Select a training instance t, i.e., a pair of input and output patterns with input vector {x_i(t)}, i = 1,...,N_inp, from the training set
• 3. Apply the input vector to the network input
• 4. Calculate the network output vector {z_k(t)}, k = 1,...,N_out
• 5. Calculate the errors e_k for each of the outputs, k = 1,...,N_out: the difference between the desired output and the network output
• 6. Calculate the necessary weight updates Δw_s in a way that minimizes this error
• 7. Adjust the weights of the network by Δw_s
• 8. Repeat these steps for each instance (input–output pair) in the training set until the error for the entire system falls below an acceptable threshold
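The following is a minimal Python sketch of the loop above for a single-hidden-layer network with sigmoid units and squared error; the layer sizes, learning rate, stopping threshold, and XOR-style toy data are illustrative assumptions, not part of the slide:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # input patterns
Y = np.array([[0], [1], [1], [0]], dtype=float)               # desired outputs
n_in, n_hid, n_out, lr = 2, 4, 1, 0.5

# Step 1: randomize the weights to small positive/negative values
W1 = rng.uniform(-0.5, 0.5, (n_in, n_hid))
W2 = rng.uniform(-0.5, 0.5, (n_hid, n_out))

for epoch in range(5000):
    total_error = 0.0
    for x, y in zip(X, Y):                 # Steps 2-3: present each training instance
        h = sigmoid(x @ W1)                # hidden-layer activations
        z = sigmoid(h @ W2)                # Step 4: network output vector
        err = y - z                        # Step 5: output errors
        total_error += float(err @ err)
        # Step 6: compute weight updates (gradient of the squared error)
        delta_out = err * z * (1 - z)
        delta_hid = (delta_out @ W2.T) * h * (1 - h)
        # Step 7: adjust the weights by the updates
        W2 += lr * np.outer(h, delta_out)
        W1 += lr * np.outer(x, delta_hid)
    if total_error < 1e-3:                 # Step 8: stop when the overall error is small
        break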
Unsupervised
• Given a set of examples with no labeling,
group them into sets called clusters
• A cluster represents some specific underlying
patterns in the data
• Useful for finding patterns in large data sets
• Form clusters of input data
• Map the clusters into outputs
• Given a new example, find its cluster, and
generate the associated output
Self-organizing neural networks:
clustering, quantization, function approximation, Kohonen maps
1. Each node's weights are initialized
2. A data input from training data (vector) is chosen at random and
presented to the cluster lattice
3. Every cluster centre is examined to calculate which weights are most
like the input vector. The winning node is commonly known as the Best
Matching Unit (BMU)
4. The radius of the neighborhood of the BMU is now calculated. Any
nodes found within this radius are deemed to be inside the BMU's
neighborhood
5. Each neighboring node's (the nodes found in step 4) weights are
adjusted to make them more like the input vector. The closer a node is to
the BMU, the more its weights get altered
6. Repeat steps for N iterations
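A compact Python sketch of the procedure above; the grid size, decay schedules for the radius and learning rate, and the random toy data are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
data = rng.random((200, 3))                      # training vectors (e.g., RGB colours)
grid_h, grid_w, n_iter = 10, 10, 1000
sigma0, lr0 = max(grid_h, grid_w) / 2.0, 0.1

# Step 1: initialize each node's weight vector
weights = rng.random((grid_h, grid_w, data.shape[1]))
rows, cols = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")

for t in range(n_iter):
    x = data[rng.integers(len(data))]            # Step 2: pick a random input vector
    # Step 3: find the Best Matching Unit (node whose weights are most like x)
    dist = np.linalg.norm(weights - x, axis=2)
    bmu = np.unravel_index(np.argmin(dist), dist.shape)
    # Step 4: the neighborhood radius (and learning rate) shrink over time
    sigma = sigma0 * np.exp(-t / n_iter)
    lr = lr0 * np.exp(-t / n_iter)
    # Step 5: move nodes near the BMU toward x, more strongly the closer they are
    grid_dist2 = (rows - bmu[0]) ** 2 + (cols - bmu[1]) ** 2
    influence = np.exp(-grid_dist2 / (2 * sigma ** 2))[..., None]
    weights += lr * influence * (x - weights)
# Step 6: after N iterations, `weights` holds the organized cluster centres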
Different kinds of learning…
• Supervised learning:
  – Someone gives us examples and the right answer for those examples
  – We have to predict the right answer for unseen examples
• Unsupervised learning:
  – We see examples but get no feedback
  – We need to find patterns in the data
• Reinforcement learning:
  – We take actions and get rewards
  – Have to learn how to get high rewards
Example of supervised learning: classification
• We lend money to people
• We have to predict whether they will pay us back or not
• People have various (say, binary) features:
  – do we know their Address? do they have a Criminal record? high Income? Educated? Old? Unemployed?
• We see examples: (Y = paid back, N = not)
  +a, -c, +i, +e, +o, +u: Y
  -a, +c, -i, +e, -o, -u: N
  +a, -c, +i, -e, -o, -u: Y
  -a, -c, +i, +e, -o, -u: Y
  -a, +c, +i, -e, -o, -u: N
  -a, -c, +i, -e, -o, +u: Y
  +a, -c, -i, -e, +o, -u: N
  +a, +c, +i, -e, +o, -u: N
• Next person is +a, -c, +i, -e, +o, -u. Will we get paid back?
Classification…
• We want some hypothesis h that predicts whether we will be paid back
  +a, -c, +i, +e, +o, +u: Y
  -a, +c, -i, +e, -o, -u: N
  +a, -c, +i, -e, -o, -u: Y
  -a, -c, +i, +e, -o, -u: Y
  -a, +c, +i, -e, -o, -u: N
  -a, -c, +i, -e, -o, +u: Y
  +a, -c, -i, -e, +o, -u: N
  +a, +c, +i, -e, +o, -u: N
• Lots of possible hypotheses: will be paid back if…
  – Income is high (wrong on 2 occasions in training data)
  – Income is high and no Criminal record (always right in training data)
  – (Address is known AND ((NOT Old) OR Unemployed)) OR ((NOT Address is known) AND (NOT Criminal Record)) (always right in training data)
• Which one seems best? Anything better?
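A small Python sketch that checks the three candidate hypotheses against the eight training examples; the 0/1 encoding of the +/- features is an assumption made for illustration:

examples = [  # (address, criminal, income, educated, old, unemployed, paid_back)
    (1, 0, 1, 1, 1, 1, "Y"),
    (0, 1, 0, 1, 0, 0, "N"),
    (1, 0, 1, 0, 0, 0, "Y"),
    (0, 0, 1, 1, 0, 0, "Y"),
    (0, 1, 1, 0, 0, 0, "N"),
    (0, 0, 1, 0, 0, 1, "Y"),
    (1, 0, 0, 0, 1, 0, "N"),
    (1, 1, 1, 0, 1, 0, "N"),
]

hypotheses = {
    "income high": lambda a, c, i, e, o, u: i,
    "income high and no criminal record": lambda a, c, i, e, o, u: i and not c,
    "(address and (not old or unemployed)) or (not address and not criminal)":
        lambda a, c, i, e, o, u: (a and (not o or u)) or (not a and not c),
}

for name, h in hypotheses.items():
    errors = sum(("Y" if h(*ex[:6]) else "N") != ex[6] for ex in examples)
    print(f"{name}: {errors} error(s) on the training data")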
Occam's Razor
• Occam's razor: simpler hypotheses tend to generalize to future data better
• Intuition: given limited training data,
  – it is likely that there is some complicated hypothesis that is not actually good but that happens to perform well on the training data
  – it is less likely that there is a simple hypothesis that is not actually good but that happens to perform well on the training data
• There are fewer simple hypotheses
• Computational learning theory studies this in much more depth
Different approach: nearest neighbor(s)
• Next person is -a, +c, -i, +e, -o, +u. Will we get paid back?
• Nearest neighbor: simply look at the most similar example in the training data and see what happened there
  +a, -c, +i, +e, +o, +u: Y (distance 4)
  -a, +c, -i, +e, -o, -u: N (distance 1)
  +a, -c, +i, -e, -o, -u: Y (distance 5)
  -a, -c, +i, +e, -o, -u: Y (distance 3)
  -a, +c, +i, -e, -o, -u: N (distance 3)
  -a, -c, +i, -e, -o, +u: Y (distance 3)
  +a, -c, -i, -e, +o, -u: N (distance 5)
  +a, +c, +i, -e, +o, -u: N (distance 5)
• The nearest neighbor is the second example, so predict N
• k nearest neighbors: look at the k nearest neighbors and take a vote
  – E.g., the 5 nearest neighbors have 3 Ys and 2 Ns, so predict Y
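A quick Python sketch of the same idea, using Hamming distance over the six binary features (the 0/1 encoding is again an assumption):

examples = [
    ((1, 0, 1, 1, 1, 1), "Y"),
    ((0, 1, 0, 1, 0, 0), "N"),
    ((1, 0, 1, 0, 0, 0), "Y"),
    ((0, 0, 1, 1, 0, 0), "Y"),
    ((0, 1, 1, 0, 0, 0), "N"),
    ((0, 0, 1, 0, 0, 1), "Y"),
    ((1, 0, 0, 0, 1, 0), "N"),
    ((1, 1, 1, 0, 1, 0), "N"),
]
query = (0, 1, 0, 1, 0, 1)   # -a, +c, -i, +e, -o, +u

def hamming(x, y):
    # number of features on which the two examples disagree
    return sum(a != b for a, b in zip(x, y))

# 1-nearest neighbor: copy the label of the single closest example
ranked = sorted(examples, key=lambda ex: hamming(query, ex[0]))
print("1-NN prediction:", ranked[0][1])          # the closest example (distance 1) is an N

# k nearest neighbors: take a majority vote among the k closest
k = 5
votes = [label for _, label in ranked[:k]]
print(f"{k}-NN prediction:", max(set(votes), key=votes.count))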
Another approach: perceptrons
• Place a weight on every attribute, indicating how important that attribute is (and in which direction it affects things)
• E.g., wa = 1, wc = -5, wi = 4, we = 1, wo = 0, wu = -1
  +a, -c, +i, +e, +o, +u: Y (score 1+4+1+0-1 = 5)
  -a, +c, -i, +e, -o, -u: N (score -5+1 = -4)
  +a, -c, +i, -e, -o, -u: Y (score 1+4 = 5)
  -a, -c, +i, +e, -o, -u: Y (score 4+1 = 5)
  -a, +c, +i, -e, -o, -u: N (score -5+4 = -1)
  -a, -c, +i, -e, -o, +u: Y (score 4-1 = 3)
  +a, -c, -i, -e, +o, -u: N (score 1+0 = 1)
  +a, +c, +i, -e, +o, -u: N (score 1-5+4+0 = 0)
• Need to set some threshold above which we predict to be paid back (say, 2)
• May care about combinations of things (nonlinearity) – generalization: neural networks
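A short Python sketch of this perceptron-style scoring, using the weights and threshold from the slide (the 0/1 feature encoding is an assumption):

weights = {"a": 1, "c": -5, "i": 4, "e": 1, "o": 0, "u": -1}
threshold = 2

examples = [
    ({"a": 1, "c": 0, "i": 1, "e": 1, "o": 1, "u": 1}, "Y"),
    ({"a": 0, "c": 1, "i": 0, "e": 1, "o": 0, "u": 0}, "N"),
    ({"a": 1, "c": 0, "i": 1, "e": 0, "o": 0, "u": 0}, "Y"),
    ({"a": 0, "c": 0, "i": 1, "e": 1, "o": 0, "u": 0}, "Y"),
    ({"a": 0, "c": 1, "i": 1, "e": 0, "o": 0, "u": 0}, "N"),
    ({"a": 0, "c": 0, "i": 1, "e": 0, "o": 0, "u": 1}, "Y"),
    ({"a": 1, "c": 0, "i": 0, "e": 0, "o": 1, "u": 0}, "N"),
    ({"a": 1, "c": 1, "i": 1, "e": 0, "o": 1, "u": 0}, "N"),
]

for features, label in examples:
    # score = weighted sum over the features that are present
    score = sum(weights[f] for f, present in features.items() if present)
    prediction = "Y" if score > threshold else "N"
    print(f"score={score:3d}  predict={prediction}  actual={label}")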
Reinforcement learning
• There are three routes you can take to work: A, B, C
• The times you took A, it took: 10, 60, 30 minutes
• The times you took B, it took: 32, 31, 34 minutes
• The time you took C, it took: 50 minutes
• What should you do next?
• Exploration vs. exploitation tradeoff
  – Exploration: try to explore underexplored options
  – Exploitation: stick with options that look best now
• Reinforcement learning is usually studied in MDPs
  – Take an action, observe the reward and the new state
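A minimal epsilon-greedy sketch of the exploration/exploitation tradeoff over the three routes; the observed times come from the slide, while epsilon and the choice rule are illustrative assumptions:

import random

observed = {"A": [10, 60, 30], "B": [32, 31, 34], "C": [50]}
epsilon = 0.1   # fraction of the time we explore a random route

def average(times):
    return sum(times) / len(times)

def choose_route():
    if random.random() < epsilon:
        return random.choice(list(observed))                  # exploration: try any route
    return min(observed, key=lambda r: average(observed[r]))  # exploitation: best average so far

print("next route:", choose_route())
# Exploitation alone would always pick B (average ~32.3 min), but occasional
# exploration of A or C may reveal that one of them is actually better.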
Bayesian approach to learning
• Assume we have a prior distribution over the long-term behavior of A
  – With probability .6, A is a "fast route" which:
    • With prob. .25, takes 20 minutes
    • With prob. .5, takes 30 minutes
    • With prob. .25, takes 40 minutes
  – With probability .4, A is a "slow route" which:
    • With prob. .25, takes 30 minutes
    • With prob. .5, takes 40 minutes
    • With prob. .25, takes 50 minutes
• We travel on A once and see it takes 30 minutes
• P(A is fast | observation) = P(observation | A is fast) * P(A is fast) / P(observation) = .5*.6 / (.5*.6 + .25*.4) = .3 / (.3 + .1) = .75
• Convenient approach for decision theory, game theory
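A short Python check of the posterior computation above, using the prior and likelihoods from the slide:

prior = {"fast": 0.6, "slow": 0.4}              # P(A is fast), P(A is slow)
likelihood_30 = {"fast": 0.5, "slow": 0.25}     # P(trip takes 30 min | route type)

evidence = sum(prior[t] * likelihood_30[t] for t in prior)          # P(observation)
posterior_fast = prior["fast"] * likelihood_30["fast"] / evidence   # Bayes' rule
print(posterior_fast)   # 0.75, matching the slide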
Learning in game theory
• Like the 2/3-of-the-average game
• Very tricky because other agents learn at the same time
• From one agent's perspective, the environment is changing
  – Taking the average of past observations may not be a good idea