13 ANN (Artificial Neural Networks)
The future of AI
Restaurant Data Set
Limited Expressiveness of Perceptrons
The XOR affair
• Minsky and Papert (1969) showed that certain simple
functions cannot be represented by a perceptron
(e.g. Boolean XOR).
This result killed the field for over a decade!
• Mid-1980s: non-linear neural networks
(Rumelhart et al. 1986) revived the field
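To make the XOR limitation concrete: no single layer of weights can separate XOR, but one hidden layer of threshold units suffices. Below is a minimal hand-constructed sketch (weights chosen by hand rather than learned; the OR/NAND labels are ours):

```python
def step(z):
    """Hard threshold unit: fires iff its weighted input is positive."""
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    """XOR computed by one hidden layer of threshold units (weights set by hand)."""
    h1 = step(x1 + x2 - 0.5)        # OR-like hidden unit
    h2 = step(-x1 - x2 + 1.5)       # NAND-like hidden unit
    return step(h1 + h2 - 1.5)      # AND of the two hidden units

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))  # last column prints 0, 1, 1, 0
```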
Neural Networks
• Rich history, starting in the early 1940s
(McCulloch and Pitts 1943).
• Two views:
– Modeling the brain
– “Just” representation of complex functions
(Continuous; contrast decision trees)
• Much progress on both fronts.
• Has drawn interest from neuroscience,
cognitive science, AI, physics, statistics, and
CS/EE.
Neuron
Neural Structure
1. Cell body; one axon (delivers output to other connected neurons); many
dendrites (provide surface area for connections from other neurons).
2. The axon is a single long fiber, often 100 or more times the diameter of the
cell body in length. The axon connects via synapses to the dendrites of other cells.
Activation Functions:
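Two activation functions commonly used with such a unit are the hard step (threshold) function of the classic perceptron and the smooth sigmoid used for gradient-based training. A minimal sketch (the function names are ours):

```python
import numpy as np

def step(z):
    """Hard threshold: output 1 iff the weighted input exceeds 0."""
    return np.where(z > 0, 1.0, 0.0)

def sigmoid(z):
    """Smooth, differentiable 'soft threshold' with output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def unit_output(w, x, activation=sigmoid):
    """Output of a single unit: activation applied to the weighted sum of inputs."""
    return activation(np.dot(w, x))
```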
Backpropagation Training (Overview)
Training data:
– (x1, y1), …, (xn, yn), with target labels yz ∈ {0, 1}
Optimization Problem (single output neuron):
– Variables: network weights wij
– Obj.: minimize over w:  E = Σz=1..n (yz − o(xz))²
– Constraints: none
Algorithm: local search via gradient descent.
• Randomly initialize weights.
• Until performance is satisfactory,
– Compute the partial derivative ∂E/∂wij of the objective
function E for each weight wij
– Update each weight: wij ← wij − α · ∂E/∂wij (α is the learning rate)
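A minimal sketch of this procedure for a single sigmoid output unit with no hidden layer, assuming NumPy; the learning rate alpha, the epoch count, and the initialization scale are illustrative choices, not values from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, alpha=0.5, epochs=1000):
    """Gradient descent on E = sum_z (yz - o(xz))^2 for a single sigmoid unit."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=X.shape[1])   # randomly initialize weights
    b = 0.0
    for _ in range(epochs):
        o = sigmoid(X @ w + b)                   # network outputs o(xz)
        err = y - o                              # yz - o(xz)
        # dE/dwj = sum_z -2 * err_z * o_z * (1 - o_z) * x_zj
        grad_w = -2 * (err * o * (1 - o)) @ X
        grad_b = -2 * np.sum(err * o * (1 - o))
        w -= alpha * grad_w                      # wj <- wj - alpha * dE/dwj
        b -= alpha * grad_b
    return w, b
```

With hidden layers the same update is applied to every weight; propagating the error terms backwards through the layers to obtain the partial derivatives is what gives backpropagation its name.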
Smooth and Differentiable Threshold Function
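The standard choice here is the sigmoid σ(z) = 1/(1 + e^(−z)): it behaves like a softened step function, and its derivative can be written in terms of its own output, σ′(z) = σ(z)(1 − σ(z)), which keeps the gradient computation cheap. A small sanity-check sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The analytic derivative sigma'(z) = sigma(z) * (1 - sigma(z)) agrees with a
# numerical central-difference estimate.
z = np.linspace(-5.0, 5.0, 11)
analytic = sigmoid(z) * (1 - sigmoid(z))
numeric = (sigmoid(z + 1e-6) - sigmoid(z - 1e-6)) / 2e-6
assert np.allclose(analytic, numeric, atol=1e-5)
```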
• Given too many hidden units, a neural net will simply memorize the
input patterns (overfitting).
• Given too few hidden units, the network may not be able to
represent all of the necessary generalizations (underfitting).
How long should you train the net?
• If you train the net for too long, then you run the risk of
overfitting.
– Select the number of training iterations via cross-validation on a
held-out set (early stopping, sketched below).
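A minimal sketch of that early-stopping loop, assuming NumPy; update_step, eval_error, max_iters, and patience are all placeholder names and illustrative values:

```python
import numpy as np

def train_with_early_stopping(update_step, eval_error, max_iters=10000, patience=50):
    """Stop training once the error on the held-out set stops improving.

    update_step(): performs one weight update on the training data.
    eval_error(): returns the current error on the held-out (validation) data.
    Both are assumed to be closures over the network being trained.
    """
    best_err = np.inf
    best_iter = 0
    for t in range(max_iters):
        update_step()
        err = eval_error()
        if err < best_err:              # still improving on held-out data
            best_err, best_iter = err, t
        elif t - best_iter > patience:  # no improvement for a while -> stop
            break
    return best_iter, best_err
```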
Regularization
• Simpler models are better
• NNs with smaller/fewer weights are better
– Add a penalty on the total sum of absolute weights to the training objective
– Pareto-optimize the trade-off between data fit and the weight penalty
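A hedged sketch of that penalty: the sum of absolute weights (an L1 penalty) added to the squared-error objective. The coefficient lam and the function names are illustrative; lam is typically tuned on held-out data:

```python
import numpy as np

def regularized_loss(w, X, y, lam=0.01):
    """Squared error plus a penalty on the total sum of absolute weights."""
    o = 1.0 / (1.0 + np.exp(-(X @ w)))   # sigmoid outputs of a single-unit net
    data_term = np.sum((y - o) ** 2)     # fit to the training data
    penalty = lam * np.sum(np.abs(w))    # prefer smaller / fewer weights
    return data_term + penalty
```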
Design Decisions
• Choice of learning rate
• Stopping criterion – when should training stop?
• Network architecture
– How many hidden layers? How many hidden units
per layer?
– How should the units be connected? (Fully? Partially?
Using domain knowledge?)
• How many random restarts of the search (to escape local
optima) are needed to find a good optimum of the objective function?
Spiking Nets
• Represent continuous values using spike rates
– A unit emits a spike if the number of incoming spikes exceeds a threshold
– Implemented as a leaky counter (sketched below)
http://www.ine-news.org
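A minimal sketch of the leaky-counter idea (a leaky integrate-and-fire unit); the decay and threshold values are illustrative, not from the slides:

```python
def lif_neuron(incoming_spikes, decay=0.9, threshold=3.0):
    """Leaky integrate-and-fire unit: a leaky counter that spikes above a threshold.

    incoming_spikes: number of spikes arriving at each time step.
    Returns the output spike train (1 = spike, 0 = no spike).
    """
    v = 0.0                      # membrane potential (the "leaky counter")
    out = []
    for spikes in incoming_spikes:
        v = decay * v + spikes   # leak a little, then integrate incoming spikes
        if v > threshold:        # enough recent input -> emit a spike
            out.append(1)
            v = 0.0              # reset after firing
        else:
            out.append(0)
    return out

# A dense burst of input produces output spikes; sparse input leaks away.
print(lif_neuron([2, 2, 1, 0, 0, 1, 0, 4]))   # [0, 1, 0, 0, 0, 0, 0, 1]
```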
Spiking (figure from http://www.cs.uu.nl/research/techreps/repo/CS-2003/2003-008.pdf)
Recurrent networks
• Nodes connect
– Laterally
– Backwards
– To themselves
• Complex behavior
– Dynamics, Memory
www.stowa-nn.ihe.nl/ANN.htm
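A minimal sketch of a recurrent update, assuming NumPy: the hidden state is fed back into itself at each step, so the current output depends on the whole input history, which is the memory and dynamics mentioned above. Shapes and names are illustrative:

```python
import numpy as np

def run_recurrent(xs, W_in, W_rec, b):
    """Minimal recurrent layer: each node's output feeds back into the layer.

    xs: sequence of input vectors. The returned hidden states depend on the
    entire history of inputs, which is what gives the network memory.
    """
    h = np.zeros(W_rec.shape[0])
    states = []
    for x in xs:
        h = np.tanh(W_in @ x + W_rec @ h + b)   # lateral / backward / self connections
        states.append(h)
    return states
```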
Learning Network Topology
• Optimal Brain Damage algorithm: prunes a network
– Trains a fully connected network
– Removes the connections and nodes that contribute least
to performance, using information-theoretic criteria
– Repeats until performance starts decreasing
(a simplified pruning sketch follows this list)
• Tiling algorithm: Grows networks
– Start with a small network that classifies many
examples
– Repeatedly add more nodes to classify remaining
examples
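A simplified sketch of the Optimal Brain Damage pruning loop described above, assuming NumPy. For brevity the "contribution" of a connection is approximated here by its weight magnitude; the actual algorithm uses a second-derivative (saliency) estimate. The names eval_performance and retrain are placeholders for the surrounding training code:

```python
import numpy as np

def prune_network(weights, eval_performance, retrain, fraction=0.05):
    """Iteratively remove the least-important connections until performance drops.

    weights: flat array of connection weights of a trained, fully connected net.
    eval_performance(weights) -> float; retrain(weights) -> weights.
    """
    best_perf = eval_performance(weights)
    while True:
        candidate = weights.copy()
        alive = np.flatnonzero(candidate)
        # Approximate "contributes least" by smallest |w| (OBD uses saliency instead).
        k = max(1, int(fraction * alive.size))
        smallest = alive[np.argsort(np.abs(candidate[alive]))[:k]]
        candidate[smallest] = 0.0
        candidate = retrain(candidate)           # retrain the pruned network
        perf = eval_performance(candidate)
        if perf < best_perf:                     # performance starts decreasing -> stop
            return weights
        weights, best_perf = candidate, perf
```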
Hyper-Networks
• Use a network to generate a network
– E.g., to determine connection weight wij, use a network that
takes i and j as inputs and produces wij.
– In 2D: the generating network maps each coordinate pair (i, j) to a weight (sketched below)
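A minimal sketch of the idea, assuming NumPy: a small MLP takes the coordinates (i, j) of a connection in a target network and outputs the proposed weight wij. Layer sizes and the random initialization are illustrative:

```python
import numpy as np

def make_hypernetwork(hidden=16, seed=0):
    """Tiny hyper-network: an MLP mapping a connection's coordinates (i, j)
    to the weight wij of a target network."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(hidden, 2))
    b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.5, size=hidden)

    def weight_for(i, j):
        h = np.tanh(W1 @ np.array([float(i), float(j)]) + b1)
        return float(W2 @ h)                     # proposed weight wij
    return weight_for

# Fill in the weight matrix of a hypothetical 4x3 target layer.
g = make_hypernetwork()
W_target = np.array([[g(i, j) for j in range(3)] for i in range(4)])
```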