Neural Network Presentation
What can they do? How do they work? What might we use them for in our project? Why are they so cool?
History
- Late 1800s: Neural networks appear as an analogy to biological systems
- 1960s and 70s: Simple neural networks appear
- They fall out of favor because the perceptron is not effective by itself, and there were no good algorithms for multilayer nets
- Neural networks have a resurgence in popularity
Applications
- ALVINN
- TD-GAMMON
ALVINN
- Autonomous Land Vehicle in a Neural Network
- Robotic car
- Created in the 1980s by Dean Pomerleau
- 1995:
  - Drove 1000 miles in traffic at speeds of up to 120 MPH
  - Steered the car coast to coast (throttle and brakes controlled by a human)
TD-GAMMON
- Plays backgammon
- Created by Gerry Tesauro in the early 90s
- Uses temporal-difference learning, TD(lambda), a relative of Q-learning (similar to what we might use)
- Trained on over 1 million games played against itself
- Plays competitively at world-class level
Basic Idea
Properties
- Resistant to errors in the training data
- Long training time
- Fast evaluation
- The function produced can be difficult for humans to interpret
Perceptrons
- N inputs, x1 ... xn
- Weights for each input, w1 ... wn
- A bias input x0 (constant) and associated weight w0
- Weighted sum of inputs, y = w0x0 + w1x1 + ... + wnxn
- A threshold function, i.e. 1 if y > 0, -1 if y <= 0
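The perceptron described above can be sketched in a few lines of Python. This is only an illustration: the function name `perceptron_output` and the default bias input of 1.0 are assumptions made for the example, not something the slides specify.

```python
def perceptron_output(weights, inputs, bias_input=1.0):
    """Return +1 or -1 for a single perceptron.

    weights[0] is w0 (the bias weight); weights[1:] pair up with inputs.
    bias_input is the constant x0 (assumed to be 1.0 here).
    """
    # Prepend the constant bias input x0 so the sum runs over i = 0..n.
    x = [bias_input] + list(inputs)
    # Weighted sum y = w0*x0 + w1*x1 + ... + wn*xn
    y = sum(w * xi for w, xi in zip(weights, x))
    # Threshold function: 1 if y > 0, -1 if y <= 0
    return 1 if y > 0 else -1
```

For example, `perceptron_output([-0.5, 1.0, 1.0], [1, 0])` computes y = -0.5 + 1 + 0 = 0.5 and so fires.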
Diagram
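[Diagram: inputs x0 (bias), x1, x2, ..., xn with weights w0, w1, w2, ..., wn feed a single unit that computes y = sum over i of wi xi, followed by the threshold function]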
Linear Separator
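- A perceptron can separate this: [figure: + and - points in the (x1, x2) plane that a single straight line divides]
- But not this (XOR): [figure: + and - points in the (x1, x2) plane that no single straight line divides]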
Boolean Functions
With inputs of 0 (false) and 1 (true):
- x1 AND x2: x0 = -1, w0 = 1.5, w1 = 1, w2 = 1
- x1 OR x2: x0 = -1, w0 = 0.5, w1 = 1, w2 = 1
- NOT x1: x0 = -1, w0 = -0.5, w1 = -1
- Thus all boolean functions can be represented by layers of perceptrons!
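As a quick check of these weights, here is a small Python sketch. It assumes inputs of 0 and 1, reports the unit's output as 1 or 0 so that units can be chained, and uses w1 = -1 for the NOT unit (the only sign for which that unit actually computes NOT). The helper names are made up for the example. XOR, which a single perceptron cannot represent, is built from two layers.

```python
def fires(weights, inputs, x0=-1.0):
    """Threshold unit: 1 if w0*x0 + sum(wi*xi) > 0, else 0."""
    y = weights[0] * x0 + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if y > 0 else 0

def AND(a, b):
    return fires([1.5, 1.0, 1.0], [a, b])

def OR(a, b):
    return fires([0.5, 1.0, 1.0], [a, b])

def NOT(a):
    return fires([-0.5, -1.0], [a])

def XOR(a, b):
    # Two layers: (a OR b) AND NOT (a AND b)
    return AND(OR(a, b), NOT(AND(a, b)))

truth_table = [(a, b, XOR(a, b)) for a in (0, 1) for b in (0, 1)]
```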
Gradient Descent
- The perceptron training rule may not converge if the points are not linearly separable
- Gradient descent addresses this by changing the weights according to the total error over all training points, rather than the error on each individual point
- If the data is not linearly separable, it will still converge to the best fit
Gradient Descent
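Error function:
$$E(\vec{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$$
Weight update:
$$\Delta w_i = -\eta \frac{\partial E}{\partial w_i} = \eta \sum_{d \in D} (t_d - o_d)\, x_{id}$$
$$w_i \leftarrow w_i + \Delta w_i$$
Here $t_d$ is the target and $o_d$ the output for training example $d$, $x_{id}$ is input $i$ of that example, and $\eta$ is the learning rate.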
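A minimal Python sketch of this batch update for a single linear unit follows. The function names, the default learning rate, and the fixed number of epochs are assumptions chosen for the example. The weights change only once per pass, after the contributions of all training points have been summed.

```python
def linear_output(w, x):
    """Unthresholded output o = w . x of a linear unit."""
    return sum(wi * xi for wi, xi in zip(w, x))

def batch_gradient_descent(examples, eta=0.05, epochs=500):
    """examples: list of (x, t) pairs; x[0] should be the constant bias input."""
    n = len(examples[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        # Accumulate delta_w_i = eta * sum_d (t_d - o_d) * x_id over all examples.
        delta = [0.0] * n
        for x, t in examples:
            o = linear_output(w, x)
            for i in range(n):
                delta[i] += eta * (t - o) * x[i]
        # Apply the summed update once per pass through the data.
        w = [wi + di for wi, di in zip(w, delta)]
    return w

def squared_error(w, examples):
    """E(w) = 1/2 * sum_d (t_d - o_d)^2"""
    return 0.5 * sum((t - linear_output(w, x)) ** 2 for x, t in examples)

# Toy data generated from t = 1 + 2*x1 (x[0] = 1 is the bias input).
data = [([1.0, float(v)], 1.0 + 2.0 * v) for v in range(4)]
learned = batch_gradient_descent(data)
```

On this toy data the learned weights should approach (1, 2). A learning rate that is too large makes the updates diverge, so the value 0.05 here is specific to this data.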
Stochastic Gradient Descent
- Update the weights after each training example rather than all at once
- Takes less memory
- Can sometimes avoid local minima
- $\eta$ must decrease with time in order for it to converge
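The per-example variant can be sketched the same way for a linear unit. The decay schedule for the learning rate below is purely an assumption for illustration (the slides do not give one), as are the function name and the shuffling of examples on each pass.

```python
import random

def stochastic_gradient_descent(examples, eta0=0.1, epochs=500, seed=0):
    """examples: list of (x, t) pairs; x[0] should be the constant bias input."""
    rng = random.Random(seed)
    n = len(examples[0][0])
    w = [0.0] * n
    for epoch in range(epochs):
        # Hypothetical schedule: shrink the learning rate as training proceeds.
        eta = eta0 / (1.0 + 0.01 * epoch)
        order = list(examples)
        rng.shuffle(order)
        for x, t in order:
            o = sum(wi * xi for wi, xi in zip(w, x))
            # Update immediately using this one example's error.
            for i in range(n):
                w[i] += eta * (t - o) * x[i]
    return w

# Toy data generated from t = 1 + 2*x1 (x[0] = 1 is the bias input).
sgd_data = [([1.0, float(v)], 1.0 + 2.0 * v) for v in range(4)]
sgd_weights = stochastic_gradient_descent(sgd_data)
```

Only the current example and the weight vector are needed at each step, which is where the memory saving comes from.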
- A single perceptron can only learn linearly separable functions
- We would like to make networks of perceptrons, but how do we determine the error of the output for an internal node?
- Solution: the Backpropagation Algorithm
- We need a differentiable threshold unit in order to continue
- Our old threshold function (1 if y > 0, -1 otherwise) is not differentiable
- One solution is the sigmoid unit
Sigmoid Function
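Output:
$$o = \sigma(\vec{w} \cdot \vec{x})$$
$$\sigma(y) = \frac{1}{1 + e^{-y}}$$
$$\frac{\partial \sigma(y)}{\partial y} = \sigma(y)\,(1 - \sigma(y))$$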
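In Python, the sigmoid and its derivative can be written as below. The function names are chosen for this example. Note that the derivative is expressed entirely in terms of the sigmoid's own value, which is what makes it cheap to use during training.

```python
import math

def sigmoid(y):
    """sigma(y) = 1 / (1 + e^(-y)); maps any real y into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_derivative(y):
    """d sigma / dy = sigma(y) * (1 - sigma(y))"""
    s = sigmoid(y)
    return s * (1.0 - s)
```

This simple form can overflow in `math.exp` for very large negative y, so treat it as a sketch rather than a numerically hardened version.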
Variable Definitions
- $x_{ij}$ = the input to unit j from unit i
- $w_{ij}$ = the weight associated with the input to unit j from unit i
- $o_j$ = the output computed by unit j
- $t_j$ = the target output for unit j
- outputs = the set of units in the final layer of the network
- Downstream(j) = the set of units whose immediate inputs include the output of unit j
Backpropagation Rule
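Error on a single training example d:
$$E_d(\vec{w}) = \frac{1}{2} \sum_{k \in \mathit{outputs}} (t_k - o_k)^2$$
Error term for an internal unit j:
$$\delta_j = o_j\,(1 - o_j) \sum_{k \in \mathit{Downstream}(j)} \delta_k\, w_{jk}$$
Weight update:
$$\Delta w_{ij} = \eta\, \delta_j\, x_{ij}$$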
Backpropagation Algorithm
- For simplicity, the following algorithm is for a two-layer neural network, with one output layer and one hidden layer
- Thus, Downstream(j) = outputs for any internal node j
- Note: Any boolean function can be represented by a two-layer neural network!
BACKPROPAGATION(training_examples, $\eta$, $n_{in}$, $n_{out}$, $n_{hidden}$)
- Create a feed-forward network with $n_{in}$ inputs, $n_{hidden}$ units in the hidden layer, and $n_{out}$ output units
- Initialize all the network weights to small random numbers (e.g. between -.05 and .05)
- Until the termination condition is met, Do
  - For each training example (x, t), Do
    - Propagate the input forward through the network:
      - Input the instance x to the network and compute the output $o_u$ for every unit u in the network
    - Propagate the errors backward through the network:
      - For each network output unit k, calculate its error term $\delta_k$: $\delta_k = o_k (1 - o_k)(t_k - o_k)$
      - For each hidden unit h, calculate its error term $\delta_h$: $\delta_h = o_h (1 - o_h) \sum_{k \in \mathit{outputs}} w_{hk}\, \delta_k$
      - Update each network weight $w_{ij}$: $w_{ij} \leftarrow w_{ij} + \Delta w_{ij}$, where $\Delta w_{ij} = \eta\, \delta_j\, x_{ij}$
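The algorithm can be sketched in plain Python as follows. This is a minimal illustration, not the code from the face-recognition example: the function names, the list-of-lists weight layout, and the extra bias weight on every unit (fed by a constant input of 1) are all assumptions made here.

```python
import math
import random

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def init_network(n_in, n_hidden, n_out, rng):
    """Small random weights in [-0.05, 0.05]; index 0 of each unit is its bias weight."""
    small = lambda: rng.uniform(-0.05, 0.05)
    hidden = [[small() for _ in range(n_in + 1)] for _ in range(n_hidden)]
    output = [[small() for _ in range(n_hidden + 1)] for _ in range(n_out)]
    return hidden, output

def forward(hidden, output, x):
    """Return the bias-extended inputs, bias-extended hidden outputs, and final outputs."""
    xb = [1.0] + list(x)
    hb = [1.0] + [sigmoid(sum(w * v for w, v in zip(unit, xb))) for unit in hidden]
    o_k = [sigmoid(sum(w * v for w, v in zip(unit, hb))) for unit in output]
    return xb, hb, o_k

def backprop_step(hidden, output, x, t, eta):
    """One stochastic update of all weights for a single example (x, t)."""
    xb, hb, o_k = forward(hidden, output, x)
    # Output units: delta_k = o_k (1 - o_k) (t_k - o_k)
    d_k = [o * (1 - o) * (tk - o) for o, tk in zip(o_k, t)]
    # Hidden units: delta_h = o_h (1 - o_h) * sum_k w_hk delta_k
    # (computed before the output weights are changed)
    d_h = []
    for h in range(len(hidden)):
        o_h = hb[h + 1]
        back = sum(output[k][h + 1] * d_k[k] for k in range(len(output)))
        d_h.append(o_h * (1 - o_h) * back)
    # w_ij <- w_ij + eta * delta_j * x_ij
    for k, unit in enumerate(output):
        for i in range(len(unit)):
            unit[i] += eta * d_k[k] * hb[i]
    for h, unit in enumerate(hidden):
        for i in range(len(unit)):
            unit[i] += eta * d_h[h] * xb[i]

def total_error(hidden, output, examples):
    return 0.5 * sum((tk - ok) ** 2
                     for x, t in examples
                     for tk, ok in zip(t, forward(hidden, output, x)[2]))
```

A typical use is to call `backprop_step` once per example, looping over the training set until some termination condition (a fixed number of passes, or an error threshold) is met.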
Momentum
- Add a fraction $0 \le \alpha < 1$ of the previous update for a weight to the current update
- May allow the learner to avoid local minima
- May speed up convergence to the global minimum
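For a single weight, the momentum rule might look like the sketch below. The function name and the symbol alpha for the momentum fraction are choices made for this example; `gradient_step` stands for the ordinary update (learning rate times error term times input) that would be applied without momentum.

```python
def momentum_update(weight, gradient_step, previous_update, alpha=0.3):
    """Return (new_weight, update_applied).

    update = gradient_step + alpha * previous_update, with 0 <= alpha < 1.
    The caller keeps update_applied to pass in as previous_update next time.
    """
    update = gradient_step + alpha * previous_update
    return weight + update, update

# Two consecutive steps in the same direction: the second one is larger.
w, last = momentum_update(1.0, 0.2, 0.0, alpha=0.5)
w, last = momentum_update(w, 0.2, last, alpha=0.5)
```

When successive gradient steps point the same way, the carried-over fraction makes each step larger, which is the source of the possible speed-up.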
If you match the training examples too well, your performance on the real problems may suffer
- Validation set: data from your training set that is not trained on, but instead used to check the function
- Stop when the performance on this set seems to be decreasing, while saving the best network seen so far
- There may be local minima, so watch out!
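One way to organize this stopping rule is sketched below. Everything here is hypothetical scaffolding: `train_one_epoch` and `validation_error` are callables the caller supplies, and `patience` (how many non-improving passes to tolerate before stopping) is an assumed knob that guards against stopping at a brief local bump in the held-out error.

```python
import copy

def train_with_early_stopping(network, train_one_epoch, validation_error,
                              max_epochs=1000, patience=20):
    """Train in place, but return a copy of the best network seen on held-out data."""
    best_error = validation_error(network)
    best_network = copy.deepcopy(network)
    epochs_since_best = 0
    for _ in range(max_epochs):
        train_one_epoch(network)
        err = validation_error(network)
        if err < best_error:
            # New best: remember it and reset the counter.
            best_error = err
            best_network = copy.deepcopy(network)
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break  # held-out performance has stopped improving
    return best_network, best_error
```

The network passed in keeps training past the best point, so the returned copy, not the original object, is the one to use afterwards.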
Representational Capabilities
- Boolean functions: Every boolean function can be represented exactly by some network with two layers of units
- Continuous functions: Every bounded continuous function can be approximated to arbitrary accuracy with two layers of units
- Arbitrary functions: Any function can be approximated to arbitrary accuracy with three layers of units
From Machine Learning by Tom M. Mitchell
- Input: 30 by 32 pictures of people with the following properties:
  - Wearing eyeglasses or not
  - Facial expression: happy, sad, angry, neutral
  - Direction in which they are looking: left, right, up, straight ahead
- Output: Determine which category it fits into for one of these properties (we will talk about direction)
Input Encoding
The value of the pixel (0 to 255) is linearly mapped onto the range of reals between 0 and 1
Output Encoding
- Could use a single output node with the classifications assigned to 4 values (e.g. 0.2, 0.4, 0.6, and 0.8)
- Instead, use 4 output nodes (one for each value)
- Example: (0.9, 0.1, 0.1, 0.1) = left, (0.1, 0.9, 0.1, 0.1) = right, etc.
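Both encodings are easy to express in Python. In this sketch the function names are invented for the example, and the order of the last two output nodes (up, then straight ahead) is an assumption, since the slide only spells out left and right. Decoding picks the output node with the largest value.

```python
# Order of the 4 output nodes; positions of "up" and "straight" are assumed.
DIRECTIONS = ["left", "right", "up", "straight"]

def encode_pixels(pixels):
    """Linearly map pixel values in 0..255 onto reals in 0..1."""
    return [p / 255.0 for p in pixels]

def encode_target(direction):
    """One output node per direction: 0.9 for the true class, 0.1 elsewhere."""
    return [0.9 if d == direction else 0.1 for d in DIRECTIONS]

def decode_output(outputs):
    """Pick the direction whose output node has the highest value."""
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    return DIRECTIONS[best]
```

A 30 by 32 image would be flattened to a list of 960 pixel values before being passed to `encode_pixels`.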
Network structure
[Diagram: 960 inputs x1, x2, ..., x960 feed 3 hidden units, which feed the outputs]
Other Parameters
- Learning rate: $\eta$ = 0.3
- Momentum: $\alpha$ = 0.3
- Used full gradient descent (as opposed to stochastic)
- Weights in the output units were initialized to small random values, but input weights were initialized to 0
Try it yourself!
- Go to the Software and Data page, then follow the "Neural network learning to recognize faces" link
- Follow the documentation
- You can also copy the code and data from my ACM account (provided you have one too), although you will want a fresh copy of facetrain.c and imagenet.c from the website
/afs/acm.uiuc.edu/user/jcander1/Public/NeuralNetwork