
Neural Networks

Introduction to Artificial Intelligence



G. Lakemeyer

Winter Term 2024/25


Brain vs. Computer

Analogy to the human brain: many processors (neurons) and connections (synapses), which process information locally and in parallel.

[Figure: schematic of a neuron: cell body (soma) with nucleus, dendrites, an axon with its axonal arborization, and synapses connecting to axons from other cells.]

von Neumann computer vs. the brain:

                      Supercomputer       Personal Computer   Human Brain
Computational units   10^6 GPUs + CPUs    8 CPU cores         10^6 columns
                      10^15 transistors   10^10 transistors   10^11 neurons
Storage units         10^16 bytes RAM     10^10 bytes RAM     10^11 neurons
                      10^17 bytes disk    10^12 bytes disk    10^14 synapses
Cycle time            10^-9 sec           10^-9 sec           10^-3 sec
Operations/sec        10^18               10^10               10^17

(A crude comparison of a leading supercomputer, Summit; a typical personal computer of 2019; and the human brain. Human brain power has not changed much in thousands of years, whereas supercomputers have improved from megaFLOPs in the 1960s to gigaFLOPs in the 1980s, teraFLOPs in the 1990s, petaFLOPs in 2008, and exaFLOPs since then.)

Note: The brain's cycle time is slow (10^-3 s), but its updates work in parallel. A serial simulation on a computer needs several hundred cycles.
Advantages of Neural Networks

High processing speed through massive parallelism;
still works if parts of the network are damaged (robustness, graceful degradation);
designed for inductive learning.

It seems reasonable (obvious?) to try and recreate these advantages by designing artificial neural networks.

Research in this area started as early as the 1940s (McCulloch and Pitts 1943).
Some Terminology of Neural Networks

Units: represent the nodes (neurons) in the network.
Input/output units: special nodes connected to the external environment.
Links: edges between nodes in the network. Each node has input and output edges.
Weights: each edge has a weight, usually a real number.
Activation level: the value computed by a unit given its weighted inputs. It is passed to the neighboring nodes via the output edges.
How a Unit Works

[Figure: a single unit i. The input links deliver the activations a_j of neighboring units, weighted by W_{j,i}; their sum in_i is fed into the activation function g, and the result a_i = g(in_i) is sent out along the output links.]

a_i := g(in_i) = g( Σ_j W_{j,i} · a_j )

g is usually a nonlinear function.
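As a concrete illustration, the computation of a single unit can be sketched in a few lines of Python (the weights, inputs, and the choice of the sigmoid as g are invented for the example):

import math

def sigmoid(x):
    # A common choice for the nonlinear activation function g.
    return 1.0 / (1.0 + math.exp(-x))

def unit_output(weights, activations, g=sigmoid):
    # in_i = sum_j W_{j,i} * a_j, then a_i = g(in_i).
    in_i = sum(w * a for w, a in zip(weights, activations))
    return g(in_i)

# Example with made-up weights and input activations:
print(unit_output([0.5, -1.0, 2.0], [1.0, 0.0, 0.5]))   # g(1.5), roughly 0.82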
Activation Functions

Some important activation functions are:

[Figure: (a) the step function with threshold t, (b) the sign function (outputs +1 or -1), and (c) the sigmoid function s(x) = 1 / (1 + e^-x), whose derivative is s'(x) = s(x) · (1 - s(x)).]

Note: t in the step function represents a threshold, that is, the minimum total weighted input necessary for the neuron to fire (similar to actual neurons in the brain). Mathematically, one can always use a threshold of 0 by adding an additional input with activation level a_0 = -1 and weight W_{0,i} = t. Then

a_i = step_t( Σ_{j=1..n} W_{j,i} · a_j ) = step_0( Σ_{j=0..n} W_{j,i} · a_j )
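The threshold trick can be checked mechanically (a small Python sketch; the weights and the threshold t = 1.5 are arbitrary example values, and step is taken to fire on inputs ≥ t):

def step(x, t=0.0):
    # Fires (outputs 1) iff the total weighted input reaches the threshold t.
    return 1 if x >= t else 0

def unit_with_threshold(weights, inputs, t):
    return step(sum(w * a for w, a in zip(weights, inputs)), t)

def unit_with_bias(weights, inputs, t):
    # Extra input a_0 = -1 with weight W_0 = t, then threshold 0.
    return step(sum(w * a for w, a in zip([t] + weights, [-1] + inputs)), 0.0)

# Both formulations agree on every input combination (example with t = 1.5):
for a1 in (0, 1):
    for a2 in (0, 1):
        assert unit_with_threshold([1, 1], [a1, a2], 1.5) == unit_with_bias([1, 1], [a1, a2], 1.5)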
Representing Boolean Functions

Inputs are 0 or 1.

[Figure: three single threshold units.
  AND: two inputs, weights W = 1 and W = 1, threshold t = 1.5.
  OR:  two inputs, weights W = 1 and W = 1, threshold t = 0.5.
  NOT: one input, weight W = -1, threshold t = -0.5.]

Thus neural nets can represent at least any arbitrary Boolean function.
[We will see below that they can represent much more.]
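These three units can be verified directly (a Python sketch using the weights and thresholds from the figure, with a unit that fires iff the weighted sum reaches the threshold):

def threshold_unit(weights, threshold, inputs):
    # Output 1 iff the weighted sum of the inputs reaches the threshold.
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0

AND = lambda a, b: threshold_unit([1, 1], 1.5, [a, b])
OR  = lambda a, b: threshold_unit([1, 1], 0.5, [a, b])
NOT = lambda a:    threshold_unit([-1], -0.5, [a])

for a in (0, 1):
    assert NOT(a) == 1 - a
    for b in (0, 1):
        assert AND(a, b) == (a and b)
        assert OR(a, b) == (a or b)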
Network Topologies

Feed-forward: units and links form a directed acyclic graph.
Recurrent: units and links form an arbitrary directed graph.

Nets are often arranged in layers, that is, nodes of one layer are linked only to nodes of the next layer. There are no links between nodes within a layer. (When counting layers, the input layer is ignored.)

[Figure: example of a feed-forward 2-layer network with input units I1, I2, hidden units H3, H4, output unit O5, and weights w13, w14, w23, w24, w35, w45.]

Of all networks, feed-forward nets are the best understood. (The output is solely a function of the inputs and the weights.)
Recurrent networks are often unstable, oscillate, or show chaotic behavior.
Note: The brain is massively recurrent.
Recurrent Network Types – Hopfield Nets

Recurrent nets with symmetric bi-directional links (W_{i,j} = W_{j,i}).
All units are both input and output units.
The nets function as associative memory: after training, a new input is matched against the "closest" example seen during training.

Example:
Training with n photographs.
New input = part of a photograph used during training.
Then the network reconstructs the whole image.

Note: Each weight is a partial representation of every image!

Theorem: A Hopfield net can reliably store 0.138 · N examples, where N is the number of units.
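A minimal Hopfield-style sketch (Python with NumPy; it uses the common Hebbian weight-setting with ±1-valued units, which is one standard formulation and not necessarily the exact variant meant here):

import numpy as np

def train_hopfield(patterns):
    # Hebbian rule: W_{i,j} = (1/N) * sum over patterns of x_i * x_j (symmetric, zero diagonal).
    n = patterns.shape[1]
    W = patterns.T @ patterns / n
    np.fill_diagonal(W, 0.0)
    return W

def recall(W, x, steps=10):
    # Repeatedly update all units; the state settles near the closest stored pattern.
    for _ in range(steps):
        x = np.sign(W @ x)
        x[x == 0] = 1
    return x

patterns = np.array([[1, -1, 1, -1, 1, -1],
                     [1, 1, 1, -1, -1, -1]], dtype=float)
W = train_hopfield(patterns)
noisy = np.array([1, -1, 1, -1, 1, 1], dtype=float)   # first pattern with one unit flipped
print(recall(W, noisy))                               # recovers the first stored pattern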
Recurrent Network Types – Boltzmann Machines (G. Hinton)

Also use symmetric weights.
There are units which are neither input nor output units.
Stochastic activation function.
State transitions are like a simulated-annealing search for the configuration that best approximates the training set.
(Formally identical to a certain kind of belief nets.)
Feed-Forward (FF) Nets

[Figure: the 2-layer example network again, with input units I1, I2, hidden units H3, H4, output unit O5, and weights w13, w14, w23, w24, w35, w45.]

Hidden units: units without direct connection to the external environment. Hidden units are organized in one or more hidden layers.

Perceptrons: FF networks without hidden units.

FF nets represent complex nonlinear functions (the activation function g must be nonlinear).
With 1 hidden layer, every continuous function is representable.
With 2 hidden layers, every function is representable.

When the topology and activation function g are fixed, the representable functions have a specific parametrized form (parameters = weights).

Example:
a_5 = g(W_{3,5} · a_3 + W_{4,5} · a_4) = g(W_{3,5} · g(W_{1,3} · a_1 + W_{2,3} · a_2) + W_{4,5} · g(W_{1,4} · a_1 + W_{2,4} · a_2))

⇒ Learning = search for the correct parameters = nonlinear regression.
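Written as code, the parametrized form of this example network is just two nested applications of g (a Python sketch; the sigmoid and the particular weight values are invented for illustration):

import math

def g(x):
    return 1.0 / (1.0 + math.exp(-x))   # sigmoid as the activation function

def a5(a1, a2, w):
    # The six weights of the example network, named after their edges.
    a3 = g(w["13"] * a1 + w["23"] * a2)
    a4 = g(w["14"] * a1 + w["24"] * a2)
    return g(w["35"] * a3 + w["45"] * a4)

weights = {"13": 0.2, "14": -0.4, "23": 0.7, "24": 0.1, "35": 1.5, "45": -1.0}
print(a5(1.0, 0.0, weights))   # some value in (0, 1), depending on the weights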
How to encode the input?

Example: the attribute Patrons with the values none, some, full (from the restaurant domain).

1) Local encoding: one input unit that takes 3 values, e.g. none = 0, some = 0.5, full = 1.

2) Distributed encoding: use 3 units p_none, p_some, p_full.
   The value "some" is then represented as p_none = 0, p_some = 1, p_full = 0.
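Both encodings in a small Python sketch (the numeric values of the local encoding follow the slide; the helper names are made up):

PATRONS = ["none", "some", "full"]

def local_encoding(value):
    # One input unit; the three values are mapped to 0, 0.5, 1.
    return {"none": 0.0, "some": 0.5, "full": 1.0}[value]

def distributed_encoding(value):
    # Three input units p_none, p_some, p_full; exactly one of them is 1.
    return [1.0 if value == v else 0.0 for v in PATRONS]

print(local_encoding("some"))        # 0.5
print(distributed_encoding("some"))  # [0.0, 1.0, 0.0]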
Optimal Network structure?

Choosing the right network is a difficult problem!
Also, the optimal net may be exponentially large (relative to the input).
Net too small: the desired function is not representable.
Net too big: the net memorizes the examples without generalization (analogous to memorizing in decision trees): overfitting.

There is no good theory of how to choose the right network, only some heuristics.

Heuristic of "optimal brain damage":
Start with a network with a maximal number of connections. After the first training run, reduce the number of connections using information theory. Iterate.

Example: a network to recognize (postal) zip codes; 3/4 of the initial connections could be removed.
(There are also methods to move from a network with few nodes to a network with more nodes.)
Perceptrons

[Figure: (left) a perceptron network: input units I_j connected directly to output units O_i via weights W_{j,i}; (right) a single perceptron: input units I_j connected to one output unit O via weights W_j.]

O = step_0( Σ_j W_j · I_j ) = step_0(W · I)
Representing the Majority Function

The majority function outputs 1 iff more than half of the inputs I_j are 1.
A single perceptron represents it, e.g. with all weights W_j = 1 and a threshold just above n/2.

Recall: a decision tree needs size O(2^n) in order to represent this function.
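A sketch of this construction in Python (the choice of all weights equal to 1 and a threshold just above n/2 is the standard construction assumed here):

from itertools import product

def majority_perceptron(inputs):
    # All weights are 1; the unit fires iff the weighted sum exceeds n/2,
    # i.e. iff more than half of the inputs are 1.
    n = len(inputs)
    return 1 if sum(inputs) > n / 2 else 0

# Check against the definition for n = 5:
for bits in product((0, 1), repeat=5):
    assert majority_perceptron(bits) == (1 if bits.count(1) > bits.count(0) else 0)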
How Do Perceptrons Learn?

function NEURAL-NETWORK-LEARNING(examples) returns network
   network ← a network with randomly assigned weights
   repeat
      for each e in examples do            /* one pass over all examples = one epoch */
         O ← NEURAL-NETWORK-OUTPUT(network, e)
         T ← the observed output values from e
         update the weights in network based on e, O, and T
      end
   until all examples correctly predicted or stopping criterion is reached
   return network

Error = T - O, where
   O = predicted (actual) output
   T = correct output

Update rule (with learning rate α):  W_j := W_j + α · I_j · Error

Theorem [Rosenblatt 1960]: Learning using this update rule always converges to the correct output for representable functions.
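The scheme above, specialized to a single perceptron, might look as follows (illustrative Python; the learning rate, epoch limit, and the OR training data are chosen for the example, and the threshold is learned via the extra bias input described earlier):

import random

def step(x):
    return 1 if x >= 0 else 0

def train_perceptron(examples, n_inputs, alpha=0.1, max_epochs=1000):
    # examples: list of (inputs, target); a bias input of -1 is folded in,
    # so the threshold is learned along with the other weights.
    w = [random.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
    for _ in range(max_epochs):
        all_correct = True
        for inputs, target in examples:
            x = [-1] + list(inputs)                     # bias input
            output = step(sum(wi * xi for wi, xi in zip(w, x)))
            error = target - output                     # Error = T - O
            if error != 0:
                all_correct = False
                w = [wi + alpha * xi * error for wi, xi in zip(w, x)]   # update rule
        if all_correct:
            break
    return w

# Learn the OR function (linearly separable, so the rule converges):
examples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = train_perceptron(examples, 2)
print([step(sum(wi * xi for wi, xi in zip(w, [-1] + list(i)))) for i, _ in examples])   # [0, 1, 1, 1]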
Effect of Error Correction on Weights

The output is O = g( Σ_j W_j · I_j ); after an update it becomes g( Σ_j (W_j + α · I_j · Error) · I_j ).

Suppose Error > 0 (the actual output is too small). Then:
   if I_j = 0, then W_j does not change;
   if I_j < 0, then W_j becomes smaller (the negative influence of I_j shrinks);
   if I_j > 0, then W_j becomes bigger (the positive influence of I_j increases).
What can Perceptrons represent?

Answer: Not much!

Perceptrons can represent exactly the class of linearly separable functions.

Examples (inputs I1, I2 ∈ {0, 1}):

[Figure: the four input points plotted in the (I1, I2) plane for (a) I1 and I2, (b) I1 or I2, (c) I1 xor I2. For AND and OR a single line separates the inputs with output 1 from those with output 0; for XOR no such line exists.]

In general, for n-dimensional inputs f(I_1, ..., I_n), a perceptron can only separate the inputs mapped to 1 from those mapped to 0 by a single (n-1)-dimensional hyperplane.

[Figure: (a) a separating plane in the three-dimensional input space I1, I2, I3; (b) the corresponding unit with weights W = -1, W = -1, W = -1 and threshold t = -1.5.]
Why XOR is not Representable

Theorem: There is no perceptron using the step function as activation function that can represent XOR.

Proof: Assume there were one, with weights W_1, W_2 and threshold t. Then:
   x = 0, y = 0:  XOR = 0  ⇒  step_t(0) = 0          ⇒  0 < t
   x = 1, y = 0:  XOR = 1  ⇒  step_t(W_1) = 1        ⇒  W_1 ≥ t
   x = 0, y = 1:  XOR = 1  ⇒  step_t(W_2) = 1        ⇒  W_2 ≥ t
   x = 1, y = 1:  XOR = 0  ⇒  step_t(W_1 + W_2) = 0  ⇒  W_1 + W_2 < t

But W_1 ≥ t and W_2 ≥ t imply W_1 + W_2 ≥ 2t > t (since t > 0), contradicting the last line.
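The same point can be illustrated by brute force (a Python sketch; it only samples a finite grid of weights and thresholds, so it is a demonstration rather than a proof):

def step_unit(w1, w2, t, x, y):
    return 1 if w1 * x + w2 * y >= t else 0

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

grid = [i / 4 for i in range(-20, 21)]          # weights and thresholds in [-5, 5]
found = any(
    all(step_unit(w1, w2, t, x, y) == XOR[(x, y)] for (x, y) in XOR)
    for w1 in grid for w2 in grid for t in grid
)
print(found)   # False: no single threshold unit on this grid computes XOR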
Learning Behavior I: Majority function

[Figure: learning curves, % correct on the test set (0.4–0.9) vs. training set size (0–100), for a perceptron and a decision tree learning the majority function. The perceptron learns it quickly; the decision tree does noticeably worse.]
Learning Behavior II: Restaurant Example

[Figure: learning curves, % correct on the test set (0.4–0.9) vs. training set size (0–100), for a perceptron and a decision tree on the restaurant example. Here the decision tree does better than the perceptron.]

The restaurant domain is not linearly separable.
Multi-layer Feed-Forward Networks

[Figure: a layered feed-forward network: input units I_k, connected by weights W_{k,j} to a layer of hidden units a_j, connected by weights W_{j,i} to the output units O_i.]

(This topology is suitable for the restaurant example, using a local encoding of the input attributes.)
Back-Propagation

function BACK-PROP-UPDATE(network, examples, α) returns a network with modified weights
   inputs: network, a multilayer network
           examples, a set of input/output pairs
           α, the learning rate
   repeat
      for each e in examples do
         /* Compute the output for this example */
         O ← RUN-NETWORK(network, I^e)
         /* Compute the error and Δ for units in the output layer */
         Err^e ← T^e - O
         /* Update the weights leading to the output layer */
         W_{j,i} ← W_{j,i} + α · a_j · Err_i^e · g'(in_i)
         for each subsequent layer in network do
            /* Compute the error at each node */
            Δ_j ← g'(in_j) · Σ_i W_{j,i} · Δ_i
            /* Update the weights leading into the layer */
            W_{k,j} ← W_{k,j} + α · I_k · Δ_j
         end
      end
   until network has converged
   return network

Err_i^e = T_i^e - O_i^e is the error of the i-th output unit; Err^e is the error vector.
Δ_i = Err_i · g'(in_i).
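A compact sketch of this procedure for a network with one hidden layer of sigmoid units (Python with NumPy; the network size, learning rate, random seed, and the toy XOR data are chosen purely for illustration, and whether training succeeds depends on the random initialization):

import numpy as np

def g(x):
    return 1.0 / (1.0 + np.exp(-x))              # sigmoid activation

def g_prime(x):
    s = g(x)
    return s * (1.0 - s)                         # g'(x) = g(x) (1 - g(x))

rng = np.random.default_rng(1)
n_hidden = 4
W_kj = rng.uniform(-1, 1, (3, n_hidden))         # (2 inputs + bias input) -> hidden weights
W_ji = rng.uniform(-1, 1, (n_hidden + 1, 1))     # (hidden units + bias input) -> output weights
alpha = 0.5

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)  # toy XOR targets

for epoch in range(10000):
    for x, t in zip(X, T):
        I = np.append(x, -1.0)                   # extra bias input a_0 = -1
        in_j = I @ W_kj
        a_j = np.append(g(in_j), -1.0)           # hidden activations plus bias input
        in_i = a_j @ W_ji
        O = g(in_i)
        delta_i = (t - O) * g_prime(in_i)                  # Delta of the output unit
        delta_j = g_prime(in_j) * (W_ji[:-1] @ delta_i)    # back-propagated Delta
        W_ji += alpha * np.outer(a_j, delta_i)   # update hidden -> output weights
        W_kj += alpha * np.outer(I, delta_j)     # update input -> hidden weights

for x in X:                                      # outputs after training
    a_j = np.append(g(np.append(x, -1.0) @ W_kj), -1.0)
    print(x, g(a_j @ W_ji)[0])                   # close to 0, 1, 1, 0 if training succeeded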
Some Intuition behind Backpropagation

If g'(in_i) is near 0, then changes of W_{j,i} have little effect on the output, so the weight should only be changed a little. This is captured by Δ_i := Err_i · g'(in_i).

Δ_j := g'(in_j) · Σ_i W_{j,i} · Δ_i estimates the portion of the error for which the hidden unit j is responsible.
The Math behind Backpropagation

Let W be the vector of all weights. Then the total error is

E(W) = 1/2 · Σ_i (T_i - O_i)^2
     = 1/2 · Σ_i (T_i - g( Σ_j W_{j,i} · a_j ))^2
     = 1/2 · Σ_i (T_i - g( Σ_j W_{j,i} · g( Σ_k W_{k,j} · I_k )))^2     (for a 2-layer FF network)

∂E/∂W_{j,i} = -a_j · Δ_i, and similarly ∂E/∂W_{k,j} = -I_k · Δ_j.
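For an output-layer weight the gradient can be worked out explicitly (a short derivation in the notation above; only the i-th term of the sum depends on W_{j,i}):

\[
\frac{\partial E}{\partial W_{j,i}}
  = -(T_i - O_i)\,\frac{\partial\, g(\mathrm{in}_i)}{\partial W_{j,i}}
  = -\mathrm{Err}_i \, g'(\mathrm{in}_i)\,
    \frac{\partial}{\partial W_{j,i}} \sum_{j'} W_{j',i}\, a_{j'}
  = -\mathrm{Err}_i \, g'(\mathrm{in}_i)\, a_j
  = -\,a_j\, \Delta_i .
\]

Hence the update W_{j,i} ← W_{j,i} + α · a_j · Δ_i moves the weight in the direction of the negative gradient, i.e. downhill on the error surface (see the next slide).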
Back-Propagation = Gradient Descent

Gradient descent is like hill climbing, except that we are looking for minima instead of maxima.

[Figure: the error Err plotted as a surface over the weight space (W1, W2); gradient descent moves downhill along the slope toward a minimum.]
Restaurant Example

[Figure: learning curves, % correct on the test set (0.4–0.9) vs. training set size (0–100), for a multilayer network and a decision tree on the restaurant example; the two reach comparable accuracy.]
Restaurant Example

[Figure: total squared error Σ (T - O)^2 on a training set of 100 examples vs. the number of epochs (0–400); the error decreases steadily toward 0 as training proceeds.]
Summary

What are they good for?
Neural nets are useful for attribute-based representations (like decision trees), in particular also for attributes with continuous values.

Size of a net: up to 2^n / n units may be needed to represent an arbitrary Boolean function (⇒ on the order of 2^n weights). In practice one often needs much less.

Efficiency of learning: no useful theoretical results. Also a problem with local minima; simulated annealing is sometimes helpful (see Boltzmann machines).

Generalization: good if the right network is used :-)

Noise: no problem.

Transparency: bad! Often one has no idea why a network yields a certain output. Decision trees are much better in this regard!

Using additional knowledge: bad; not many theoretical results.
Application: Pronunciation

NETtalk synthesizes speech from written text (in English) [Sejnowski and Rosenberg 1987].

Input: a seven-character window sliding over the written text.
It uses 29 input units per character (for the 26 letters of the alphabet, blank, comma, and 1 unit for other characters).
Hidden layer: 80 units.
Output: units encoding phonemes.

Training set: text with 1024 words. After 50 epochs, 95% correctness.
Problems: words with identical spelling but different pronunciation (e.g. "lead").
Result on the test set: 78% correctness.

(Today there are better systems not based on neural nets, but NETtalk paved the way for the commercial success of neural nets.)
Application: Recognizing Zip Codes (LeNet)

Neural net to recognize handwritten digits (zip codes) [LeCun 1989].
Input: 16 × 16 pixels per digit.
3 hidden layers with 768, 192, and 30 units.
Output: 10 units for the digits 0–9.

Limiting the connections was crucial:
   Each unit of the 1st layer was connected to a 5 × 5 array of the input (25 connections).
   The first layer was organized in 12 groups of 64 units each. All units of the same group use identical weights.
   The whole network used only 9760 different weights instead of 200,000.

Training set: 7300 examples.
Test set: 2000 examples; 99% correctness!
It was used by the US Postal Service, realized in VLSI.
LeNet was the First Convolutional Neural Net

[Figure: the LeNet architecture, with the 1st (convolutional) layer marked.]
A Typical Convolutional Neural Network (CNN) Today

[Figure: a typical CNN architecture. Source: Wikimedia Commons, commons.wikimedia.org]

Alternating convolutional and subsampling layers plus an output layer.
   Convolution: maps overlapping regions of the input to the same unit.
   Each feature map extracts a different feature (e.g. lines, curves).
   Subsampling (pooling): produces information loss (e.g. take the max of neighboring units).

Pre-training hidden layers:
   Starting with the input, train the first hidden layer using unsupervised learning.
   Then fix the weights of that layer and use its output as input for training the next layer.
(Supervised) training of the whole network uses back-propagation.
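A minimal sketch of one convolution step followed by max-pooling (Python with NumPy; the image size, kernel, and pooling size are arbitrary example values, not those of any particular CNN):

import numpy as np

def convolve2d(image, kernel):
    # Valid convolution: each output unit sees one overlapping region of the input,
    # and every region is mapped through the same (shared) kernel weights.
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.zeros((h, w))
    for r in range(h):
        for c in range(w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    # Subsampling: keep only the maximum of each size x size block (deliberate information loss).
    h, w = feature_map.shape
    return np.array([[feature_map[r:r + size, c:c + size].max()
                      for c in range(0, w - size + 1, size)]
                     for r in range(0, h - size + 1, size)])

image = np.random.default_rng(0).random((8, 8))
edge_kernel = np.array([[1.0, -1.0], [1.0, -1.0]])             # a simple vertical-edge feature
feature_map = np.maximum(convolve2d(image, edge_kernel), 0)    # ReLU-style nonlinearity
print(max_pool(feature_map).shape)                             # (3, 3)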
