Neural Network Basics
2.1 Neurons or Nodes and Layers
Fig. 2.1 shows the structure with just one building component. Such nodes are chained together with many artificial neurons to construct a network. In Fig. 2.2 there are three neurons. Now the activation functions and intermediate outputs are included implicitly in the nodes, and the weights in the arcs (connections) between nodes. The structure in Fig. 2.2 could still be a part of a larger network. The input and output are often a special type of neuron, either to accept input data or to generate output values of the network.

Fig. 2.2 An artificial neural network with four layers: input nodes {I1, I2, I3, I4}, hidden nodes {N1, N2} and {N3}, and output node {O}.
In Fig. 2.2 there are four layers: the input layer, two hidden layers and the output layer. Normally, all nodes of a single layer share the same properties, such as the activation function, and the same type, i.e., input, hidden or output. Note that these node types are used in feedforward networks, that is, multilayer perceptrons. Still, virtually always the nodes of the same layer are of the same type, and the input and output have to be taken care of.

The network in Fig. 2.2 is extended in Fig. 2.3, which depicts a common type of feedforward network, although a small one as to the numbers of nodes in its layers. Note that in the sense of a directed graph data structure it is "complete" as to the arcs between the layers: there exist all possible arcs from each node of a layer to the nodes of the following layer. On the other hand, there are no lateral arcs between the nodes of the same layer in feedforward networks.

Fig. 2.3 A fully connected feedforward or multilayer perceptron network with input layer {I1, I2, I3, I4}, hidden layers {N1, N2} and {N3, N4}, and output layer {O}.
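To make the structure of Fig. 2.3 concrete, the following Python sketch runs one forward pass through a fully connected 4-2-2-1 network. The random weight values and the sigmoid activation are illustrative assumptions made for this example only, not values from the text.

```python
import numpy as np

def sigmoid(z):
    # Illustrative activation; activation functions are introduced later in the chapter.
    return 1.0 / (1.0 + np.exp(-z))

# Layer sizes follow Fig. 2.3: 4 inputs, two hidden layers of 2 nodes, 1 output.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(2)   # input layer  -> hidden layer 1
W2, b2 = rng.normal(size=(2, 2)), np.zeros(2)   # hidden 1     -> hidden layer 2
W3, b3 = rng.normal(size=(1, 2)), np.zeros(1)   # hidden 2     -> output layer

def forward(x):
    # Every node of a layer feeds every node of the following layer,
    # and there are no lateral connections inside a layer.
    h1 = sigmoid(W1 @ x + b1)
    h2 = sigmoid(W2 @ h1 + b2)
    return sigmoid(W3 @ h2 + b3)

print(forward(np.array([0.2, 0.5, 0.1, 0.9])))
```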
2.2 Types of neurons or nodes

The basic forms of neural networks are typically feedforward ones. Recursive networks also exist, even if they obviously do not have as many and as versatile forms as the former. As mentioned above, the types or roles of nodes also vary. Sometimes the same node may have more than one role. For instance, Boltzmann machines are an example of a neural network architecture in which nodes are both input and output.

Normally the input to a neural network is represented as an array or vector as in Equation 2.1, in which the vector is of dimension p = d_j and j denotes the layer. For the input layer the dimension d_1 is equal to the number of input variables.

Input, hidden and output nodes

Notice that input nodes do not have activation functions. Thus, they are little more than placeholders. The input is simply weighted and summed. Furthermore, the sizes of the input and output vectors will be the same if the neural network has nodes that are both input and output.

Hidden nodes have two important characteristics. First, they only receive input from other nodes, such as input or preceding hidden nodes. Second, they only output to other nodes, either output nodes or other, following hidden nodes. Hidden nodes are not directly connected to the incoming data or to the eventual output. They are often grouped into fully connected hidden layers.
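As a small illustration of "the input is simply weighted and summed", the sketch below computes the net input of one hidden node from an input vector; all numbers are made up for the example.

```python
import numpy as np

# Input vector x of dimension d_1 = 4 (one value per input variable).
x = np.array([0.2, 0.5, 0.1, 0.9])

# Illustrative weights of one hidden node, one weight per incoming arc.
w = np.array([0.4, -0.6, 0.3, 0.8])
b = 0.1  # bias term (bias nodes are discussed below)

net_input = np.dot(w, x) + b   # weighted sum of the inputs
print(net_input)               # an activation function would be applied to this value
```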
A common question concerns the number of hidden nodes in a network. Since the answer is complex, this question will be considered in different contexts. It is good to notice that the numbers of layers and nodes affect the time complexity of using a neural network.

Prior to the time of deep learning, it was suggested that one or two hidden layers are enough for a feedforward network to function virtually as a universal approximator for any mathematical function. Let us remember that if there is one hidden layer, there are two processing layers, the hidden layer and the output layer. The above-mentioned approximation of any function is, however, a theoretical result, because it does not express how the approximation could be made.

Another reason why additional hidden layers seemed to be a problem was that they would require a very extensive training set to be able to compute the weights for the network. Before deep learning, this was actually a problem, since deep learning means networks of several hidden layers. Although networks of one or two hidden layers are able to learn "everything" in theory, deep learning facilitates a more complex representation of patterns in the data.
Bias nodes

Bias nodes are added to feedforward neural networks to help them learn patterns. A bias node functions like an input node that always produces the constant value 1 or some other constant. Because of this property, bias nodes are not connected to the previous layer. The constant 1 here is called the bias activation. Not all neural networks have bias nodes. Fig. 2.4 depicts a two-hidden-layer network with bias nodes. The network includes three bias nodes. Bias neurons allow the output of an activation function to be shifted. This will be presented later on, in the context of activation functions.

Regardless of the type of neuron, node or processing unit, neural networks are almost always constructed of weighted connections between these units.

Fig. 2.4 A feedforward network with bias nodes B1, B2 and B3 attached to the input layer {I1, I2} and the hidden layers {N1, N2} and {N3, N4}, with output node O.
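To illustrate why a bias node lets the output of an activation function be shifted, the following sketch evaluates a sigmoid node (the sigmoid is assumed here only for illustration) at the same input with different bias weights; the curve moves sideways while its shape stays the same, as in Fig. 2.11(b).

```python
import math

def sigmoid_node(x, w, b):
    # One node with a single weighted input plus a bias connection
    # whose activation is the constant 1, weighted by b.
    return 1.0 / (1.0 + math.exp(-(w * x + b * 1.0)))

x = 0.0
for b in (0.5, 1.0, 1.5, 2.0):       # bias weights as in Fig. 2.11(b)
    print(b, round(sigmoid_node(x, w=1.0, b=b), 3))
# Larger bias weights shift the sigmoid to the left, so the output at x = 0 grows;
# this is the shifting effect mentioned above.
```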
$$f(x) = \begin{cases} 1, & x \ge 0.5 \\ 0, & x < 0.5 \end{cases} \qquad (2.3)$$

Equation (2.3) outputs the value 1 for inputs of 0.5 or greater and 0 for all other values. Step functions are also called threshold functions because they only return 1 (true) for values above some given threshold, e.g., according to Fig. 2.7(a). The next phase is to form a "ramp" as in Fig. 2.7(b).

Fig. 2.7 (a) Step activation function; (b) linear threshold between bounds, otherwise 0 or 1.
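A minimal sketch of the step (threshold) activation of Equation (2.3); the threshold value 0.5 comes from the text, while the function name is only illustrative.

```python
def step_activation(x, threshold=0.5):
    # Returns 1 for inputs at or above the threshold, otherwise 0 (Eq. 2.3).
    return 1 if x >= threshold else 0

print([step_activation(v) for v in (-1.0, 0.4, 0.5, 2.0)])  # [0, 0, 1, 1]
```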
(2.4)
The hyperbolic tangent function is also one of the most important activation functions. It is restricted to the range between -1 and 1.

$$f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \qquad (2.5)$$
Teh and Hinton (2000) introduced the rectified linear unit (ReLU). It is simple and is seen as a good choice for hidden layers.

$$f(x) = \max(0, x) \qquad (2.6)$$

The advantage of the rectified linear unit comes partly from the fact that it is a piecewise linear, non-saturating function. Unlike the sigmoid or hyperbolic tangent activation functions, ReLU does not saturate to -1, 0 or 1. See Fig. 2.10. A saturating activation function moves towards and eventually approaches a limit value. For instance, the hyperbolic tangent saturates to -1 as x decreases and to 1 as x increases.

Fig. 2.10 Rectified linear unit activation function.
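The sketch below contrasts the saturating hyperbolic tangent (2.5) with the non-saturating ReLU (2.6) at a few small and large inputs; it is only a numerical illustration of the saturation remark above.

```python
import math

def tanh_act(x):
    return math.tanh(x)            # Eq. (2.5), bounded to (-1, 1)

def relu(x):
    return max(0.0, x)             # Eq. (2.6), unbounded for positive x

for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    # tanh flattens out near -1 and 1, while ReLU keeps growing with x.
    print(f"x={x:6.1f}  tanh={tanh_act(x):7.4f}  relu={relu(x):5.1f}")
```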
The final activation function is the softmax function. Along with the linear function, softmax is usually found in the output layer of a neural network. The node that has the greatest value claims the input as a member of its class. The softmax activation function is preferable because it forces the output of the neural network to represent the probability that the input falls into each of the classes. Without the softmax, the nodes' outputs are simply numeric values, with the greatest indicating the winning class.

Let us recall the iris data containing flowers from three iris species. When we input a data case to a neural network applying the softmax activation function, this allows the network to give the probability that these measurements belong to each of the three species. For example, the probabilities could be 80%, 15% and 5%. Since these are probabilities, their sum must add up to 100%. Output nodes do not inherently specify the probabilities of the classes. Therefore, softmax is useful because it produces such probabilities. The softmax function is as follows:

$$f(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$
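As a sketch of the softmax output for a three-class (iris-like) case, the raw output values below are made up so that the resulting probabilities come out near the 80%, 15% and 5% mentioned above.

```python
import math

def softmax(values):
    # Standard softmax: exponentiate each output and normalise by the sum,
    # so that the results are positive and add up to 1.
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw output-node values for the three iris species.
raw_outputs = [2.77, 1.10, 0.0]
probs = softmax(raw_outputs)
print([round(p, 3) for p in probs])   # roughly [0.8, 0.15, 0.05]
print(sum(probs))                     # 1.0 (i.e., 100%)
```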
Fig. 2.11 (a) Sigmoids for weights w in {0.5, 1.0, 1.5, 2.0}: the greater the weight, the steeper the curve; (b) sigmoids for bias b in {0.5, 1.0, 1.5, 2.0} with w = 1.0: the greater the bias, the further left the curve lies, although the shift is not complete since all the curves merge together at the top and bottom left.

2.5 Logic with neural networks

Logical operators can be implemented with neural networks. Let us look at the truth tables of the operators AND, OR and NOT. Neural networks can represent these according to Fig. 2.12.

AND: 0 AND 0 = 0, 0 AND 1 = 0, 1 AND 0 = 0, 1 AND 1 = 1
OR: 0 OR 0 = 0, 0 OR 1 = 1, 1 OR 0 = 1, 1 OR 1 = 1
NOT: NOT 0 = 1, NOT 1 = 0

Fig. 2.12 The logical operators as networks: AND with input weights 1, 1 and bias weight -1.5; OR with input weights 1, 1 and bias weight -0.5; NOT with input weight -1 and bias weight 0.5. For example, AND with inputs 1 and 1: 1*1 + 1*1 + (-1.5) = 0.5 > 0, i.e., true.
In Fig. 2.12 the following function is used,

$$y = f_h\Big(\sum_{i=1}^{p} w_i x_i + b\Big) \qquad (2.9)$$

where $f_h$ is a step function named the Heaviside function, with p variables and bias b,

$$f_h(z) = \begin{cases} 1, & z > 0 \\ 0, & z \le 0 \end{cases} \qquad (2.10)$$

and which produces outputs of either 1 or 0.

From Eq. (2.9) (with $b = w_0$ and $x_0 = 1$) the expression

$$\sum_{i=0}^{p} w_i x_i = 0 \qquad (2.11)$$

can be represented as a line mapped in 2-dimensional (two-variable) Euclidean space to distinguish two separate classes of cases or data points. Vector x represents any case in the variable space. (A situation of two fully separate classes is idealistic and, in fact, not encountered in actual data sets.)

Fig. 2.13 Two distinct sets of cases or patterns (classes A and B) in 2-dimensional space.
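Putting Eq. (2.9), the Heaviside step (2.10) and the weights of Fig. 2.12 together, the sketch below checks the AND, OR and NOT truth tables with single-layer units; the weight values are those read off Fig. 2.12, the function names are mine.

```python
def heaviside(z):
    # Eq. (2.10): output 1 for positive net input, otherwise 0.
    return 1 if z > 0 else 0

def unit(inputs, weights, bias):
    # Eq. (2.9): weighted sum of the inputs plus the bias, then the step.
    return heaviside(sum(w * x for w, x in zip(weights, inputs)) + bias)

AND = lambda p, q: unit((p, q), (1, 1), -1.5)   # Fig. 2.12: weights 1, 1, bias -1.5
OR  = lambda p, q: unit((p, q), (1, 1), -0.5)   # Fig. 2.12: weights 1, 1, bias -0.5
NOT = lambda p:    unit((p,),   (-1,),   0.5)   # Fig. 2.12: weight -1, bias 0.5

for p in (0, 1):
    print("NOT", p, "=", NOT(p))
    for q in (0, 1):
        print(p, "AND", q, "=", AND(p, q), "|", p, "OR", q, "=", OR(p, q))
```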
Exclusive or (XOR)

One can easily find out that XOR is not possible to implement with a single (processing) layer of nodes by looking at Fig. 2.14 and noticing that a single feedforward or perceptron layer can correspond to linear mappings only. Namely, by locating a line in whatever position, it is not possible to distinguish the class of true outputs for {(1,0), (0,1)} from the class of false outputs for {(0,0), (1,1)} by using one line only.

Fig. 2.14 The two classes of XOR, at points (0,1), (1,0) and (0,0), (1,1), cannot be separated with one line.

Using two or more processing layers, XOR (operator ⊕) for inputs p and q,

$$p \oplus q = (p \lor q) \land \lnot (p \land q) \qquad (2.12)$$

can be implemented as depicted in Fig. 2.15.

Fig. 2.15 The two classes of XOR can be separated with more than one processing layer.
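Continuing the earlier sketch, XOR can be checked by composing the single-layer units according to Eq. (2.12); this is one possible wiring consistent with the equation, not necessarily the exact weights of Fig. 2.15.

```python
def heaviside(z):
    return 1 if z > 0 else 0

def unit(inputs, weights, bias):
    return heaviside(sum(w * x for w, x in zip(weights, inputs)) + bias)

def xor(p, q):
    # First processing layer: an OR unit and an AND unit (weights as in Fig. 2.12).
    n1 = unit((p, q), (1, 1), -0.5)    # p OR q
    n2 = unit((p, q), (1, 1), -1.5)    # p AND q
    # Second processing layer: (p OR q) AND NOT (p AND q), cf. Eq. (2.12).
    return unit((n1, n2), (1, -1), -0.5)

for p in (0, 1):
    for q in (0, 1):
        print(p, "XOR", q, "=", xor(p, q))
```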