Artificial Neural Networks
Raphael Cóbe
[email protected]
Introduction
Neural networks were widely used in the 1980s and 1990s, aiming to mimic the functioning of
the human brain. Their popularity declined in the late 1990s, but they came back into the
spotlight with new approaches based on deep learning. But how do they work? Let's first take
a look at the structure of a neuron.
• Receives input from other units and decides whether or not to fire;
• Approximately 10 billion neurons in the human cortex, and 60 trillion synapses or
connections [Shepherd and Koch, 1990];
• Energy efficiency of the brain is approximately 10⁻¹⁶ joules per operation per second,
against ≈ 10⁻⁸ in a computer;
• An artificial neural network can be seen as a directed graph with units (or neurons) situated at the vertices;
• Some are input units;
• Receive signal from the outside world;
• The remaining are named computation units;
• Each unit produces an output
• Transmitted to other units along the arcs of the directed graph;
• Imagine that you want to forecast the price of houses in your neighborhood;
• After some research, you found that 3 people sold houses for the following values:
• If you want to sell a 2,000 sq ft house, how much should you ask for it?
• How about finding the average price per square foot?
• $180 per sq ft, so the 2,000 sq ft house would go for about $360,000.
• ŷ = W x + b
• Gradient Descent:
• Finds the minimum of a function;
• Looks for the best weight values, minimizing the error;
• Takes steps proportional to the negative of the gradient of the function at the current point.
• The gradient is a vector that points in the direction of greatest increase of the function
(a short gradient-descent sketch follows below).
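As a concrete illustration (not from the original slides), here is a minimal sketch of gradient descent fitting ŷ = W x + b; the house data, learning rate, and iteration count are assumptions, chosen so the fitted slope matches the $180 per square foot figure above.

# Minimal sketch of gradient descent for the linear model y_hat = W*x + b.
# The data are hypothetical (size vs. price at $180 per sq ft).
import numpy as np

x = np.array([1.0, 1.5, 2.5])                     # size in thousands of sq ft
y = np.array([180_000.0, 270_000.0, 450_000.0])   # sale price

W, b = 0.0, 0.0     # start from arbitrary weights
lr = 0.05           # learning rate (step size)

for _ in range(5000):
    y_hat = W * x + b                 # current predictions
    error = y_hat - y
    grad_W = 2 * np.mean(error * x)   # gradient of the mean squared error w.r.t. W
    grad_b = 2 * np.mean(error)       # gradient w.r.t. b
    W -= lr * grad_W                  # step against the gradient
    b -= lr * grad_b

print(W, b)          # W -> ~180_000 per thousand sq ft (i.e., $180/sq ft), b -> ~0
print(W * 2.0 + b)   # predicted price of a 2,000 sq ft house: ~360_000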
The Perceptron builds on the neuron model formally proposed by McCulloch and Pitts in the
1940s, whose purpose was to mathematically model the human neuron. Although it served as
the basis for many algorithms, its discriminative power is limited, as it can only learn
hyperplanes as decision functions.
To simplify the notation, it is usual to bring θ to the left-hand side of the inequality and assign w0 = −θ.
Again, we will consider x0 = 1. Thus, we have the updated hypothesis function as follows:
h_w(x) = \begin{cases} +1 & \text{if } w^T x \geq 0 \\ -1 & \text{otherwise.} \end{cases} \qquad (2)
Recalling that ∥u∥ = √(u₁² + u₂²) denotes the length (magnitude) of u.
for all i = 1, 2, . . . , m.
Thus, the weight vector is rotated (through projections) to the other side. For a linearly
separable dataset, it has been mathematically proven that the Perceptron algorithm is
guaranteed to converge.
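Below is a minimal sketch of the Perceptron update just described: whenever a sample is misclassified, the weight vector is corrected toward it. The toy AND-style dataset and the epoch limit are assumptions used for illustration.

# Sketch of the Perceptron learning rule on a linearly separable toy dataset.
import numpy as np

def perceptron_train(X, y, epochs=100):
    """X: (m, n) samples; y: labels in {-1, +1}. Returns w, with the bias weight in w[0]."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend x0 = 1 for the bias term
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:   # misclassified sample
                w += yi * xi              # rotate w toward (or away from) xi
                errors += 1
        if errors == 0:                   # converged: every sample is on the right side
            break
    return w

def perceptron_predict(X, w):
    X = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.where(X @ w >= 0, 1, -1)

# Linearly separable data (AND-like), so convergence is guaranteed.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])
w = perceptron_train(X, y)
print(perceptron_predict(X, w))   # [-1 -1 -1  1]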
x1  x2  hw(x)           (logistic AND unit)
0   0   g(−30) ≈ 0
0   1   g(−10) ≈ 0
1   0   g(−10) ≈ 0
1   1   g(10) ≈ 1

x1  x2  hw(x)           (logistic OR unit)
0   0   g(−10) ≈ 0
0   1   g(10) ≈ 1
1   0   g(10) ≈ 1
1   1   g(30) ≈ 1

x1  hw(x)               (logistic NOT unit)
0   g(10) ≈ 1
1   g(−10) ≈ 0
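The weights implied by the arguments of g in these tables are [−30, 20, 20] (AND), [−10, 20, 20] (OR) and [10, −20] (NOT). The short sketch below (illustrative only) recomputes the three tables with a logistic unit.

# Sketch verifying the AND / OR / NOT tables above with a logistic unit.
import numpy as np

def g(a):
    return 1.0 / (1.0 + np.exp(-a))   # logistic (sigmoid) activation

def unit(w, x):
    return g(w[0] + np.dot(w[1:], x))  # w[0] is the bias weight, paired with x0 = 1

AND = [-30, 20, 20]
OR  = [-10, 20, 20]
NOT = [10, -20]

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(unit(AND, [x1, x2]), 3), round(unit(OR, [x1, x2]), 3))
for x1 in (0, 1):
    print(x1, round(unit(NOT, [x1]), 3))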
Suppose g(a) = 1/(1 + e⁻ᵃ). We have g′(a) = g(a)(1 − g(a)). Note that g′(a) saturates when
a > 5 or a < −5. Furthermore, g′(a) < 1 for all a. This means that for networks with many
layers, the gradient tends to vanish during training.
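A quick numerical illustration (not from the slides) of why this matters: since g′(a) ≤ 0.25, a product of such derivative terms across layers, as appears in the backpropagated gradient, shrinks geometrically.

# Sketch: the sigmoid derivative is at most 0.25, so a product of such terms
# across many layers (as in the backpropagated gradient) shrinks very quickly.
import numpy as np

def g(a):
    return 1.0 / (1.0 + np.exp(-a))

def g_prime(a):
    return g(a) * (1.0 - g(a))

a = 0.0                                   # best case: g'(0) = 0.25
for layers in (1, 5, 10, 20):
    print(layers, g_prime(a) ** layers)   # 0.25, ~9.8e-4, ~9.5e-7, ~9.1e-13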
• Maybe there is a combination of functions that could create hyperplanes separating the
XOR classes (as sketched below):
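One such combination, sketched here under the convention of the gate tables above, is x1 XOR x2 = (x1 OR x2) AND NOT(x1 AND x2), which a two-layer arrangement of logistic units can represent; the weights below are illustrative.

# Sketch: XOR built from logistic OR, NAND and AND units.
import numpy as np

def g(a):
    return 1.0 / (1.0 + np.exp(-a))

def unit(w, x):
    return g(w[0] + np.dot(w[1:], x))   # w[0] is the bias weight

def xor(x1, x2):
    h_or   = unit([-10, 20, 20], [x1, x2])       # x1 OR x2
    h_nand = unit([30, -20, -20], [x1, x2])      # NOT (x1 AND x2)
    return unit([-30, 20, 20], [h_or, h_nand])   # h_or AND h_nand

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(xor(x1, x2)))   # ≈ 0, 1, 1, 0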
• By increasing the number of layers we increase the complexity of the function represented by
the ANN:
In the illustration above, a_i^{(j)} denotes neuron i of layer j, and W^{(l)} is the weight
matrix connecting layers l and l + 1. This architecture is generally represented as n:3:1.
The final decision function of the neural network is given by the following formulation:
h_w(x) = a_1^{(3)} = g\left( w_{01}^{(2)} a_0^{(2)} + w_{11}^{(2)} a_1^{(2)} + w_{21}^{(2)} a_2^{(2)} \right). \qquad (6)
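A minimal sketch of the corresponding forward pass, following the a^{(l)} / W^{(l)} notation above; the input dimension (n = 2), the 3-unit hidden layer, and the random weights are assumptions.

# Sketch of a forward pass for a small n:3:1 network with logistic activations.
import numpy as np

rng = np.random.default_rng(0)

def g(a):
    return 1.0 / (1.0 + np.exp(-a))

n = 2                                  # input dimension (assumed)
W1 = rng.normal(size=(3, n + 1))       # hidden layer: 3 units, +1 column for the bias unit a0 = 1
W2 = rng.normal(size=(1, 3 + 1))       # output layer: 1 unit

def forward(x):
    a1 = np.concatenate(([1.0], x))           # a^(1): input with bias unit a0 = 1
    a2 = np.concatenate(([1.0], g(W1 @ a1)))  # a^(2): hidden activations with bias unit
    a3 = g(W2 @ a2)                           # a^(3) = h_w(x), as in equation (6)
    return a3

print(forward(np.array([0.5, -1.0])))         # a single output in (0, 1)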
The same happens with the label y of each sample, which now becomes a vector y ∈ ℝ³.
By analogy, the above formulation encompasses a neural network with c logistic regressors if
we have a logistic activation function in the output layer.
\frac{\partial J(w)}{\partial w_{ij}^{(l)}}, \qquad (8)
where l = 1, 2, . . . , L − 1, and L denotes the number of layers in the neural network.
\delta^{(l)} = \big( W^{(l)} \big)^T \delta^{(l+1)} \mathbin{.*} g'\big( b^{(l)} \big). \qquad (10)
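Here ".*" denotes element-wise multiplication. The sketch below (assumptions: squared-error loss, sigmoid activations, the small 2:3:1 layout used earlier, and b^{(l)} standing for the pre-activations) applies the recursion in (10) and then forms the partial derivatives of (8) as outer products of δ^{(l+1)} with a^{(l)}.

# Sketch of backpropagation for a 2:3:1 logistic network with J = 1/2 (a3 - y)^2.
import numpy as np

rng = np.random.default_rng(0)
g = lambda a: 1.0 / (1.0 + np.exp(-a))
g_prime = lambda a: g(a) * (1.0 - g(a))

W1 = rng.normal(size=(3, 3))   # maps [1, x1, x2] -> 3 hidden pre-activations
W2 = rng.normal(size=(1, 4))   # maps [1, hidden(3)] -> 1 output pre-activation

def backprop(x, y):
    # Forward pass, keeping the pre-activations b^(l) for the derivative terms.
    a1 = np.concatenate(([1.0], x))
    b2 = W1 @ a1
    a2 = np.concatenate(([1.0], g(b2)))
    b3 = W2 @ a2
    a3 = g(b3)

    # Output delta, then the recursion delta^(l) = (W^(l))^T delta^(l+1) .* g'(b^(l)).
    delta3 = (a3 - y) * g_prime(b3)
    delta2 = (W2[:, 1:].T @ delta3) * g_prime(b2)   # drop the bias column of W2

    # Partial derivatives dJ/dW^(l) as outer products of delta^(l+1) and a^(l).
    grad_W2 = np.outer(delta3, a2)
    grad_W1 = np.outer(delta2, a1)
    return grad_W1, grad_W2

gW1, gW2 = backprop(np.array([0.5, -1.0]), np.array([1.0]))
print(gW1.shape, gW2.shape)   # (3, 3) (1, 4), matching W1 and W2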
Note
Softmax: a function that takes as input a vector of K real numbers and normalizes it into a
probability distribution.
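A minimal sketch of the softmax function (the max-shift for numerical stability is a standard trick, not from the slides):

# Sketch of softmax: maps K real numbers to a probability distribution.
import numpy as np

def softmax(z):
    z = z - np.max(z)          # shift for numerical stability (does not change the result)
    e = np.exp(z)
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())              # ≈ [0.659 0.242 0.099], sums to 1.0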
Note
The logarithm changes the scale of the penalty, so the error can grow very quickly as
predictions get worse;
Note
As the predicted probability for the correct class decreases, the log loss increases rapidly:
for instance, when the model should answer 1 but assigns it a very low probability;
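A short numerical illustration of that rapid growth:

# Sketch: the log loss -log(p) for the correct class grows rapidly as the
# predicted probability p for that class approaches zero.
import numpy as np

for p in (0.9, 0.5, 0.1, 0.01, 0.001):
    print(p, round(-np.log(p), 3))   # 0.105, 0.693, 2.303, 4.605, 6.908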
• Bias: the difference between the expected (or average) prediction of our model and the
correct value we are trying to predict.
• Imagine repeating the entire model-building process more than once:
• Each time you gather new data and run a new analysis, you create a new model.
• Due to randomness in the underlying data sets, the resulting models will have a variety of
predictions.
• Variance measures how much the predictions of these models vary from one run to the next,
i.e., their spread around the average prediction (a small simulation follows after this list).
• Our model has bias if it systematically predicts below or above the target variable.
• Low bias: suggests fewer assumptions about the shape of the target function.
• Regression Trees, KNN Regression.
• High bias: suggests more assumptions about the shape of the target function.
• Linear Regression, Logistic Regression.
• Low variance: suggests small changes in the estimated target function with changes in
the training data.
• Linear Regression, Logistic Regression.
• High variance: suggests large changes in the estimated target function with changes in
the training data.
• Regression Trees, KNN Regression.
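The following simulation is entirely illustrative (the sine data-generating process, noise level, and the two polynomial models are assumptions): it repeats the model-building process many times and estimates bias and variance of each model's prediction at a fixed point. The straight line plays the high-bias/low-variance role and the flexible polynomial the low-bias/high-variance role.

# Sketch: estimating bias and variance by repeating the model-building process
# on freshly sampled data (illustrative setup: noisy sine data, two polynomial fits).
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(2 * np.pi * x)
x0 = 0.2                                   # fixed point where we inspect predictions

def experiment(degree, runs=500, n=20, noise=0.3):
    preds = []
    for _ in range(runs):
        x = rng.uniform(0, 1, n)                      # a brand-new dataset each run
        y = true_f(x) + rng.normal(0, noise, n)
        coefs = np.polyfit(x, y, degree)              # build a new model
        preds.append(np.polyval(coefs, x0))           # its prediction at x0
    preds = np.array(preds)
    bias = preds.mean() - true_f(x0)                  # systematic offset from the truth
    variance = preds.var()                            # spread across repeated fits
    return round(bias, 3), round(variance, 3)

print("degree 1:", experiment(1))   # larger bias, smaller variance
print("degree 6:", experiment(6))   # smaller bias, larger variance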
• Dropout layers:
• Randomly disable some of the neurons during the training passes.
# Drop half of the neuron outputs from the previous layer
self.fc1_drop = nn.Dropout(0.5)
• Note:
• "Drops out" a random set of activations in that layer by setting them to zero.
• Forces the network to learn redundant representations.
• The network should still provide the right classification for a specific example even if some
of the activations are dropped out (see the full example below).
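For context, here is a minimal sketch (assumptions: input size 784, hidden size 100, 10 classes) of a small PyTorch network that places the nn.Dropout(0.5) layer from the snippet above between two fully connected layers; dropout is only active in training mode.

# Sketch: a small PyTorch classifier using the Dropout layer shown above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 100)
        # Drop half of the neuron outputs from the previous layer (training only)
        self.fc1_drop = nn.Dropout(0.5)
        self.fc2 = nn.Linear(100, 10)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc1_drop(x)      # randomly zeroes activations while model.train() is set
        return self.fc2(x)

model = Net()
model.train()                     # dropout active during training passes
out = model(torch.randn(4, 784))
model.eval()                      # dropout disabled at evaluation time
print(out.shape)                  # torch.Size([4, 10])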