Introduction To Deep Learning - With Complete Python and TensorFlow Examples - Jürgen Brauer
August 2018
Contents
5 The Perceptron 64
5.1 The Perceptron neuro-computer . . . . . . . . . . . . . . . . . . . . . . 64
5.2 Perceptron learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.3 Perceptron in Python . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.4 Limitations of the Perceptron . . . . . . . . . . . . . . . . . . . . . . . 76
6 Self-Organizing Maps 82
6.1 The SOM neural network model . . . . . . . . . . . . . . . . . . . . . 82
6.2 A SOM in Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.3 SOM and the Cortex . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
8 TensorFlow 140
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
8.2 Training a linear model with TensorFlow . . . . . . . . . . . . . . . . . 149
8.3 A MLP with TensorFlow . . . . . . . . . . . . . . . . . . . . . . . . . . 151
12 Exercises 216
12.1 Ex. 1 - Preparing to work with Python . . . . . . . . . . . . . . . . . 216
12.2 Ex. 2 - Python syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
12.3 Ex. 3 - Understanding convolutions . . . . . . . . . . . . . . . . . . . . 223
12.4 Ex. 4 - NumPy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
12.5 Ex. 5 - Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
12.6 Ex. 6 - Speech Recognition with a SOM . . . . . . . . . . . . . . . . . 233
12.7 Ex. 7 - MLP with feedforward step . . . . . . . . . . . . . . . . . . . . 234
12.8 Ex. 8 - Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . 235
12.9 Ex. 9 - A MLP with TensorFlow . . . . . . . . . . . . . . . . . . . . . 236
12.10 Ex. 10 - CNN Experiments . . . . . . . . . . . . . . . . . . . . . . . . 237
12.11 Ex. 11 - CNN for word recognition using Keras . . . . . . . . . . . . . 238
12.12 Ex. 12 - Vanishing gradients problem . . . . . . . . . . . . . . . . . . . 239
12.13 Ex. 13 - Batch normalization in TensorFlow . . . . . . . . . . . . . . . 240
1 What is Deep Learning?
Christine: I studied the whole day for my oral exam in psychology tomorrow.
She was studying for a Magister degree, for which you have to choose a major and two
minors. Psychology was one of her minors.
Jürgen: Did you know that after my intermediate diploma I changed from the University
of Trier to the University of Bonn because I am very interested in neuroscience? They
have a department of neuroinformatics here.
Jürgen: You simulate the functioning of biological neurons with the help of artificial
neuron models and connect many of them into artificial neural networks.
Jürgen: They are called neurons! Christine, you really should know what a neuron is
if this is the evening before your oral exam in psychology! Neurons are the building
blocks of our brains! They are the basis of all mental processes.
She rapidly finished her dinner, then went to her room to read up on the topic. After
some minutes the door opened and she asked:
Jürgen: Protons!?
Christine: grins at me
@Christine: please forgive me for making this conversation public. You really were a
funny flatmate.
Artificial Neural Networks (ANNs). In the field of Deep Learning the function
of these neurons is modeled by simple computing units, called neurons as well. These
technical neuron models are then connected with each other to form so-called artificial
neural networks. At this point another important idea is borrowed from nature, namely
the way they are connected to each other. From neuroscience it is known that in
some parts of the brain, biological neurons form "layers of neurons", in the sense
that neuron connections run mainly from one layer to another and only sparsely between
neurons in the same layer. This observation led to the idea of feedforward neural
networks (FFNs). If, further, a certain technical neuron model called the Perceptron
is used and each neuron in layer i gets its input from all neurons of the previous
layer i − 1, these artificial neural networks are called Multilayer Perceptrons (MLPs).
This fully connected criterion does not hold in general for all FFNs.
ANNs are not new. The MLP model in particular is not new: already in the 1980s, MLPs
were a popular approach to machine learning. However, the model was not suitable
in practice for solving the important machine learning tasks of image classification and
object localization, and after some hype about neural networks, the machine learning
community moved on to other techniques, such as, e.g., Support Vector Machines (SVMs).
CNNs drive the boom. The new boom of the field has to be attributed mainly to
a variant of feedforward neural networks which is called the Convolutional Neural
Network (CNN) and was introduced by LeCun [27]. CNNs do not only use Perceptrons
in the so-called convolutional layers and the classification layers at the end of the
network, but they also use other simple computing elements, as, e.g., a maximum
operation in the pooling layers. Probably the key factor for the success is the insight
that using a hierarchy of neuron layers, where the neurons represent more and more
complex features as we go up in the hierarchy of layers, can simplify all pattern
recognition tasks, such as classification and localization of objects in images.
Receptive fields. Furthermore, another important idea from biological visual systems
was stolen: the computing elements have receptive fields similar to biological
neurons. This means that they do not get their input from the whole image area, but
from small restricted regions of the image. As a consequence, in lower layers
many local classifications are made, and the classification results are propagated to the
next layer, which again does not get its input (indirectly) from the whole image but
from a subarea of it.
Missing ingredients. However, LeCun presented his work [27] already in 1989! Why
did it take more than 20 years for the boom to start? The common understanding in the
community is that at least the following further ingredients were missing:
Figure 1.1: A feedforward neural network (FFN) model steals two important ideas
from nature: 1. Brains use simple computing elements, called "neurons". In an FFN
their functionality is roughly modeled by technical neuron models, as, e.g., Perceptrons.
2. Neurons form "layers" in brains in the sense that neurons within a layer are
only sparsely connected with each other and densely connected between layers. Note that in
real brains these connections can go to many previous and following layers, whereas
in a standard FFN the connections only go to the next layer.
1. More data. The datasets that CNNs are currently trained on are really large. In
the 1980s storage restrictions did not allow storing datasets with millions of images.
Now, storage space for millions of images is no longer a problem. Furthermore,
large sets of images can be collected easily from the WWW.
2. Faster computing hardware. Even if enough data storage and large datasets
had been available in the past, the computing power to process these datasets simply
was not available. With the development of faster and multi-core CPUs and of GPUs,
it is now much easier to train a CNN consisting of many layers with many neurons.
3. Better transfer functions. Technical neuron models use transfer functions
to model the neuronal firing rate as a function of the input activity. It was shown in
an important paper by Krizhevsky et al. [24] that a transfer function called the Rectified
Linear Unit (ReLU) gives better results than the transfer functions usually used
before and, equally important, that this transfer function is much faster to compute.
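To make the comparison concrete, here is a minimal NumPy sketch (not from the book's code, values chosen for illustration) of the two transfer functions: the logistic function needs an exponential per value, whereas the ReLU is just a comparison with zero.

import numpy as np

# Activations of some neurons (weighted input sums)
act = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])

# Logistic (sigmoid) transfer function: one exponential per value
def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

# Rectified Linear Unit: only a comparison with zero
def relu(x):
    return np.maximum(0.0, x)

print(logistic(act))  # values squashed into (0, 1)
print(relu(act))      # negatives clipped to 0, positives unchanged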
Figure 1.3: Going deeper. For the important CNN models presented by research teams in
recent years, more and more layers of neurons have been used.
Figure 1.4: The principle of receptive fields is another important idea that is used
in CNNs and which is clearly inspired by nature. It means that a neuron does not
get its input from all other neurons of a previous layer, but just from a small subset.
This first allows classifying locally and communicating the classification result to the
following neuron, which combines classification results from several previous neurons
and therefore has a larger effective receptive field.
Figure 1.5: Convolutional Neural Networks were already invented in 1989 by LeCun.
However, success came only about 20 years later due to some missing ingredients.
Neurons as a glue for models. This does not mean that we should forget the neu-
rons and start building new models without neurons. Modeling information processing
and information representation with the help of neural networks is still a promising
approach to tackle the problem of building a strong AI. A strong AI is an artificial
intelligence with general knowledge as opposed to a weak AI, an AI without general
knowledge, but very competent in some niche, e.g., playing chess or Go. If different
research groups share a common base, namely neural networks, for building individual
models that solve problems as, e.g., object localization, object and scene classification,
object tracking, attention mechanisms, learning of movements for a robot, etc., there
is a better chance that we will be able to ”glue” these models together as an artificial
brain and put it into a robot one day. For sure, "gluing" different models together will
be one of the trickiest parts and will often mean combining existing neural models
into a new model that is different from the individual models.
1. exploiting local synaptic learning rules (CNNs use supervised learning and non-local
learning rules) where the learning rules were supposed to be driven just by the mas-
sive correlation information inherently in spatio-temporal data (image, videos, music,
etc.), 2. the usage of spatio-temporal correlation information across sensor modalities
for realizing what today is called ’sensor data fusion’, and 3. the usage of so called
spiking neurons, which are biologically much more plausible. I sent this proposal to
the German neuroscientist Prof. Christoph von der Malsburg. It seemed to me that
he was someone who had devoted his life to neurosciences [30] and that he should
recognize the importance of these ideas. And he really helped me! Since he had no
positions at this time, he sent my proposal to one of his colleagues who called me
on the phone. After some discussion on the phone, an interview followed and finally
he offered me a PhD position at the newly founded Frankfurt Institute for Advanced
Studies (FIAS). Unfortunately, there was no full 100% payment for PhD students at
this time and due to some financial obligations I had (I had bought a car and all the
furniture for my first flat with the help of a loan), I had to reject the offer.
Neuroscience as a treasure. What do I want to tell you with this anecdote? First,
if you are searching for a PhD position or if you have an interesting idea for a master
thesis, do not hesitate to send your ideas to people who could have similar interests
and ideas and who will understand that you are the right person. Second, in my eyes,
neuroscience is a treasure and there are many more ideas that we can borrow from nature
to build better pattern recognition systems! There is now a common agreement in the
machine learning community that borrowing the idea of hierarchical representations
was the right step. The natural question arises: Are there any other cool ideas that
we could exploit?
”Deep Learning” is not a good name for the field. With this in mind, it is
clear, that the name ”Deep Learning” is actually not very appropriate if we think
Figure 1.6: A stronger interplay between neuroscience and machine learning could
help to harvest and understand more fruitful ideas than just the idea of hierarchical
representations and receptive fields used in CNNs.
of a new field in machine learning that 1. uses neurons as its basic computing units
and 2. tries to exploit more principles than just hierarchical representations from the
biological role model in the near future. What would be a better name for the field?
I have one. We could call the field "Neural Networks"! Awesome, isn't it?
no Neural Networks!". It is interesting to see how quickly things have changed in recent
years. In March 2017 I visited my old colleagues. They told me that they now heat
their offices in the winter with Nvidia graphics cards + Deep Learning algorithms.
Is it cold? → python train_cnn_using_imagenet_on_gpu.py
biological neurons.
Neural networks with neurons. In this book I want to introduce the topic of Deep
Learning and neural networks by adopting exactly the opposite position. The reasons
are mentioned above. In my eyes, the field of Deep Learning and neural networks
can make major advances if there is a continuous exchange between neuroscience and
machine learning. For this reason, the book will also present some interesting facts from
neuroscience and not just technical neuron models.
Figure 1.7: In the supervised learning scenario for each input pattern (here: images of
numbers), a label is needed (”This is a four!”). In the unsupervised learning scenario
only input patterns are available.
2 Deep Learning: An agile field
Enormous growth of interest. The diagram in Fig. 2.1 plots the data for two search terms
over the last 13 years. Support Vector Machines (SVMs) serve as the example of a classical
machine learning classifier. In 2004 search queries for "support vector machine" were
much more frequent than those for "deep learning". But around 2012 things turned, and
in the last five years the interest in "deep learning" has increased enormously worldwide.
Figure 2.1: Strong growth of interest in DL worldwide between January 2004 and
September 2017. [Data source: Google Trends]
In September 2017, "deep learning" was entered 18 times more often than
"support vector machine".
US before Germany. This change did not happen at the same time everywhere in
the world. Compare Fig. 2.2 and Fig. 2.3. These figures show the frequencies of the
two search terms between September 2012 and September 2017 restricted to search
queries coming from the United States and Germany respectively. The comparison
shows that the topic of Deep Learning gained more interest than SVMs in the United
States around 2013, while the same change happened in Germany about a year later.
Computer Vision before NLP. The change also did not happen in every machine
learning subfield at the same time. In the subfield of computer vision, a DL model
presented by Alex Krizhevsky et al. [24] won the 2012 ImageNet Large Scale Visual
Recognition Challenge (ILSVRC). In the subfield of Natural Language Processing
(NLP) the revolution came later as the following quote by Christopher Manning [31]
from Stanford University suggests:
”Deep Learning waves have lapped at the shores of computational linguis-
tics for several years now, but 2015 seems like the year when the full force
of the tsunami hit the major Natural Language Processing (NLP) confer-
ences.”
Figure 2.2: Strong growth of interest in DL in the US started around 2013. [Source:
Google Trends]
Figure 2.3: Strong growth of interest in DL in Germany started around 2014. [Source:
Google Trends]
Another personal anecdote. At the beginning of the year 2016 I was torn between
the idea of continuing my job as a professor and the idea of joining the Deep Learning
hype by accepting a job in industry. It is interesting to see that many other people
have similar ideas now. People at Quora ask, e.g., ”Should I quit my job and spend 3
years to learn machine learning and deep learning if I have about $200k in savings?”
or ”Is it worth it to quit my USD $150K software developer job to study machine learn-
ing and deep learning?" (I had neither $200k in savings nor a salary of $150k, so it
should have been easier for me to quit my job). When I stumbled upon the website of
a young start-up called MetaMind I became really interested. They developed Deep
Learning models for computer vision and Natural Language Processing (NLP) and
the CEO, Richard Socher, a former Stanford Ph.D. student of Andrew Ng, seemed
to be excellent in both fields. I finally applied. And indeed, I got an invitation for
a Skype interview with Caiming Xiong and Richard Socher. In order to prepare for
the interview I went again to their website and read a notice: "MetaMind joins Salesforce",
which meant that this young start-up had been acquired by Salesforce in the
meantime. By the way, Richard gave me two weeks for a very interesting coding task
on sentiment analysis, but my job at the University was so demanding that I could
not find even a minute to work on it. The only time slot in my daily schedule, since I
started my job as a professor, that is not filled with work is 12 a.m. to 6 a.m., and it is
usually reserved for sleep (which is important for learning and forming memories). So
I withdrew my application. Nevertheless, Richard was very friendly and said I could
resume my application whenever I wanted. Very good style!
What do I want to tell you with this anecdote? Also at the level of companies
you can see that the field of Deep Learning is an extremely agile field. Maybe you
will apply to a young start-up as well, one that will be acquired within a few months by
one of the big players.
Over 250 private companies focusing on artificial intelligence (AI) have been acquired
between 2012 and mid 2017 [6] by corporate giants like Apple, Intel, Google, IBM,
Samsung, Ford, GE, and Uber. Many of them use Deep Learning techniques.
Google
Google, e.g., acquired DNNResearch in 2013 for about $5 million [42]. DNNResearch
was a small Canadian start-up founded by three researchers: Prof. Geoffrey Hinton
from University of Toronto and two of his graduate students, Alex Krizhevsky and Ilya
Figure 2.4: A large number of Deep Learning start-ups have already been acquired by
the usual suspects.
Sutskever [26]. It allowed Google to offer its Google+ Photo search just six months
after the acquisition.
Another important acquisition by Google was DeepMind Technologies, a British company,
for which Google paid about $650 million in 2014 [42]. Facebook was also interested in
DeepMind Technologies but Google won the race. DeepMind actually has no concrete
products, but focuses on AI research using Deep Learning techniques. Well known re-
search results are so called deep reinforcement learning algorithms for learning (Atari)
video games from scratch and the Neural Turing Machine: a neural network model
that is able to use a tape as an external memory, similar to the Turing machine model
used in theoretical computer science. In 2016 the company made headlines when its
program AlphaGo beat a human professional Go player. The game was invented in
ancient China more than 2500 years ago and no computer program before had
ever reached professional play level.
In 2016 Google acquired the French visual search start-up Moodstocks, located in Paris,
which focuses on providing DL-based image recognition functionality on mobile devices
via an API. Recently, Google bought Halli Labs, an only 4-month-old Indian
AI and machine learning (ML) startup located in Bangalore.
There are a lot of other DL-technology-based companies that Google acquired. The list is
really long [43]. Three further examples of Google's acquisitions in 2014 are Jetpac
(a city guide for over 6000 cities based on automatically analyzing Instagram photos)
and the two spin-offs from the University of Oxford: Dark Blue Labs (specialized in DL
for understanding natural language) and Vision Factory (DL for visual object recognition).
Intel
Three startups were acquired by Intel in 2016 alone:
Itseez, which was founded "already" in 2005, developed a suite of DL-based algorithms
for realizing Advanced Driver Assistance Systems (ADAS), e.g., for detecting when a car
drifts from its lane or for braking automatically when a pedestrian crosses the road.
Nervana was acquired by Intel for approximately $408 million [12]. Nervana developed
a DL framework called "Neon", an open-source Python-based language and
set of libraries for developing deep learning models. But the acquisition also brought
talented chip designers to Intel, since Nervana was developing a custom ASIC, called
the Nervana Engine. The chip is optimized for DL algorithms by focusing on
operations heavily used by them.
Movidius specializes in low-power processor chips for computer vision and deep
learning. One of its products, released in July 2017, is the "Neural Compute Stick"
(NCS): a USB stick which contains Movidius' Myriad 2 processor. This is a
Vision Processing Unit (VPU) that accelerates vision and deep learning algorithms.
In contrast to a Graphics Processing Unit (GPU), a VPU lacks specialized hardware
for, e.g., rasterization and texture mapping, but provides direct interfaces to grab data
from cameras and allows for massive on-chip dataflow between the parallel processing units.
Facebook
In 2012 Facebook announced its acquisition of the Israeli facial recognition company
Face.com. The company developed a platform for facial recognition in photos uploaded
via web and mobile applications. The technology continued in Facebook’s internal
project DeepFace, a deep learning facial recognition system created by a research
group at Facebook.
Mobile Technologies, the maker of the Jibbigo translation app and a leader in speech
recognition and translation, was acquired by Facebook in 2013. Jibbigo was the world's
first speech-to-speech translation app when it was launched in October 2009.
In 2016 Facebook bought Zurich Eye, a spin-off founded by ten researchers from ETH Zurich
that concentrates on developing a solution for providing accurate position information
for robots navigating indoor and outdoor environments using camera and inertial
measurement unit (IMU) data.
Apple
Perceptio was acquired by Apple in 2015. It used the DL approach to develop a
solution for smartphones that allows phones to classify images without relying on
external data libraries, i.e., without having to be online.
In 2017 Apple paid around $200 million to acquire Lattice Data [11], a firm that
develops algorithms to turn unstructured data such as text and images into structured
data. Lattice Data is located in Menlo Park, California and tried to commercialize a
Stanford University research project known as ”DeepDive”.
Apple also bought the Israeli startup RealFace in 2017, a cybersecurity and machine
learning company specialized in automatic face recognition.
memory).
Another personal anecdote. As a student I worked two days per week at the Uni-
versity of Bonn as a student assistant in different projects (Retina implant, Growing-up
robots) at the department of neuroinformatics. One day I helped to bring some stuff
to the basement and my diploma thesis supervisor said: ”Look Jürgen! This is an old
neurocomputer, called SYNAPSE. Prof. Anlauf from the other department helped to
build it." This old neurocomputer was a piece of neuroscience history and now it stood
there gathering dust. What a pity! He continued: "If you want to speed up
neural networks and you have 5 years, you can choose between two alternatives. Ei-
ther you invest 5 years to develop a new special chip for speeding up neural networks,
which is then restricted to a few models. Or you wait for 4 years, buy the newest
hardware at that time for a fraction of the cost, and you still have a year to develop
any kind of neural network model on that hardware in the remaining time, and the
resulting speed will be the same as with the first alternative." I am not sure
whether my old diploma thesis supervisor is still right with his prediction. Regarding
the past, I would say: he was right. But things seem to change now, since a big player
is investing a lot in its own AI accelerator platform. I will not give you the name of
this big player.
Google
The Tensor Processing Unit is revealed. At Google I/O in May 2016
it was made public that Google had designed a custom ASIC (Application-Specific
Integrated Circuit, i.e., "a special chip") that was developed specifically for speeding up
their machine learning applications. Norm Jouppi, perhaps the most important head
behind the development of this new chip, revealed in a post [18] on the same day that
these custom ASICs had already been deployed in Google data centers a year before
this Google I/O announcement. The new chip was named Tensor Processing Unit
(TPU), and it was also revealed that AlphaGo was already powered by TPUs in the
matches against the Go world champion Lee Sedol. So the Deep Learning algorithms
that won against the Go world champion already ran on special hardware.
Simple design. Fast development. Only recently [19] have details been published
about the internals of a TPU. The development time was extremely short: the design,
build, and deployment in data centers took only 15 months. The TPU was designed
to be a co-processor, similar to the old Floating-Point Units (FPUs), connected to the
CPU by a PCIe Generation 3 I/O bus with 16 lanes. In order to simplify the hardware
design and testing, it was decided that the TPU does not fetch instructions from
memory by itself. Instead, the CPU has to send the instructions to the TPU
into an instruction buffer. Internal function blocks in the TPU are connected by
256-byte-wide paths!
Figure 2.5: Schematic sketch of the function blocks of a Tensor Processing Unit (TPU).
TPUs are optimized for matrix operations as, e.g., multiplying a matrix of neuron
activations with a matrix of weights.
Matrix Multiplication Unit. The heart of a TPU is the Matrix Multiplication Unit
(MMU). It contains a huge number of simple units called MACs (Multiply-
Accumulate units). A single MAC unit computes the product of two numbers and adds
that product to an accumulator. The MMU of a TPU contains 256x256 = 65,536 of
such MACs, which can perform 8-bit multiply-and-adds on signed or unsigned integers.
Since matrix multiplications and convolutions are currently the most important
operations for deep neural network models, the design of the TPU was optimized to
speed up exactly these operations.
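As an illustration of the multiply-accumulate idea (a NumPy sketch with toy 4x4 matrices, not TPU code; the real MMU works on 256x256 tiles), an 8-bit activation matrix is multiplied with an 8-bit weight matrix and the products are summed into wider accumulators:

import numpy as np

# Toy 8-bit inputs: neuron activations and weights
activations = np.random.randint(0, 128, size=(4, 4), dtype=np.int8)
weights = np.random.randint(-64, 64, size=(4, 4), dtype=np.int8)

# Each MAC computes accumulator += a * w; accumulate in a wider integer
# type so the sums do not overflow the 8-bit range
accumulators = np.zeros((4, 4), dtype=np.int32)
for i in range(4):
    for j in range(4):
        for k in range(4):
            accumulators[i, j] += int(activations[i, k]) * int(weights[k, j])

# The triple loop is exactly the definition of a matrix multiplication
assert np.array_equal(accumulators,
                      activations.astype(np.int32) @ weights.astype(np.int32))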
Data flow. The inputs for the MMU come from two memory structures, called "weights"
and "unified buffer" (the latter containing, e.g., the activation values of neurons). The 16-bit
products of the MACs are stored in a third memory structure, called "accumulators".
These outputs are then transmitted to a unit called "activation", which can apply
the non-linear transfer function of a neuron (such as ReLU, sigmoid, etc.) and pooling
operations. The results are then stored in the unified buffer again.
TPU block diagram. A simplified schematic block diagram of a TPU can be found
in Fig. 2.5. Note that according to the floor plan of the TPU die, most of the chip area
is taken up by the MMU (24% of the chip area) and the "unified buffer" (29% of the chip
area) for storing neuronal activations.
2nd generation TPUs and Cloud TPUs. The second generation of Google's TPUs
was presented in May 2017 [7]. While the first-generation TPUs were designed for
inference (application of already trained models) and not for training new models,
the second-generation TPUs were designed for both training and inference. A single
2nd-generation TPU now provides 180 TFLOPS (180,000,000,000,000 = 180 trillion
floating point operations per second) and can be connected into a "pod"
of 8x8 TPUs, which provides as much as 11.5 PFLOPS (= 11,500,000,000,000,000 = 11.5
quadrillion floating point operations per second). These new TPUs are now also
available to the public through the Google Compute Engine and are called Cloud
TPUs.
Nvidia
Many chip manufacturers want a piece of the cake labeled "Deep Learning
boom". Nvidia announced in May 2017 [10] that it would soon offer a new computing
platform, called the Nvidia Tesla V100, that provides 120 TFLOPS [10]. For embedded
systems such as robots and drones, the Nvidia Jetson TX1 and the new TX2 computing
boards can be used. Since an important application for DL algorithms is Advanced
Driver Assistance Systems (ADAS), Nvidia also tries to offer products for this market
segment with the Nvidia Drive PX and the new Xavier AI Car Supercomputer
board announced in January 2017.
Theano. Theano is named after a Greek mathematician who may have been Pythagoras'
wife. Technically, Theano is actually not a machine learning library, but a numerical
computation library for Python which makes use of NumPy (a Python library for
support of large, multi-dimensional arrays and matrices). Theano has been developed
at Yoshua Bengio's machine learning lab at the University of Montreal with the goal
of supporting rapid development of efficient machine learning algorithms. It is possible
to use another library, called Keras, on top of Theano.
Keras. Keras can be seen as a comfortable interface (a front-end) for machine learning
programmers which can work with many different machine learning libraries as
back-ends. Currently Keras provides support for using MXNet, Deeplearning4j,
TensorFlow, CNTK or Theano as a back-end.
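As a small illustration of what the front-end role means in practice (a minimal sketch, not taken from the book, and independent of which back-end is configured for Keras), defining and compiling a tiny model could look like this:

# Minimal Keras sketch: the same code runs on top of whichever
# back-end (e.g., TensorFlow or Theano) is configured for Keras.
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(32, activation='relu', input_dim=784))  # hidden layer
model.add(Dense(10, activation='softmax'))              # 10 output classes

model.compile(optimizer='sgd',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(x_train, y_train, epochs=5) would start training,
# with x_train / y_train provided as NumPy arrays.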
TensorFlow. TensorFlow was developed by the Google Brain team, first only for
internal use at Google. Before, the closed-source library DistBelief was used. Tensor-
Flow was then released under the Apache 2.0 open source license on 9 November 2015.
Version 1.0.0 was released on 11 February 2017. However, although it is not even two
years old at the time of writing this text, TensorFlow is now one of the most popular
GitHub repositories and perhaps the most widely used library for Deep Learning, with
a lot of examples on the web.
Torch and PyTorch. Torch is a machine learning library with an API written in
Lua. It is used by large companies such as Facebook and Twitter. Facebook open-sourced
PyTorch, a Python API for Torch, in January 2017. It is therefore quite new
and still in beta.
3 The biological role model: The Neuron
brain. Each day new brain structures were dissected by the course participants and
their presumed functions were discussed. Seeing and holding the different brain structures directly
in my hands was a profound experience. It was like opening the tower of my first
personal computer when I was a pupil and looking at all the cables and electronic
components on the motherboard. You do not really know what they do. Nevertheless,
I have the feeling that it helps to see and touch things in order to start the process of
understanding. However, I will never forget the pungent smell of the formalin that was
used to preserve the brains. Open desktop towers really smell better...
Figure 3.1: This is me as a student in 2001, taking part in a hands-on course on brain
structures at the University of Frankfurt. Yes, it is a real neural network that I am
holding in my hands! [Source: private photo archive]
Neurons are quite small. There are a lot of brain structures that can be distin-
guished already by the human eye without using a microscope. Nevertheless, all these
different brain structures use the same building block: Neutrons! ... Uhm ... Neurons!
You cannot see these neurons with your own eyes. They are too small. You need a
Figure 3.2: Some more impressions from my first brain dissection [Source: private
photo archive]
microscope. However, there are very different neuron types. Depending on the neuron
type, the size and form vary greatly. The most important part of a neuron, the
neuron's soma (cell body), can have a diameter between 5 µm and 100 µm (remember:
1000 µm = 1 mm). Take a tape measure and observe how large 1 mm is. In a cube
of this side length (1 mm³) you can find roughly between 20,000 and 40,000 neurons (see
[40], p. 18), depending on the brain region you consider! See Fig. 3.3 to get an impression
of how densely neurons can be packed due to their small size.
Neuronal tissue. These neurons connect to each other and thereby build a dense
tissue. The part of the tissue that we can see with our own eyes and that is gray
consists mainly of the cell bodies of the neurons (gray matter), while the white-looking part
of this tissue consists of the axons (white matter). This neuronal tissue is able to do all the
magic that we would like to do with computers but cannot do today. Imagine
that this tissue allows you to feed visual information into some of these neurons
and it will compute whether it is the face of your mother or your girlfriend.
Figure 3.3: Neurons are very small. Therefore a large number of neurons fit into small
volumes.
Imagine that another piece of tissue outputs electrical signals that
activate muscles in our arms such that we can play a melody on a piano. Imagine
that some other piece of neural tissue analyzes smells ("Hey! This smells like
formalin") and another piece associates smells with experiences ("Hands-on workshop
on brain structures").
(large biomolecules) that are needed for the correct functioning of the nerve cell. The
axon is a special extension of the soma that starts at the so-called axon hillock and
typically branches out similarly to the dendritic tree. The endings of the axon are called
axon terminals.
Threshold potential and action potential. The input signals for a neuron come
in as electrical signals at different positions of the dendritic tree. These individual
electrical signals flow towards the soma and change the membrane potential of the
cell. The incoming signals can increase or decrease this membrane potential, and these
changes sum up if the signals come in at approximately the same time. Now, something
very interesting happens if the membrane potential is depolarized so strongly
that it rises to a value between approximately -55 mV and -50 mV! When this so-called
threshold potential is reached, voltage-gated ion channels open and initiate
a characteristic flow of ions. The membrane potential then changes in a characteristic
manner and always shows the same curve as depicted in Fig. 3.5. This
signal moves along the axon towards its endings, the synapses. It is
called an action potential or nerve impulse, and mostly a spike (due to the spiky form
of the curve of the membrane potential change). We also say: the neuron fires.
Action potentials always show the same form. Note that this signal always
has the same form! Thus it is actually a binary signal: either the signal is present or
not. However, the firing rate, i.e., the number of spikes a neuron fires per second, can
differ greatly. It cannot be arbitrarily large, since there is a refractory period in
which the neuron cannot generate a new action potential. Since an action potential
spans a time period of about 3-4 ms, maximum firing rates of about 250-330 spikes per
second result.
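This upper bound is simply the reciprocal of the time one spike (including the refractory period) occupies; a quick check of the numbers above:

\[
\frac{1\,\text{s}}{4\,\text{ms}} = 250\ \text{spikes/s},
\qquad
\frac{1\,\text{s}}{3\,\text{ms}} \approx 333\ \text{spikes/s}
\]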
Figure 3.5: The action potential of a neuron is a characteristic change of the membrane
potential that automatically happens once the neuron's membrane potential is depolarized
beyond the threshold potential. It always shows this form! This is the signal which
neurons send to subsequent neurons.
3.4 Synapses
The connection between two neurons. Synapses are the connections between
two neurons, see Fig. 3.6. However, there is a small gap between the ending of the
axon of the sending neuron (presynaptic neuron) and the contact point of the receiving
neuron (postsynaptic neuron), e.g., some location on the dendritic tree of the
receiving neuron. This small gap is called the synaptic cleft. This is true for
most synapses, which are called chemical synapses. However, there is also another type
of synapse - the electrical synapse - which works differently.
1. The action potential reaches the end of the axon of the presynaptic neuron, the
axon terminal. It opens ion channels that allow positively charged Ca2+ ions to
enter.
2. Due to the increase of Ca2+ ions inside the axon terminal, vesicles filled with neurotransmitters
fuse with the membrane of the axon terminal. The neurotransmitters flow
out of the vesicles into the synaptic cleft.
So the electrical signal (the action potential) at the presynaptic neuron is converted
into a chemical one that crosses the synaptic cleft and then results in a new
electrical signal in the postsynaptic neuron.
Effect on the postsynaptic neuron. The effect upon the postsynaptic neuron can be
excitatory, a depolarization of the membrane that brings the potential closer to the
threshold, or inhibitory, a hyperpolarization that moves it further away. A depolarizing
change is also called an Excitatory PostSynaptic Potential (EPSP), whereas a hyperpolarizing
change is called an Inhibitory PostSynaptic Potential (IPSP). However, whether the release of
the neurotransmitters leads to an EPSP or an IPSP is not determined by the presynaptic
neuron or by the neurotransmitter, but by the type of receptor that is activated.
Figure 3.6: Structure and processes of a chemical synapse - the connection between
two neurons.
Potentiation vs. depression. If the presynaptic and the postsynaptic neuron fire at
approximately the same time, the impact of future action potentials of the presynaptic
neuron onto the postsynaptic neuron can grow, in the sense that the EPSP or
IPSP becomes stronger. This process is called potentiation. If pre- and postsynaptic
neurons do not fire simultaneously, this impact can also become weaker, which is
called depression.
The synaptic weight as data storage. The impression arises that nearly all pa-
Structural plasticity. However, the term neuronal plasticity does not only subsume
synaptic plasticity processes, but also structural changes. E.g., the so-called dendritic
spines, the locations on the postsynaptic neuron that can be contacted by the axon
terminals of presynaptic neurons, can change their size. New synapses can even be
generated (synaptogenesis) or pruned (synaptic pruning) if synapses are not
used. Axons can sprout new nerve endings.
Hebbian learning as a key idea. An early theory regarding this synaptic learning
process stems from Donald Olding Hebb. Hebb was a professor of psychology at
McGill University in Montreal, Canada. In his book "The Organization of Behavior:
A Neuropsychological Theory" [16] Hebb wrote:
This important idea of how learning at synapses could work influenced and inspired many
neuroscientists who came after Hebb. It is called Hebbian theory, Hebb's postulate
(there was no clear physiological evidence for this postulate at his time), or simply
Hebb's rule.
On page 70 Hebb formulates the idea in other words:
The general idea is an old one, that any two cells or systems of cells that
are repeatedly active at the same time will tend to become ”associated”, so
Figure 3.7: Neuronal plasticity can mean changes of the synaptic transmission
strength, but also structural changes such as new synapses or axonal sprouting.
Hebb was really good at generating postulates in 1949. He also postulated the following:
The most obvious and I believe much the most probable suggestion concern-
ing the way in which one cell could become more capable of firing another
is that synaptic knobs develop and increase the area of contact between the
afferent axon and efferent soma (”Soma” refers to dendrites and body, or
all of the cell except its axon)
(page 62 in his book [16])
As we know from the section about neuronal plasticity above, Hebb’s postulate was
correct: synapses can indeed grow and even new synapses can develop.
The synaptic weight change curve. It took some decades for neuroscientific
measurement techniques to develop far enough to prove Hebb's postulate. But today we
know that Hebb's postulate is correct. In a work from 2001 [3] the exact synaptic
weight change curve could be measured. It was shown that the relative timing of the
spikes that occur at the presynaptic neuron and the postsynaptic neuron is crucial:
if the postsynaptic neuron fires shortly after the presynaptic neuron has fired, the
synaptic strength is increased. If it is the reverse, i.e., the presynaptic neuron
fires shortly after the postsynaptic neuron has fired, the synaptic strength is
decreased. Further, if the time difference between the pre- and postsynaptic spikes is
more than ca. 80 ms, the strength of the synapse does not change.
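A minimal sketch of such a timing-dependent update rule in Python (illustrative only; the constants and the exponential shape are assumptions, not the measured curve from [3]):

import numpy as np

def stdp_weight_change(dt_ms, a_plus=0.01, a_minus=0.012, tau_ms=20.0):
    """Illustrative spike-timing-dependent weight change.

    dt_ms = t_post - t_pre in milliseconds:
      dt_ms > 0  -> post fired after pre  -> potentiation (positive change)
      dt_ms < 0  -> pre fired after post  -> depression (negative change)
      |dt_ms| > ~80 ms -> essentially no change
    """
    if abs(dt_ms) > 80.0:
        return 0.0
    if dt_ms > 0:
        return a_plus * np.exp(-dt_ms / tau_ms)
    return -a_minus * np.exp(dt_ms / tau_ms)

# Post fires 10 ms after pre: weight grows; 10 ms before: weight shrinks
print(stdp_weight_change(+10.0), stdp_weight_change(-10.0))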
Wrong intuitions are taught. In popular science sources we can often
read a wrong description of the function of a neuron:
”Neurons are a special type of cell with the sole purpose of transferring
information around the body” [20]
Such wrong statements unfortunately lead the reader's understanding in a
wrong direction ("Aha! So a neuron is a cable for signals!"). Ask yourself: if all these
86 billion neurons acted as mere cables in your brain and just transferred signals from one neuron
to the next, where would the actual computation happen?
A neuron is not just an AND gate. It seems that a neuron is very similar to a
logical AND gate, right? In the above example neuron 4 only fires if neurons 1, 2 and 3
fire roughly at the same time (see Fig. 4.1).
However, we could also imagine another set of synaptic weights such that neuron 4
already fires if only two of the neurons 1-3 spike. In this case, neuron 4 would also
detect spatial patterns other than just a "seven", and the logical function that neuron
4 realizes would no longer be a pure AND of its inputs.
But we could also imagine another setting (see Fig. 4.2) with such strong synaptic
weights between neurons 1-3 and neuron 4 that a large firing rate of only one of
these three neurons could already be sufficient to make neuron 4 spike. For this reason, it
is more appropriate to think of the neuron as a spatial pattern (or feature) detector
that accumulates evidence from different sources, where the evidence is a continuous
value (encoded by the firing rate) and the pattern detection result is encoded as a
continuous value (the firing rate) as well. In this sense, a neuron can be thought of as an
evidence detector, which signals evidence for a certain pattern if enough conditions
are met.
Figure 4.1: Neurons 1, 2 and 3 fire if certain vertical or horizontal lines are visible.
However, neuron 4 only fires if neurons 1, 2 and 3 fire roughly at the same time. Thereby
neuron 4 is able to detect a spatial pattern that corresponds to the number seven.
Figure 4.2: Another setting, in which a single input neuron among neurons 1-3 that fires
frequently can already result in a spiking neuron 4. If one of the input neurons fires only with a
low firing rate, neuron 4 will not be excited. However, if two of the input neurons fire
at a low firing rate, neuron 4 will be excited.
Figure 4.3: Neuron 4 integrates sensory information from three different sensor modalities.
In order to be sure that there is a coffee, we may need to detect
the simultaneous firing of neurons 1-3, which corresponds to evidence for the coffee
concept from the visual, smell and haptic modalities.
Associative learning. Now, some days later, you get up in the morning, arrive in
the kitchen, and for the first time your wife has made a coffee for you. You see it, but
you can also smell it and you can even touch the warm coffee cup with your hands.
The visual evidence excites neuron 4, and due to the fact that neuron 2 (encoding
the smell) and neuron 3 (encoding some haptic information) fire at the same time,
associative learning happens! The connection between neuron 2 and neuron 4 will
be strengthened, and so will the connection between neuron 3 and neuron 4, due to the
synaptic plasticity learning curve discussed before. Imagine you come to breakfast
the next morning and you do not see where the coffee cup is standing. It might be
that the synaptic strength between neuron 2 and neuron 4 is already large enough to
excite neuron 4: you cannot see the coffee, you do not hold it in your hands, but
the smell alone is enough to associate it with the concept of a coffee (neuron 4).
This is a simplistic example. Please note that this is a highly simplistic description
using only 4 neurons. Most neuroscience researchers will probably agree that the
visual information that encodes a coffee cup, the information that encodes the smell,
the sensation of holding a coffee cup in the hands, and the concept of a coffee cup
will not each be encoded by the firing of just a single neuron. Hypothetical cells
that would fire only if one specific object (person, face, etc.) is perceived are called
grandmother cells. Most neuroscientists believe that such cells do not exist. Instead,
cell assemblies (neural ensembles) are supposed to encode these concepts using many
neurons, and their characteristic firing pattern is called a population code.
Temporal coincidence detection. However, the weights from neurons 1-3 to neuron 4 could
also be set in another way, such that neuron 4 only fires if neurons 1-3 fire at the same
time. In this case, the evidence stemming from the three different sensor modalities
all needs to be present in order to conclude that there is a coffee cup. This means that
neuron 4 would act as a temporal coincidence detector.
Firing rate. The model makes an important simplification compared to the biological
neuron, which is a strong assumption: it does not model at which point
in time a neuron spikes, but only how often it spikes within some time frame, i.e., it
"only" models the firing rate of a neuron. The firing rate is modeled with the help of a
Synaptic weights. Further, the model abstracts from geometric features of the neuron:
there is no modeling of where a synapse is located or how long dendrites and axons
are. Only the synaptic transmission strength is modeled, with the help of weights
wi. Since the strength of an EPSP or an IPSP is not only the result of the synaptic
transmission strength, but also depends on the firing rate of the presynaptic neuron,
the influence of an individual presynaptic neuron i on the membrane potential is
modeled by oi * wi, i.e., the higher the firing rate of the presynaptic neuron and the
higher the synaptic weight, the larger the influence on the membrane potential of
the postsynaptic neuron. Note that synaptic weights are modeled by real numbers
and can therefore be negative as well, corresponding to inhibitory synapses. The value
act is the model for the current value of the membrane potential. Since in real neu-
rons the membrane potential results from EPSP and IPSP signals induced by several
different presynaptic neurons, act is computed as the weighted sum of the firing rates
of all sending neurons: act = \sum_{i=1}^{N} o_i w_i.
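A small NumPy illustration of this weighted sum (illustrative values, not from the book's code):

import numpy as np

o = np.array([0.2, 0.0, 0.9])   # firing rates of the presynaptic neurons
w = np.array([0.5, -1.0, 0.3])  # synaptic weights (negative = inhibitory)

act = np.sum(o * w)             # weighted sum of the inputs
# equivalently: act = np.dot(o, w)
print(act)                      # 0.2*0.5 + 0.0*(-1.0) + 0.9*0.3 = 0.37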
Transfer function. But why do we need this function f that takes the activation
value act as an argument? The transfer function is used to model two different
aspects of the biological role model: first, the fact that real neurons do not
fire until the threshold potential is reached; second, the fact that real neurons cannot
show negative or arbitrarily high firing rates. Remember that after a neuron has fired,
there is a refractory period in which the resting potential is recovered and in which
the neuron cannot fire. We said that due to the time a spike needs and due to the
refractory period, a maximum firing rate of about 300 spikes per second usually
results. For this reason, Perceptrons are often used with firing rates limited to some
interval [0, max], which can be achieved by using a transfer function f that maps
the sum of the weighted inputs - the activation act - to exactly this interval, even for
extremely large absolute values of act. Fig. 4.5 shows two transfer functions that take
these considerations into account.
and realizes a ramp with the same slope everywhere on the ramp.
It is called the step function (Heaviside function), is depicted in Fig. 4.6, and is defined by:

\[
out = f(act) =
\begin{cases}
0, & act \le T \\
1, & act > T
\end{cases}
\qquad (4.3)
\]
Note that the usage of the step transfer function means that the output of a neuron
is modeled by just a binary value: either the neuron fires (1) or it does not (0).
act = o • w (4.5)
where o = (o1, o2, ..., oN) is the vector of the firing rates of the presynaptic neurons
and w = (w1, w2, ..., wN) is the vector of the corresponding weights. The operation • is
called the dot product or scalar product; it takes two vectors of equal length (sequences
of numbers) and maps them to a single real number.
Now, this inner product is exactly what is computed as the basic operation in image
processing if you want, e.g., to blur or sharpen an image or to detect edges. The weights
are then written in a two-dimensional arrangement and are called a kernel, filter
kernel or convolution matrix K. The pixel values of the image region that is currently
processed are also written in a two-dimensional arrangement I. Again, the result of
this so-called convolution is a real number, which is then used as a new pixel value.
For this reason it is also reasonable to compare the function of a neuron with a technical
filter, see Fig. 4.8.
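A small NumPy sketch of this correspondence (illustrative kernel and image patch, not from the book's code): the filter response at one pixel is just the dot product of the flattened kernel with the flattened image region.

import numpy as np

# A 3x3 image region I and a 3x3 filter kernel K (here a symmetric
# averaging/blur kernel, so kernel flipping does not matter)
I = np.array([[10, 20, 30],
              [40, 50, 60],
              [70, 80, 90]], dtype=float)
K = np.full((3, 3), 1.0 / 9.0)

# Filter response at the center pixel: element-wise product, then sum ...
response = np.sum(I * K)

# ... which equals a dot product of the flattened arrays,
# i.e., the same operation a Perceptron performs with o and w
assert np.isclose(response, np.dot(I.flatten(), K.flatten()))
print(response)   # 50.0 = mean of the 3x3 region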
Figure 4.5: Two transfer functions that make sure that 1. the activation act of a
neuron is never mapped to a negative firing rate and 2. the firing rate is limited to
some maximum value.
Figure 4.6: The step function maps all activations act of a neuron either to a zero
firing rate or a firing rate of 1.
1 # ---
2 # Python code to generate the logistic transfer function plot
3 # ---
4
5 import numpy as np
6 import matplotlib.pyplot as plt
7
8 # Define the logistic transfer function
9 def f(x,gamma):
10 return 1.0 / (1.0 + np.exp(-gamma*x))
11
12 # Prepare a vector of x-values
13 x = np.arange(-4.0, 4.0, 0.01)
14
15 # Set title of the plot
16 fig = plt.figure()
17 fig.suptitle(’Logistic transfer function’, fontsize=20)
18
19 # Set y-range to display, use a grid, set label for axes
20 plt.ylim(-0.25, 1.25)
21 plt.grid(True)
22 plt.xlabel(’act’, fontsize=14)
23 plt.ylabel(’out = f(act)’, fontsize=14)
24
25 # Plot 3 versions of the logistic transfer function
26 # using different values for gamma
27 plt.plot(x, [f(b,1.0) for b in x], ’r’)
28 plt.plot(x, [f(b,2.0) for b in x], ’b’)
29 plt.plot(x, [f(b,10.0) for b in x], ’g’)
30
31 # Show arrows with annotation text which gamma value
32 # was used for which graph
33 plt.annotate(’gamma = 1.0’, xy=(1, 0.72), xytext=(2, 0.5),
34 arrowprops=dict(facecolor=’red’, shrink=0.01))
35
36 plt.annotate(’gamma = 2.0’, xy=(0.5, 0.72), xytext=(1.5, 0.3),
37 arrowprops=dict(facecolor=’blue’, shrink=0.01))
38
39 plt.annotate(’gamma = 10.0’, xy=(0.1, 0.8), xytext=(-3, 1.0),
40 arrowprops=dict(facecolor=’green’, shrink=0.01))
41
42 # Generate image file
43 fig.savefig(’tf_logistic.png’)
44
45 # Show the plot also on the screen
46 plt.show()
47
Figure 4.7: Python code I used to generate the plot of the logistic transfer function.
Figure 4.8: A neuron can also be considered as performing a filtering operation known
from image processing. In this view, the synaptic weights correspond to the values of
the filter kernel matrix while the firing rates of the sending neurons correspond to the
pixel values of the image region that is filtered.
Non-spiking neuron models / firing rate models. On this spectrum, the Perceptron
neuron model can be considered a very abstract model. It is a so-called
non-spiking neuron model or firing rate model: it does not simulate the points in time at
which a neuron spikes, but its average spike rate within some time window, which
is called its firing rate. However, the Perceptron is not the only non-spiking
neuron model. As you will see in a later chapter, a simple neural network
model called the Self-Organizing Map does not use Perceptrons, but "neurons"
which compute a distance between their input (input vector) and a stored
vector.
Spiking neuron models. If the points in time at which neurons spike are also modeled, we
are in the domain of so-called spiking neuron models. However, the models in this domain
further differ regarding the question whether neurons are simply modeled as points or
whether geometrical aspects (size, detailed 2D or 3D structure of dendrites and axon)
are modeled as well. In the first case, we are talking about point-based models or single-
compartment models. These models ignore the morphological structure of neurons. In
the second case, we are talking about multi-compartment models, which are considered
among the most detailed models of biological neurons.
mentioned above, a biologically plausible form of the action potential is achieved automatically
with the help of a set of differential equations, the form of an action potential
cannot be simulated with this model. Instead, if the threshold potential is reached, a
"spike" is said to happen and the simulated membrane potential is reset to the resting
potential; the summation of incoming signals then starts again from this reset resting
potential.
Temporal coding. In contrast, the temporal coding idea suggests that the exact points in
time, or at least the time between subsequent action potentials - the Inter-Spike Interval
(ISI) or Inter-Pulse Interval (IPI) - is not a result of irregularities, but is important
and encodes information as well! So according to temporal coding the two spike trains
T1 = 1000111001 and T2 = 0101101100 (1 = spike, 0 = no spike) would encode different
information, while according to rate coding they would carry the same information,
since both spike trains contain 5 spikes per interval, i.e., show the same firing rate.
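A tiny NumPy illustration of the difference, using the two spike trains from the text (the 1 ms bin width is an assumption made only for this example):

import numpy as np

T1 = np.array([1, 0, 0, 0, 1, 1, 1, 0, 0, 1])
T2 = np.array([0, 1, 0, 1, 1, 0, 1, 1, 0, 0])

# Rate code: both trains contain the same number of spikes per interval
print(T1.sum(), T2.sum())          # 5 5  -> identical firing rate

# Temporal code: the inter-spike intervals (here in 1 ms bins) differ
isi1 = np.diff(np.flatnonzero(T1))
isi2 = np.diff(np.flatnonzero(T2))
print(isi1, isi2)                  # [4 1 1 3] vs. [2 1 2 1]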
Which code is used? Since in motor neurons, e.g., the strength at which an innervated
muscle is flexed depends solely on the firing rate, we can say that rate codes
are used in the brain. But are temporal codes used as well? Yes! E.g., flies have so-called
H1 neurons, which are located in their visual system. These neurons respond to
horizontal motions in the visual field of the fly and help to stabilize flight by initiating
motor corrections. Interestingly, it was possible to reconstruct the movements
seen by the fly using measurements of just the inter-spike intervals. This can be seen
as proof that temporal coding is used in brains as well.
Figure 4.9: While in Deep Learning mainly Perceptron neuron models are used, many
more neuron models are used in neuroscience which are also biologically more plausible.
However, this ”more” of plausibility comes with higher simulation costs due to the
higher complexity of the models.
Figure 4.10: Information is encoded by neurons using rate codes and temporal codes.
5 The Perceptron
by mechanical operations)!
Figure 5.1: The "Mark 1 Perceptron" was a first neuro-computer that was able to
classify images of 26 different letters of different typefaces with a correct classification
rate of 79% (see [15], p. 1-3). And this already at the end of the 1950s! The "Mark 1
Perceptron" contained 512 association units and 8 response units, which allowed for
2^8 = 256 different output patterns.
The response units are two state-devices which emit one output if their
inputs is positive and a different output if their input is negative. [15]
So the neuron model corresponding to the "Response units" is a neuron that computes its activity as the weighted sum of its inputs and then maps this activity to one of two output values using a threshold function.
Case 1: Output is correct. If out = 1 and t = 1 (or out = 0 and t = 0), the weights seem to be fine, so we should make sure that they are not changed.
Figure 5.2: The "bias trick". In order to let a learning algorithm learn not only the weights w_i but also the threshold T of the ramp function, we introduce a new weight w_0 and a new input o_0 which is always 1. This new input is called a "bias".

The weights are then adapted according to

w_i = w_i + \Delta w_i   with   \Delta w_i = \alpha \, (t - out) \, o_i

where \alpha is the learn rate, t the desired (teacher) output and out the actual output; this is exactly what the code below computes as learn_rate * neuron_error * x_i.
Fast code walk-through. The code below starts by reading in the MNIST data using the functionality of TensorFlow. It then displays some of the 784-dimensional MNIST input vectors by reshaping them to 2D matrices of size 28x28. The actual generation and training of the Perceptron can be found in the function generate_and_train_perceptron_classifier. Here a 2D matrix of size 785x10 is generated. Why 785x10? The 28x28 input images, considered as 1D input vectors, have length 784. But remember that we need an additional bias input which is always "1". Therefore the input vectors have length 785. The 10 is the number of output neurons: we use one output neuron for each of the 10 classes (digits "0"-"9") present in this classification problem, encoded as one-hot output vectors. Can you find the actual computation of the output values of these 10 neurons? It can be found in lines 139 and 140. Line 139 computes the activity for each of the 10 neurons. Then in line 140 the activations are mapped using the ramp transfer function.
Figure 5.3: The Perceptron learning rule is a straightforward approach to adjust the
weights into the correct direction by comparing the desired output t with the actual
output out. Note that the inputs oi are assumed to be positive.
Figure 5.4: The computations in a Perceptron can be modeled as a matrix multiplication of the input vector and the weight matrix, followed by applying a transfer function f element-wise to the resulting activation vector.
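Before looking at the full program, here is a minimal NumPy sketch of just these shapes, with random values instead of MNIST data: a (1, 785) input row vector (784 pixels plus one bias input) times a (785, 10) weight matrix yields a (1, 10) activation vector, to which the ramp function is applied element-wise.

import numpy as np

f = np.vectorize(lambda x: 0 if x <= 0 else 1)   # ramp transfer function

input_vec = np.append(np.random.rand(784), 1.0).reshape(1, 785)  # 784 pixels + bias
weights = np.random.rand(785, 10)

act = np.matmul(input_vec, weights)   # activities of the 10 output neurons
out = f(act)                          # element-wise ramp -> binary output vector
print(act.shape, out.shape)           # (1, 10) (1, 10)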
16 import numpy as np
17
18
19 ’’’
20 The ramp transfer function
21 ’’’
22 def f(x):
23 if (x<=0):
24 return 0
25 else:
26 return 1
27
28 f = np.vectorize(f)
29
30
31 ’’’
32 Download & unpack the MNIST data
33 Also prepare direct access to data matrices:
34 x_train, y_train, x_test, y_test
35 ’’’
36 def read_mnist_data():
37
38 # 1. download and read data
39 mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
40
41 # 2. show data type of the mnist object
42 print("type of mnist is ", type(mnist))
43
44 # 3. show number of training and test examples
45 print("There are ", mnist.train.num_examples,
46 " training examples available.")
47 print("There are ", mnist.test.num_examples,
48 " test examples available.")
49
50 # 4. prepare matrices (numpy.ndarrays) to
51 # access the training / test images and labels
52 x_train = mnist.train.images
53 y_train = mnist.train.labels
54 x_test = mnist.test.images
55 y_test = mnist.test.labels
56 print("type of x_train is", type(x_train))
57 print("x_train: ", x_train.shape)
58 print("y_train: ", y_train.shape)
59 print("x_test: ", x_test.shape)
60 print("y_test: ", y_test.shape)
61
62 return x_train,y_train,x_test,y_test
63
64
65
66 ’’’
67 This function will show n random example
68 images from the training set, visualized using
69 OpenCVs imshow() function
70 ’’’
71 def show_some_mnist_images(n, x_train, y_train):
72
73 nr_train_examples = x_train.shape[0]
74
75 for i in range(0,n):
76
77 # 1. pick a random index between 0 and 55,000-1
78 rnd_number = randint(0, nr_train_examples - 1)
79
80 # 2. get corresponding output vector
81 correct_out_vec = y_train[rnd_number,:]
82 print("Here is example MNIST image #",i,
83 " It is a: ", np.argmax(correct_out_vec))
84
85 # 3. get first row of 28x28 pixels = 784 values
86 row_vec = x_train[rnd_number, :]
87 print("type of row_vec is ", type(row_vec))
88 print("shape of row_vec is ", row_vec.shape)
89
90 # 4. reshape 784 dimensional vector to 28x28
91 # pixel matrix M
92 M = row_vec.reshape(28, 28)
93
94 # 5. resize image
95 M = cv2.resize(M, None, fx=10, fy=10,
96 interpolation=cv2.INTER_CUBIC)
97
98 # 6. show that matrix using OpenCV
99 cv2.imshow('image', M)
100
101 # wait for a key
102 c = cv2.waitKey(0)
103
104 cv2.destroyAllWindows()
105
106
107 ’’’
108 Generate a weight matrix of dimension
109 (nr-of-inputs, nr-of-outputs)
110 and train the weights according to the
111 Perceptron learning rule using random sample
112 patterns <input,desired output> from the MNIST
113 training dataset
114 ’’’
115 def generate_and_train_perceptron_classifier\
116 (nr_train_steps,x_train,y_train):
117
118 nr_train_examples = x_train.shape[0]
119
120 # 1. generate Perceptron with random weights
121 weights = np.random.rand(785, 10)
122
123 # 2. do the desired number of training steps
124 for train_step in range(0, nr_train_steps):
125
126 # 2.1 show that we are alive from time to time ...
127 if (train_step % 100 == 0):
128 print("train_step = ", train_step)
129
130 # 2.2 choose a random image
131 rnd_number = randint(0, nr_train_examples - 1)
132 input_vec = x_train[rnd_number, :]
133 # add bias input "1"
134 input_vec = np.append(input_vec, [1])
135 input_vec = input_vec.reshape(1, 785)
136
137 # 2.3 compute Perceptron output.
138 # Should have dimensions 1x10
139 act = np.matmul(input_vec, weights)
140 out_mat = f(act)
141
142 # 2.4 compute difference vector
143 teacher_out_mat = y_train[rnd_number, :]
144 teacher_out_mat = teacher_out_mat.reshape(1, 10)
145 diff_mat = teacher_out_mat - out_mat
146
147 # 2.5 correct weights
148 learn_rate = 0.01
149 for neuron_nr in range(0, 10):
150
151 # 2.5.1 get neuron error
152 neuron_error = diff_mat[0, neuron_nr]
153
154 # 2.5.2 for all weights to the current
155 # neuron <neuron_nr>
156 for weight_nr in range(0, 785):
157
158 # get input_value x_i
159 x_i = input_vec[0, weight_nr]
160
161 # compute weight change
162 delta_w_i = learn_rate * neuron_error * x_i
163
164 # add weight change to current weight
165 weights[weight_nr, neuron_nr] += delta_w_i
166
167
168 # 3. learning has finished.
169 # Return the result: the 785x10 weight matrix
170 return weights
171
172
173
174 ’’’
175 Now test how good the Perceptron can classify
176 on data never seen before, i.e., the test data
177 ’’’
178 def test_perceptron(weights, x_test, y_test):
179
180 nr_test_examples = x_test.shape[0]
181
182 # 1. initialize counters
183 nr_correct = 0
184 nr_wrong = 0
185
186 # 2. forward all test patterns,
187 # then compare predicted label with ground
188 # truth label and check whether the prediction
189 # is right or not
190 for test_vec_nr in range(0, nr_test_examples):
191
192 # 2.1 get the test vector
193 input_vec = x_test[test_vec_nr, :]
194 # add bias input "1"
195 input_vec = np.append(input_vec, [1])
196 input_vec = input_vec.reshape(1, 785)
197
198 # 2.2 get the desired output vector
199 teacher_out_mat = y_test[test_vec_nr, :]
200 teacher_out_mat = teacher_out_mat.reshape(1, 10)
201 teacher_class = np.argmax(teacher_out_mat)
202
203 # 2.3 compute the actual output of the Perceptron
204 act = np.matmul(input_vec, weights)
205 out_mat = f(act)
206 actual_class = np.argmax(out_mat)
207
208 # 2.4 is desired class and actual class the same?
209 if (teacher_class == actual_class):
210 nr_correct += 1
211 else:
212 nr_wrong += 1
213
214 # 3. return the test results
215 correct_rate =\
y = mx + b    (5.6)

This is the well-known standard form of a straight line, described by a slope m and the y-intercept b. x and y are the free variables here, while m and b are the line parameters.
Figure 5.5: The coordinate form of a straight line uses three parameters to describe
the line. It can be transformed into the well-known standard form of a straight line
which describes a line using a slope and a y-intercept.
The coordinate form of a straight line is

ax + by = c    (5.7)

where a, b, c are the line parameters (a and b must not both be zero) and x, y are the free variables. The straight line is defined here by all points (x, y) that satisfy this equation.

Assume that b ≠ 0. Then we can transform the coordinate form into the standard form:

ax + by = c \;\Leftrightarrow\; by = -ax + c \;\Leftrightarrow\; y = -\frac{a}{b}x + \frac{c}{b}    (5.8)

Thus we get the standard form from the coordinate form, with a slope corresponding to -\frac{a}{b} and a y-intercept corresponding to \frac{c}{b}.
Linear separability. Two sets A, B ⊂ R^n are called linearly separable if there exist n + 1 real numbers w_1, ..., w_n, w_{n+1} such that every point (a_1, ..., a_n) ∈ A satisfies

\sum_{i=1}^{n} a_i w_i \le w_{n+1}    (5.9)

and every point (b_1, ..., b_n) ∈ B satisfies

\sum_{i=1}^{n} b_i w_i > w_{n+1}    (5.10)

The set of all points x = (x_1, ..., x_n) ∈ R^n for which the equation

\sum_{i=1}^{n} x_i w_i = w_{n+1}    (5.11)

holds is called a separating hyperplane.
In 2D two sets A, B are linearly separable if we can draw a line between the points
of the two sets, such that all points of A are on one side of the line and all points
from B are on the other side. In math, the two sides are called ”half-spaces”. Each
hyperplane divides the corresponding space into two half-spaces.
Figure 5.6: Two sets of points (set with ”-” points and set with ”+” points) that
can be linearly separated using a straight line. The straight line is described here in
coordinate form.
Consider now a Perceptron with only two inputs o_1 and o_2. The Perceptron computes its activation act and then compares the activation with the threshold T:

act = o_1 w_1 + o_2 w_2 > T ?  \rightarrow  no \Rightarrow out = 0,  yes \Rightarrow out = 1    (5.12)
So the decision boundary of a perceptron corresponds to the coordinate form of a
straight line - or according to the terminology above to a hyperplane. This means,
every such Perceptron with two inputs will output a 0 for all points (o1 , o2 ) ∈ R2 with
act = o1 w1 + o2 w2 ≤ T and will output a 1 for all points with act = o1 w1 + o2 w2 > T .
So it will always classify points with act ≤ T to belong to ”class 0” and points with
act > T to belong to ”class 1”. And therefore it depends on the distribution of the
Figure 5.7: A Perceptron with two inputs can only classify two sets of 2D points if
they are linearly separable!
points which we want to classify whether we have a chance to learn to classify them
with a Perceptron or not. If they are not linearly separable, we have no chance to find
weights w1 , w2 and w3 := T , that will produce the right classification outputs.
In their famous book "Perceptrons: An Introduction to Computational Geometry" [32] from 1969, Marvin Minsky and Seymour Papert showed that a Perceptron is not able to learn weights such that it correctly classifies a dataset whose data points show an XOR-like distribution. See Fig. 5.8.
Figure 5.8: Try to find a straight line in each diagram that separates the "+"s and "-"s, or accept that these are examples of non-linearly separable sets.
Figure 5.9: A Perceptron cannot represent the XOR function since it can only represent linearly separable datasets correctly. However, by using an additional layer of Perceptron neurons ("hidden" neurons), we can find weights such that all inputs x1, x2 are correctly classified by this Multi-Layer Perceptron: out3 = x1 ⊕ x2
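As a small illustration of this idea, here is one possible set of hand-picked weights for such a two-layer solution in Python. The concrete weights are my own choice and not necessarily the ones depicted in Fig. 5.9; any weights with the same logical effect would work.

def step(z):
    # threshold / ramp transfer function with threshold 0
    return 1 if z > 0 else 0

def xor_mlp(x1, x2):
    h1 = step(1.0 * x1 - 1.0 * x2 - 0.5)    # hidden neuron 1: "x1 and not x2"
    h2 = step(-1.0 * x1 + 1.0 * x2 - 0.5)   # hidden neuron 2: "x2 and not x1"
    return step(1.0 * h1 + 1.0 * h2 - 0.5)  # output neuron: "h1 or h2"

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", xor_mlp(x1, x2))    # reproduces x1 XOR x2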
6
Self-Organizing Maps
The Perceptron learning algorithm realizes a form of learning which is called error-
correction learning and falls into the category of supervised learning. We will now
consider another neural network model which falls into the category of unsupervised
learning and reveals another possible form of learning, namely competitive learning.
The Finnish professor Teuvo Kohonen invented the Self-Organizing Map (SOM) in 1982 [22]. My diploma thesis supervisor once sat in a session with Teuvo Kohonen. Kohonen said: "Please do not call it 'Kohonen map'! Call it 'Self-Organizing Map'." For this reason, we will avoid the name "Kohonen map" in the following.
What can we do with a SOM? Clustering! Even if we just have input vectors
we can learn a lot of useful things from this data. One important goal is, e.g., to
identify clusters within the data: Where in the input space can we find accumulations
/ agglomerations of input vectors? And how many of such agglomeration areas (called
clusters) are there? There are hundreds of different clustering algorithms available and
a SOM is an important neural network model to perform clustering. Identifying clus-
ters in your input data, describing them using cluster centers (also called centroids)
and their spatial extension in the input space allows you to get another view onto your
data.
Simple neural structure. The Perceptron was introduced first in this book since it is the simplest neural network structure we can think of: one layer of neurons which work independently from each other. The SOM has a similarly simple structure. However, here the neurons are not independent from each other: each neuron is connected to some neighboring neurons. These connections are usually chosen in some regular form; mostly a 2D neighborhood relationship is used. So a neuron is connected with its neighbors, but not with all other neurons. However, each neuron is connected to the input vector (see Fig. 6.1).
Neuron model used in a SOM. The neuron model used in a SOM is very different from the Perceptron neuron model. As said before, each neuron is connected to the input vector v. A SOM neuron stores a weight vector w. If a new input vector is presented, each SOM neuron computes the distance between its weight vector w and the input vector v. Usually the Euclidean distance measure is used:
d(v, w) = \|v - w\|_2 = \sqrt{\sum_{i=1}^{N} (v_i - w_i)^2}    (6.1)
Only the Best Matching Unit and its neighbors adapt. For a given input vector and all the computed neuron output values, the neuron which has the smallest distance / output value is called the Best Matching Unit (BMU). This neuron is allowed to adapt slightly into the direction of the input vector. And now the neighborhood relationship comes into play: not only the BMU, but also its neighbors are allowed to adapt slightly into the direction of the input vector, while all other neurons keep their current weight vectors.
Learning formula. We can put this adaptation scheme into a formula. Let us assume the index of the BMU neuron is a. Then the weight vector w_b of each neuron with index b in the SOM network is adapted according to

w_b \leftarrow w_b + \alpha \cdot h(a, b) \cdot (v - w_b)

where \alpha is the learn rate and h(a, b) is 1 if neuron b is the BMU a itself or one of its neighbors and 0 otherwise; this is the simple variant implemented in the code below, while more general variants let h(a, b) decay smoothly with the grid distance between a and b.
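The following is a minimal NumPy sketch of one such adaptation step. The function and variable names are my own and not those of the som.py module shown later in this chapter; the neighborhood is simply the set of direct grid neighbors of the BMU.

import numpy as np

def som_train_step(weights, neighbors_of, input_vec, learn_rate):
    # 1. find the Best Matching Unit: smallest Euclidean distance to the input
    dists = np.linalg.norm(weights - input_vec, axis=1)
    bmu = int(np.argmin(dists))
    # 2. adapt the BMU and its direct neighbors a little bit towards the input
    for idx in [bmu] + list(neighbors_of(bmu)):
        weights[idx] += learn_rate * (input_vec - weights[idx])
    return bmu

# toy example: 3x3 grid of neurons with 2D weight vectors
S = 3
weights = np.random.rand(S * S, 2)

def neighbors_of(i):
    # up/down/left/right neighbors on the S x S grid
    y, x = divmod(i, S)
    candidates = [(y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)]
    return [yy * S + xx for yy, xx in candidates if 0 <= yy < S and 0 <= xx < S]

bmu = som_train_step(weights, neighbors_of, np.array([0.5, 0.5]), learn_rate=0.1)
print("BMU was neuron", bmu)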
SOMs can be used for supervised learning as well. Typically SOMs are presented as a purely unsupervised machine learning algorithm. But we can augment the SOM neurons with class information as well. Imagine we have not only the input vectors, but for each input vector also some class information, e.g.: this input vector encodes a horse (or a cat / a dog). Then we can augment the SOM neuron model by class counter vectors and not only adapt the BMU into the direction of the input vector, but also increment the counter value of the current class in the BMU's class counter vector. After learning we then have not only a map of the input data, but also, encoded in the class counter vectors, the information which class each neuron most likely corresponds to. If a new input vector is presented, we can compute the BMU and look up which class it most likely stands for, i.e., perform classification.
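A minimal sketch of this class counter idea (again with my own names, and adapting only the BMU for brevity) could look as follows:

import numpy as np

NR_NEURONS, DIM, NR_CLASSES = 9, 2, 3
weights = np.random.rand(NR_NEURONS, DIM)
class_counters = np.zeros((NR_NEURONS, NR_CLASSES), dtype=int)

def bmu(input_vec):
    # Best Matching Unit = neuron with the smallest distance to the input
    return int(np.argmin(np.linalg.norm(weights - input_vec, axis=1)))

def train_labeled(input_vec, class_label, learn_rate=0.1):
    b = bmu(input_vec)
    weights[b] += learn_rate * (input_vec - weights[b])  # unsupervised SOM adaptation
    class_counters[b, class_label] += 1                  # supervised augmentation

def classify(input_vec):
    # look up which class the BMU has seen most often
    return int(np.argmax(class_counters[bmu(input_vec)]))

train_labeled(np.array([0.1, 0.2]), class_label=2)
print("predicted class:", classify(np.array([0.1, 0.2])))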
Figure 6.1: A Self-Organizing Map (SOM) consists of a set of neurons with a neigh-
borhood relationship that defines whether two neurons are neighbored or not. The
SOM is trained unsupervised: input vectors are presented, then the Best Matching
Unit (BMU) is determined. The BMU and its neighbors adapt a little bit into the
direction of the input vector.
Figure 6.2: A Self-Organizing Map (SOM) unfolds to data. The map is shown af-
ter adaptation steps 1,100,500,1000,2000,5000. In this example the map consists of
7x7=49 neurons (red circles) with a 2D topology depicted by the orange lines. The
gray circles represent data samples. In each iteration one sample is drawn from the
set of samples and the SOM adapts to this new (here: 2D) input vector.
• som test.py: Generates the data samples and the SOM. The SOM is then fed in each step with a new data sample and adapts to the input data.
• som.py: Defines the SOM class. The most important methods here are
get neighbors() which defines a 2D topology / neighborhood relationship
between the SOM neurons and train() which adapts the Best Matching Unit
(BMU) and its neighbors into the direction of the input vector.
• som neuron.py: The SOM neuron class for generating single SOM neuron in-
stances. The method compute output() defines how the output of a SOM
neuron is computed given its current weight vector and an input vector. The
method adapt to vector() implements the adaptation of the neuron’s weight
vector into the direction of the input vector.
98
99
100 # 3.6 visualize positions of neurons
101 for i in range(NR_NEURONS):
102
103 # get the neurons weight vector and
104 # convert it to a tuple
105 neuron_coord =\
106 tuple( (my_som.list_neurons[i].weight_vec).
107 astype(int) )
108 cv2.circle(img, neuron_coord,
109 RADIUS_NEURONS, COLOR_NEURON, 2)
110
111
112 # 3.7 visualize neighborhood relationship of neurons
113 # by drawing a line between each two neighbored
114 # neurons
115 for i in range(NR_NEURONS):
116
117 # prepare the neuron’s coordinates as a tuple
118 # (for drawing coords)
119 neuron_i_coord =\
120 tuple((my_som.list_neurons[i].weight_vec).
121 astype(int))
122
123 # now get a list of all neighbors of this neuron
124 neighbors = my_som.get_neighbors(i)
125 # print("Neighbors of neuron ",i," are: ", neighbors)
126
127 # for all neighbors of this neuron:
128 for j in neighbors:
129
130 # prepare the neuron’s coordinates as a tuple
131 neuron_j_coord = \
132 tuple((my_som.list_neurons[j].weight_vec).
133 astype(int))
134
135 # draw a line between neuron i and
136 # its neighbored neuron j
137 cv2.line(img, neuron_i_coord, neuron_j_coord,
138 COLOR_NEIGHBORHOOD, 1)
139
140 # 3.8 show how many steps we have already trained
141 font = cv2.FONT_HERSHEY_SIMPLEX
142 cv2.putText(img,
143 str(my_som.nr_steps_trained).zfill(4),
144 (WIDTH-50, 20), font, 0.5, (0,0,0), 1,
145 cv2.LINE_AA)
146
147
48 S = int(np.sqrt(nr_neurons))
49 self.neighborhood = \
50 np.arange(nr_neurons).reshape(S,S)
51
52 print("Neuron neighborhood:\n", self.neighborhood)
53
54
55 """
56 Initializes the neuron positions to
57 the specified rectangle
58 """
59 def initialize_neuron_weights_to_grid(self, rectangle):
60
61 S = int(np.sqrt(self.nr_neurons))
62
63 orig_x = rectangle[0]
64 orig_y = rectangle[1]
65 width = rectangle[2]
66 height = rectangle[3]
67
68 for id in range(self.nr_neurons):
69
70 # get the next neuron
71 neuron = self.list_neurons[id]
72
73 # compute a 2D coordinate in input space
74 # to initialize the weight vector with this
75 # 2D coordinate
76 grid_y = int(id / S)
77 grid_x = id % S
78 ispace_x = orig_x + grid_x * (width / S)
79 ispace_y = orig_y + grid_y * (height / S)
80
81 # store that coordinates
82 neuron.weight_vec[0] = ispace_x
83 neuron.weight_vec[1] = ispace_y
84
85
86 """
87 Initializes the neuron positions to origin
88 """
89 def initialize_neuron_weights_to_origin(self):
90
91 for id in range(self.nr_neurons):
92
93 # get the next neuron
94 neuron = self.list_neurons[id]
95
96 # store that coordinates
97 neuron.weight_vec[0] = 0
98 neuron.weight_vec[1] = 0
99
100
101 """
102 Returns all the neighbors of a given neuron
103 Example:
104 2D Neuron neighborhood of 49 neurons arranged in a 7x7 grid:
105 [[ 0 1 2 3 4 5 6]
106 [ 7 8 9 10 11 12 13]
107 [14 15 16 17 18 19 20]
108 [21 22 23 24 25 26 27]
109 [28 29 30 31 32 33 34]
110 [35 36 37 38 39 40 41]
111 [42 43 44 45 46 47 48]]
112 """
113 def get_neighbors(self, id):
114
115 N = self.nr_neurons
116 S = int(np.sqrt(N))
117
118 # case #1: corner?
119
120 # top left corner
121 if id==0:
122 return [1,S]
123
124 # top right corner
125 if id==S-1:
126 return [S-2,2*S-1]
127
128 # bottom left corner:
129 if id==N-S:
130 return [N-S-S, N-S+1]
131
132 # bottom right corner:
133 if id==N-1:
134 return [N-1-S, N-1-1]
135
136
137 # case #2: border?
138 y = int(id / S)
139 x = id % S
140
141 # top border
142 if (y==0):
143 return [id-1,id+1,id+S]
144
145 # bottom border
146 if (y==S-1):
147 return [id-1,id+1,id-S]
148
149 # left border
150 if (x==0):
151 return [id-S,id+S,id+1]
152
153 # right border
154 if (x==S-1):
155 return [id-S,id+S,id-1]
156
157
158 # case #3: normal cell?
159 return [id-S,id-1,id+1,id+S]
160
161
162
163
164 """
165 Train the SOM with one more training vector,
166 i.e.,
167 - determine the Best Matching Unit (BMU)
168 - adapt the BMU and its neighbored neurons
169 into the direction of the input vector
170 """
171 def train(self, input_vec, learn_rate, adapt_neighbors):
172
173 self.nr_steps_trained += 1
174
175 if (self.nr_steps_trained % 1000 == 0):
176 print("SOM has been trained for",
177 self.nr_steps_trained, "steps.")
178
179 # 1. let all neurons compute their output values
180 for neuron_nr in range(self.nr_neurons):
181
182 # get the next neuron
183 neuron = self.list_neurons[neuron_nr]
184
185 # compute new output value of neuron
186 neuron.compute_output(input_vec)
187
188
189 # 2. now determine the Best Matching Unit (BMU),
190 # i.e., the neuron with the smallest output
191 # value (each neuron computes the distance of
192 # its weight vector to the input vector)
193 BMU_nr = 0
194 minimum_dist = self.list_neurons[0].output
195 for neuron_nr in range(1,self.nr_neurons):
196
197 if (self.list_neurons[neuron_nr].output < minimum_dist):
47
48 # compute final 2D sample coordinates
49 x = center_x + rnd_offset_x
50 y = center_y + rnd_offset_y
51
52 # is the sample within the image dimension?
53 if (x<0): x=0
54 if (y<0): y=0
55 if (x>self.img_width) : x = self.img_width
56 if (y>self.img_height): y = self.img_height
57
58 # store the sample coordinate in the list
59 data_samples.append( np.array([x,y]) )
60
61 return data_samples
Motor homunculi. Directly beside the primary somatosensory cortex we find the primary motor cortex, which works together with other motor areas in order to plan and execute movements. Similar to the primary somatosensory cortex, we find a distorted somatotopic map of body parts where the amount of cortex devoted to a body part is not proportional to the absolute size of this part, but to the density of cutaneous receptors on the body part. Cutaneous receptors allow us to measure, e.g., how much the skin is stretched (Ruffini's end organ), changes in texture (Meissner's corpuscle), pain (nociceptors) and temperature (thermoceptors). The density of these receptors is an indicator of how precise the movements required of that body part are. E.g., the human hands and the face have a much larger representation than the legs. So beside the two sensory homunculi we can find two motor homunculi.
Figure 6.3: In the primary somatosensory cortex, tactile information is processed such that neighboring body regions are processed in neighboring cortex areas as well. However, the area responsible for a certain body part does not correspond to the size of the body part, but is proportional to the density of tactile receptors present in this body part.
Changes of cortical maps due to training. Many experiments show that training of specific body parts can change the form of the sensory and motor homunculi. E.g., it was shown that two hours of daily piano practice for the fingers already results in a perceptible change of the motor homunculus. Similarly, 40 hours of golf training [2] or three months of juggling training increase the gray matter in task-relevant cortical areas [8].
Neocortex

A layered structure. The neocortex (also called isocortex) is a 2-4 mm thick part of the brain that is involved in "higher" cognitive functions such as perception, cognition and the generation of motor commands. It is widely accepted that it can be divided into six layers, which are labeled with Roman numerals from I to VI, where I is the outermost and VI is the innermost layer. Interestingly, it has a smooth surface in rodent brains, whereas in primates the surface resembles a walnut with deep grooves (called sulci) and ridges (called gyri). A popular hypothesis is that this wavy surface allows the surface area of the neocortex to be increased. How can we say that it consists of six layers? The layers can be discriminated visually due to different neuronal cell types and different connection patterns. However, there are some exceptions from this typical six-layer structure; e.g., in the primary sensory cortex layer V is very small or even not present. There is a regular connection pattern: e.g., neurons in layer IV receive the majority of their inputs from outside the cortex, mostly from the thalamus - a brain region that is said to play an important role in deciding which information will be sent to the cortex. Neurons in layer IV then send their outputs to neurons in other layers. Pyramidal neurons in the "upper layers" II and III project to other areas of the neocortex, while neurons from layers V and VI mostly project out of the cortex to brain structures such as the thalamus, the brain stem and the spinal cord.
Neocortex vs. Cortex. Note that the terms neocortex and (cerebral) cortex do not
exactly mean the same. The neocortex is the ”newest” part of the cortex from an
evolutionary view point, while the other older parts of the cortex are called allocortex.
Cortical minicolumns. The layered structure and the regular connection pattern found in the neocortex are a highly interesting observation since they raise the question: what is the fundamental algorithm or information flow that happens in this important brain structure? This question has not yet been answered. However, another important observation has been made: the cortex is structured into minicolumns and macrocolumns. A cortical minicolumn (or cortical microcolumn) is a vertical column of ca. 80-120 neurons (some authors estimate the number of neurons at 50-100) with a column diameter of ca. 50 µm. About 200,000,000 minicolumns can be found in the human brain. Vernon Mountcastle first described the minicolumn and suggested that it could be an elementary unit of information processing [33].
An open discussion. It is still unclear whether the cortical columns are just a by-product of development or whether they have a function in information processing. On the one hand, some authors do not see the necessity that the columnar structures serve a purpose. In their paper "The cortical column: a structure without a function", Horton and Adams [17] say:
”Equally dubious is the concept that minicolumns are basic modular units
of the adult cortex, rather than simply remnants of fetal development.”
On the other hand, other authors have developed theories of neocortical information processing around this columnar organization. E.g., Rinkus [39], in his paper "A cortical sparse distributed coding model linking mini- and macrocolumn-scale functionality", builds such a theory around the mini- and macrocolumn structure.
Another prominent example of a theory built around the idea of columnar organization
is the Hierarchical Temporal Memory model by Jeff Hawkins which allows to store,
infer and recall sequences in an unsupervised fashion, i.e., using unlabeled data.
Figure 6.4: The Neocortex is a six-layered neuronal structure with a relatively stable
pattern of layer thickness, neurons found in these layers and connections between
neurons.
Figure 6.5: It seems that the Neocortex is not only structured into six layers but
also structured into minicolumns where neurons encode the same feature and macro-
columns consisting of 50-100 minicolumns where each minicolumn has the same re-
ceptive field but represents another feature.
7
Multi Layer Perceptrons
We want the MLP to map each training vector x to the corresponding output vector t (also called the teacher vector). But actually we want more: we also want the MLP to generalize, such that a new - never seen before - input vector x is mapped to the output vector y if f(x) = y.
Note that this desired generalization behavior is not formulated directly in the learning problem! Instead, we can only hope that an MLP trained with some input/output examples will generalize to new input vectors. Later we will discuss some regularization techniques that indeed guide the learning process such that generalization is promoted rather than memorization.
Definition of an error function. For realizing a learning process, an error for individual training samples and for the whole training dataset is defined. The error of an individual training pattern p^i = (x, t) is defined as the sum of squared differences between the desired output values t_b^i and the actual output values y_b^i of the MLP that has processed the input vector with the current set of weights W. The index b denotes the b-th output neuron and the index i stands for the i-th training pattern. This error is defined as:

E = E(W, p^i) = \frac{1}{2} \sum_{b=1}^{B} (t_b^i - y_b^i)^2    (7.2)

where B is the number of output neurons. Why the factor 1/2? You will see in the following that it is introduced just for convenience, to compensate for a factor of 2 when we compute derivatives of this error function.
Now we can also define an error for a given weight configuration W and a whole dataset D consisting of d training patterns:

E_D = E(W, D) = \frac{1}{d} \sum_{i=1}^{d} E(W, p^i)    (7.3)
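As a tiny numerical illustration of Eq. (7.2) and Eq. (7.3), the following sketch computes the pattern errors and the dataset error for some made-up teacher and output vectors (d = 2 patterns, B = 2 output neurons):

import numpy as np

def pattern_error(teacher_vec, output_vec):
    # E(W, p_i) = 1/2 * sum_b (t_b - y_b)^2
    return 0.5 * np.sum((teacher_vec - output_vec) ** 2)

patterns = [(np.array([1.0, 0.0]), np.array([0.8, 0.3])),
            (np.array([0.0, 1.0]), np.array([0.1, 0.7]))]

errors = [pattern_error(t, y) for t, y in patterns]
E_D = np.mean(errors)   # E_D = (1/d) * sum_i E(W, p_i)
print(errors, E_D)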
Figure 7.1: For a single layer Perceptron there are only output neurons and no hidden
neurons. So we can compute an error signal for each neuron. In a Multi Layer
Perceptron there are output neurons and hidden neurons. How shall we define an
error for a hidden neuron? And if we cannot define an error for hidden neurons how
should we change the weights to these neurons?
The basic idea: walk on the error surface in the direction of smaller errors. Consider Fig. 7.2. Given a very small neural network with just two weights w_0 and w_1, the error E for an individual input vector depends on just these two weights. The resulting error value can be plotted as a function E(w_0, w_1) of these two weights. Now, to reduce the error, the idea of gradient descent is to compute the gradient of E at the position on the error surface corresponding to the current weight values (w_0, w_1). The gradient of the error is defined as the vector of partial derivatives:
The gradient of the error is defined as the vector of partial derivatives:
∂E
w0
grad(E) = ∇E = ∂E (7.4)
w1
Since the gradient points into the direction of steepest ascent, we use the negative gradient to "do a step" into the direction of steepest descent. With this approach we change the weights such that the error is reduced. Problem: it might happen that we get stuck in a local minimum of the error surface.
Figure 7.2: Idea of gradient descent. In order to minimize the error E that depends
in this example on two parameters w0 and w1 we compute the gradient at the current
values of w0 and w1 . This results in the red vector. However, this red vector points into
the direction of the steepest ascent. Since we want to minimize E and not maximize
E, we go into the opposite direction which is visualized by the yellow vector.
The weight change for a weight w_{kj} is therefore computed by taking a small step in the opposite direction of the derivative of the error with respect to that weight:

\Delta w_{kj} = -\alpha \frac{\partial E}{\partial w_{kj}}    (7.5)

where \alpha is the gradient descent step width, also called the learn rate.
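The following toy example makes this update rule concrete: it performs gradient descent with numerically estimated gradients on an arbitrary error surface E(w_0, w_1) of my own choosing.

import numpy as np

def E(w):
    return (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2   # toy error surface, minimum at (1, -2)

def numerical_gradient(f, w, eps=1e-5):
    grad = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad

w = np.array([4.0, 3.0])   # start somewhere on the error surface
alpha = 0.1                # learn rate / step width
for _ in range(100):
    w += -alpha * numerical_gradient(E, w)   # delta_w = -alpha * grad(E)

print(w)   # close to the minimum at (1, -2)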
We start by applying the chain rule. The key step in deriving the Backpropagation learning formulas is to apply the chain rule of differential calculus:

\Delta w_{kj} = -\alpha \underbrace{\frac{\partial E}{\partial y_j}}_{3} \underbrace{\frac{\partial y_j}{\partial act_j}}_{2} \underbrace{\frac{\partial act_j}{\partial w_{kj}}}_{1}    (7.6)

The application of the chain rule allows us to follow a divide-and-conquer strategy: the large problem of computing \Delta w_{kj} has been split into three smaller problems, namely computing the factors 1, 2 and 3.
Definition of an error signal. The product of the factors 3 and 2 together with the minus sign is called the error term or error signal in the context of MLPs. Thus the error signal is:

\delta_j := - \underbrace{\frac{\partial E}{\partial y_j}}_{3} \underbrace{\frac{\partial y_j}{\partial act_j}}_{2}    (7.7)

With this definition the weight change can be written as:

\Delta w_{kj} = \alpha \, \delta_j \, \frac{\partial act_j}{\partial w_{kj}}    (7.8)

Factor 1 is the derivative of the activation with respect to the weight. Since act_j = \sum_r w_{rj} y_r, we get

\frac{\partial act_j}{\partial w_{kj}} = \frac{\partial \sum_r w_{rj} y_r}{\partial w_{kj}} = y_k    (7.9)

So the derivative is just y_k! Why? If we differentiate the sum in the numerator with respect to w_{kj}, all other summands can be regarded as constants. So we only have to differentiate w_{kj} y_k with respect to w_{kj}, and the derivative of this term is y_k.
Factor 2 is the derivative of the transfer function:

\underbrace{\frac{\partial y_j}{\partial act_j}}_{2} = \frac{\partial f(act_j)}{\partial act_j} = f'(act_j)    (7.10)
Factor 3 can be computed directly for an output neuron j, since its output y_j appears directly in the error function:

\frac{\partial E}{\partial y_j} = \frac{\partial \frac{1}{2}\sum_{b=1}^{B}(t_b - y_b)^2}{\partial y_j} = \frac{\partial \frac{1}{2}(t_j - y_j)^2}{\partial y_j} = \frac{1}{2} \cdot 2\,(t_j - y_j) \cdot (0 - 1) = y_j - t_j    (7.11)
Great! So for the case of output neurons we have already computed all three factors and can assemble the final weight change formula for output neurons by gluing the results together:

\Delta w_{kj} = -\alpha \frac{\partial E}{\partial y_j}\frac{\partial y_j}{\partial act_j}\frac{\partial act_j}{\partial w_{kj}} = -\alpha \underbrace{(y_j - t_j)}_{3}\,\underbrace{f'(act_j)}_{2}\,\underbrace{y_k}_{1}    (7.12)
For a hidden neuron j, factor 3 cannot be computed directly, since the output of a hidden neuron influences the error only indirectly via the neurons of the following layer. Here the multivariable chain rule (MCR) helps. In its 2D form, for z = f(x, y) with x = x(t) and y = y(t), it states:

\frac{\partial z}{\partial t} = \frac{\partial z}{\partial x}\frac{\partial x}{\partial t} + \frac{\partial z}{\partial y}\frac{\partial y}{\partial t}    (7.13)

More generally, the MCR applies to compositions of f : R^m → R and g : R^n → R^m with g = (u_1, ..., u_m): the partial derivative of f ∘ g with respect to one of its input variables is the sum over all m intermediate variables u_j of \frac{\partial f}{\partial u_j} \cdot \frac{\partial u_j}{\partial t}. Note that in the 2D case of the MCR described above, m was 2 with u_1 = x(t) and u_2 = y(t).
Using the MCR to compute the third factor for hidden neurons. Given the MCR we can now compute the third factor for a neuron j living in layer l + 1 with the help of all the error signals of the following neurons i living in layer l + 2:

\frac{\partial E}{\partial y_j} = \frac{\partial E}{\partial y_j^{l+1}} = \sum_{i=1}^{N} \frac{\partial E}{\partial y_i^{l+2}} \frac{\partial y_i^{l+2}}{\partial y_j^{l+1}}    (7.16)

How can this be interpreted? We want to compute how the error E changes if we change the output of neuron j in layer l + 1. For this we applied the MCR and compute instead how the error E changes if we change the outputs of the neurons i in the next layer l + 2, and how their outputs y_i^{l+2} change if we change the output of neuron j in layer l + 1, i.e., y_j^{l+1}.
Expanding the right-hand side of Eq. (7.16) further using the definition of the error signal gives:

\sum_{i=1}^{N} \frac{\partial E}{\partial y_i^{l+2}} \frac{\partial y_i^{l+2}}{\partial y_j^{l+1}} = \sum_{i=1}^{N} \underbrace{\frac{\partial E}{\partial y_i^{l+2}} \frac{\partial y_i^{l+2}}{\partial act_i^{l+2}}}_{= -\delta_i^{l+2}} \frac{\partial act_i^{l+2}}{\partial y_j^{l+1}}    (7.17)

= -\sum_{i=1}^{N} \delta_i^{l+2} \frac{\partial act_i^{l+2}}{\partial y_j^{l+1}}    (7.18)

= -\sum_{i=1}^{N} \delta_i^{l+2} \frac{\partial \sum_{r=1}^{R} w_{ri}^{l+1} y_r^{l+1}}{\partial y_j^{l+1}}    (7.19)

= -\sum_{i=1}^{N} \delta_i^{l+2} \, w_{ji}^{l+1}    (7.20)
The battle of indices is over! Now let us also glue together the three factors that we have computed for the case of a hidden neuron:

\Delta w_{kj} = -\alpha \frac{\partial E}{\partial y_j}\frac{\partial y_j}{\partial act_j}\frac{\partial act_j}{\partial w_{kj}} = -\alpha \underbrace{\left(-\sum_{i=1}^{N} \delta_i^{l+2} w_{ji}^{l+1}\right)}_{3} \underbrace{f'(act_j^{l+1})}_{2} \underbrace{y_k^{l}}_{1}    (7.21)
So the difference regarding the weight update formula between output neurons and
non-output neurons is only the error signal! The rest is the same. The error signal for
neuron j refers to the N error signals from the neurons i in the following layer which
have to be computed first.
Why the name Backpropagation? It is the error signals \delta_i that are propagated back through the network to neurons in previous layers, and this gave the learning algorithm its name: Backpropagation of error signals. Since we need the error signals from layer l + 2 to compute the error signal for a neuron in layer l + 1, we first have to compute the error signals for the output neurons; then we can compute the error signals for the neurons in the previous layer, and so forth. So the layer-wise update scheme during the error signal computation phase runs in the opposite direction compared to the feedforward step: we start at the last layer and move towards the input layer.
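The following is a compact, vectorized sketch of exactly this order of computations for a tiny 2-3-2 MLP with a sigmoid hidden layer and identity output neurons. The architecture and variable names are my own; the book's loop-based train() method of the mlp class follows later in this chapter.

import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

rng = np.random.RandomState(0)
W = [rng.uniform(-1, 1, size=(3, 3)),   # bias + 2 inputs  -> 3 hidden neurons
     rng.uniform(-1, 1, size=(4, 2))]   # bias + 3 hidden  -> 2 output neurons

def forward(x):
    h = sigmoid(np.append(1.0, x) @ W[0])   # hidden layer outputs
    y = np.append(1.0, h) @ W[1]            # identity output layer
    return h, y

def backprop_step(x, t, alpha=0.1):
    h, y = forward(x)
    # 1. error signals of the output layer: delta = -(y - t) * f'(act), f = identity
    delta_out = -(y - t)
    # 2. propagate backwards: delta_hidden_j = (sum_i delta_out_i * w_ji) * f'(act_j)
    #    (f' of the sigmoid is h * (1 - h); W[1][1:, :] skips the bias row)
    delta_hidden = (W[1][1:, :] @ delta_out) * h * (1.0 - h)
    # 3. weight changes: delta_w_kj = alpha * delta_j * output of sending neuron k
    W[0] += alpha * np.outer(np.append(1.0, x), delta_hidden)
    W[1] += alpha * np.outer(np.append(1.0, h), delta_out)

backprop_step(np.array([0.2, 0.7]), np.array([1.0, 0.0]))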
We have now derived all weight update formulas needed to adapt the weights of a given MLP. Note that the approach presented here is called Stochastic Gradient Descent (SGD) with one training sample: one training sample is fed forward, then the weight changes are computed for all weights and directly applied. This means that the gradient descent steps are based on the evaluation of just one training sample from the dataset D.
SGD and BGD walk different paths. SGD and Batch Gradient Descent (BGD) take different paths towards a local minimum of the error surface. Due to its stochastic nature, SGD often walks a "zig-zag" path towards a local minimum, while the path chosen by BGD is smoother and more direct. The term "stochastic" goes back to the fact that the error gradient based on a single training sample, as used in SGD, can be considered a "stochastic approximation" of the true error gradient computed by BGD. SGD works well in practice, and BGD is only rarely used. One reason is probably that for error surfaces with many local maxima and minima, the noisier gradient computed by SGD allows escaping from a local minimum, which BGD cannot.
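Here is a schematic comparison of the two update schemes for a toy 1D linear model with a squared error loss. The data and the model are my own illustrative choices and are not part of the MLP code below.

import numpy as np

rng = np.random.RandomState(1)
xs = rng.rand(100)
ys = 3.0 * xs + 0.1 * rng.randn(100)   # "true" slope is 3.0 plus some noise

def grad_E(w, x, y):
    # gradient of 1/2 * (w*x - y)^2 with respect to w, for a single sample
    return (w * x - y) * x

alpha = 0.1

# Stochastic Gradient Descent: one weight update per training sample
w_sgd = 0.0
for epoch in range(20):
    for i in rng.permutation(len(xs)):
        w_sgd -= alpha * grad_E(w_sgd, xs[i], ys[i])

# Batch Gradient Descent: one weight update per epoch, using the averaged gradient
w_bgd = 0.0
for epoch in range(200):
    w_bgd -= alpha * np.mean([grad_E(w_bgd, x, y) for x, y in zip(xs, ys)])

print(w_sgd, w_bgd)   # both end up close to the true slope of 3.0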
18 RADIUS_SAMPLE = 3
19 COLOR_CLASS0 = (255,0,0)
20 COLOR_CLASS1 = (0,0,255)
21 NR_TEST_SAMPLES = 10000
22
23 # for saving images
24 image_counter = 0
25
26
27 def visualize_decision_boundaries(the_mlp, epoch_nr):
28
29 global image_counter
30
31 # 1. generate empty white color image
32 img = np.ones((WINSIZE, WINSIZE, 3), np.uint8) * 255
33
34 for i in range(NR_TEST_SAMPLES):
35
36 # 2. generate random coordinate in [0,1) x [0,1)
37 rnd_x = np.random.rand()
38 rnd_y = np.random.rand()
39
40 # 3. prepare an input vector
41 input_vec = np.array( [rnd_x, rnd_y] )
42
43 # 4. now do a feedforward step with that
44 # input vector, i.e., compute the output values
45 # of the MLP for that input vector
46 the_mlp.feedforward( input_vec )
47
48 # 5. now get the predicted class from the output
49 # vector
50 output_vec = the_mlp.get_output_vector()
51 class_label = 0 if output_vec[0]>output_vec[1] else 1
52
53 # 6. Map class label to a color
54 color = COLOR_CLASS0 if class_label == 0 else COLOR_CLASS1
55
56 # 7. Draw circle
57 sample_coord = (int(input_vec[0] * WINSIZE),
58 int(input_vec[1] * WINSIZE))
59 cv2.circle(img, sample_coord, RADIUS_SAMPLE, color)
60
61 # 8. show image with decision boundaries
62 cv2.rectangle(img, (WINSIZE-120,0), (WINSIZE-1,20),
63 (255,255,255), -1)
64 font = cv2.FONT_HERSHEY_SIMPLEX
65 cv2.putText(img,
66 "epoch #"+str(epoch_nr).zfill(3),
67 (WINSIZE - 110, 15), font, 0.5, (0, 0, 0), 1,
68 cv2.LINE_AA)
69 cv2.imshow('Decision boundaries of trained MLP', img)
70 c = cv2.waitKey(1)
71
72 # 9. save that image?
73 if True:
74 filename = "V:/tmp/img_{0:0>4}".format(image_counter)
75 image_counter +=1
76 cv2.imwrite(filename + ".png", img)
77
78
79
80 # 1. create a new MLP
81 my_mlp = mlp()
82
83
84 # 2. build a MLP
85 my_mlp.add_layer(2, TF.identity)
86 my_mlp.add_layer(10, TF.sigmoid)
87 my_mlp.add_layer(6, TF.sigmoid)
88 my_mlp.add_layer(2, TF.identity)
89
90
91 # 3. generate training data
92 my_dg = data_generator()
93 data_samples = \
94 my_dg.generate_samples_two_class_problem(NR_CLUSTERS,
95 NR_SAMPLES_TO_GENERATE)
96 nr_samples = len(data_samples)
97
98
99 # 4. generate empty image for visualization
100 # initialize image with white pixels (255,255,255)
101 img = np.ones((WINSIZE, WINSIZE, 3), np.uint8) * 255
102
103
104 # 5. visualize positions of samples
105 for i in range(nr_samples):
106
107 # 5.1 get the next data sample
108 next_sample = data_samples[i]
109
110 # 5.2 get input and output vector
111 # (which are both NumPy arrays)
112 input_vec = next_sample[0]
113 output_vec = next_sample[1]
114
115 # 5.3 prepare a tupel from the NumPy input vector
116 sample_coord = (int(input_vec[0]*WINSIZE),
117 int(input_vec[1]*WINSIZE))
118
119 # 5.4 get class label from output vector
120 if output_vec[0]>output_vec[1]:
121 class_label = 0
122 else:
123 class_label = 1
124 color = (0,0,0)
125 if class_label == 0:
126 color = COLOR_CLASS0
127 elif class_label == 1:
128 color = COLOR_CLASS1
129
130 # 5.5
131 cv2.circle(img, sample_coord, RADIUS_SAMPLE, color)
132
133
134 # 6. show visualization of samples
135 cv2.imshow('Training data', img)
136 c = cv2.waitKey(1)
137 cv2.imwrite("V:/tmp/training_data.png", img)
138 #cv2.destroyAllWindows()
139
140
141 # 7. now train the MLP
142 my_mlp.set_learn_rate( LEARN_RATE )
143 for epoch_nr in range(NR_EPOCHS):
144
145 print("Training epoch#", epoch_nr)
146
147 sample_indices = np.arange(nr_samples)
148 np.random.shuffle(sample_indices)
149
150 for train_sample_nr in range(nr_samples):
151
152 # get index of next training sample
153 index = sample_indices[train_sample_nr]
154
155 # get that training sample
156 next_sample = data_samples[index]
157
158 # get input and output vector
159 # (which are both NumPy arrays)
160 input_vec = next_sample[0]
161 output_vec = next_sample[1]
162
163 # train the MLP with that vector pair
164 my_mlp.train(input_vec, output_vec)
165
166 print("\nMLP state after training epoch #",epoch_nr,":")
167 my_mlp.show_weight_statistics()
168 my_mlp.show_neuron_states()
169 visualize_decision_boundaries(my_mlp, epoch_nr)
170 #input("Press Enter to train next epoch")
171
172 print("MLP test finished.")
43 derivative_sigmoid = np.vectorize(derivative_sigmoid)
44 derivative_relu = np.vectorize(derivative_relu)
45 derivative_skew_ramp = np.vectorize(derivative_squared)
46
47
48 class TF:
49 identity = 1
50 sigmoid = 2
51 relu = 3
52 squared = 4
53
54
55
56
57 class mlp:
58
59 nr_layers = 0
60 nr_neurons_per_layer = []
61 tf_per_layer = []
62 weight_matrices = []
63 neuron_act_vecs = []
64 neuron_out_vecs = []
65
66 learn_rate = 0.01
67 neuron_err_vecs = []
68
69 def __init__(self):
70 print("Generated a new empty MLP")
71
72
73 """
74 Returns the output vector of the MLP
75 as a NumPy array
76 """
77 def get_output_vector(self):
78
79 return self.neuron_out_vecs[len(self.neuron_out_vecs)-1]
80
81
82
83 def show_architecture(self):
84
85 print("MLP architecture is now: ", end=" ")
86
87 for i in range(self.nr_layers):
88 print(str(self.nr_neurons_per_layer[i]), end=" ")
89
90 print("\n")
91
92
93 """
94 Adds a new layer of neurons
95 """
96 def add_layer(self, nr_neurons, transfer_function):
97
98 # 1. store number of neurons of this new layer
99 # and type of transfer function to use
100 self.nr_neurons_per_layer.append( nr_neurons )
101 self.tf_per_layer.append( transfer_function )
102
103 # 2. generate a weight matrix?
104 if self.nr_layers>=1:
105
106 # 2.1 how many neurons are there in the
107 # previous layer?
108 nr_neurons_before =\
109 self.nr_neurons_per_layer[self.nr_layers-1]
110
111 # 2.2 initialize weight matrix with random
112 # values from (0,1)
113 # Do not forget the BIAS input for each
114 # neuron! For this: nr_neurons_before + 1
115 W = np.random.uniform(low=-1.0, high=1.0,
116 size=(nr_neurons_before+1,nr_neurons))
117
118 # 2.3 store the new weight matrix
119 self.weight_matrices.append(W)
120
121 # 2.4 output some information about the
122 # weight matrix just generated
123 print("Generated a new weight matrix W. Shape is",
124 W.shape)
125 size = W.nbytes/1024.0
126 print("Size of weight matrix in KB"
127 " is {0:.2f}".format(size))
128
129
130 # 3. generate a new neuron activity and
131 # # neuron output vector
132 act_vec = np.zeros(nr_neurons)
133 out_vec = np.zeros(nr_neurons)
134 err_vec = np.zeros(nr_neurons)
135 self.neuron_act_vecs.append( act_vec )
136 self.neuron_out_vecs.append( out_vec )
137 self.neuron_err_vecs.append( err_vec )
138
139 # 4. update number of layers
140 self.nr_layers += 1
141
142 # 5. show current MLP architecture
143 self.show_architecture()
144
145
146 """
147 Given an input vector, we compute
148 the output of all the neurons layer by layer
149 into the direction of the output layer
150 """
151 def feedforward(self, input_vec):
152
153 # 1. set output of neurons from first layer
154 # to input vector values
155 N = len(input_vec)
156 self.neuron_out_vecs[0] = input_vec
157
158 # 2. now compute neuron outputs layer by layer
159 for layer_nr in range(1,self.nr_layers):
160
161 # 2.1 get output vector previously computed
162 o = self.neuron_out_vecs[layer_nr-1]
163
164 # 2.2 add bias input
165 o = np.append([1], o)
166
167 # 2.3 vectors are one-dimensional
168 # but for matrix*matrix multiplication we need
169 # a matrix in the following
170 N = len(o)
171 o_mat = o.reshape(1,N)
172
173 # 2.4 now get the right weight matrix
174 W = self.weight_matrices[layer_nr-1]
175
176 # 2.5 compute the product of the output (vector)
177 # and the weight matrix to get the output values
178 # of neurons in the current layer
179 act_mat_this_layer = np.matmul(o_mat,W)
180
181 # 2.6 apply transfer function
182 if self.tf_per_layer[layer_nr]==TF.sigmoid:
183 out_mat_this_layer =\
184 func_sigmoid(act_mat_this_layer)
185 elif self.tf_per_layer[layer_nr]==TF.identity:
186 out_mat_this_layer =\
187 func_identity(act_mat_this_layer)
188 elif self.tf_per_layer[layer_nr]==TF.relu:
189 out_mat_this_layer = \
190 func_relu(act_mat_this_layer)
191 elif self.tf_per_layer[layer_nr]==TF.squared:
192 out_mat_this_layer = \
193 func_squared(act_mat_this_layer)
194
195 # 2.7 store activity and output of neurons
196 self.neuron_act_vecs[layer_nr] = \
197 act_mat_this_layer.flatten()
198 self.neuron_out_vecs[layer_nr] = \
199 out_mat_this_layer.flatten()
200
201
202
203 """
204 Show output values of all neurons
205 in the specified layer
206 """
207 def show_output(self, layer):
208
209 print("output values of neuron in layer",layer,":",
210 self.neuron_out_vecs[layer])
211
212
213 """
214 Shows some statistics about the weights,
215 e.g. what is the maximum and the minimum weight in
216 each weight matrix
217 """
218 def show_weight_statistics(self):
219
220 for layer_nr in range(0,self.nr_layers-1):
221 W = self.weight_matrices[layer_nr]
222 print("Weight matrix for weights from"
223 "layer #",layer_nr,"to layer #",
224 layer_nr+1, ":")
225 print("\t shape:", W.shape)
226 print("\t min value: ", np.amin(W))
227 print("\t max value: ", np.amax(W))
228 print("\t W", W)
229 print("\n")
230
231
232 """
233 Show state of neurons (activity and output values)
234 """
235 def show_neuron_states(self):
236
237 for layer_nr in range(0, self.nr_layers):
238 print("Layer #", layer_nr)
239 print("\t act:", self.neuron_act_vecs[layer_nr])
240 print("\t out:", self.neuron_out_vecs[layer_nr])
241 print("\n")
242
243
244 """
245 Set a new learn rate which is used in the
246 weight update step
247 """
248 def set_learn_rate(self, new_learn_rate):
249 self.learn_rate = new_learn_rate
250
251
252 """
253 Given a pair (input_vec, teacher_vec) we adapt
254 the weights of the MLP such that the desired output vector
255 (which is the teacher vector)
256 is more likely to be generated the next time if the
257 input vector is presented as input
258
259 Note: this is the Backpropagation learning algorithm!
260 """
261 def train(self, input_vec, teacher_vec):
262
263 # 1. first do a feedfoward step with the input vector
264 self.feedforward(input_vec)
265
266 # 2. first compute the error signals for the output
267 # neurons
268 tf_type = self.tf_per_layer[self.nr_layers-1]
269 nr_neurons = self.nr_neurons_per_layer[self.nr_layers-1]
270 act_vec = self.neuron_act_vecs[self.nr_layers-1]
271 out_vec = self.neuron_out_vecs[self.nr_layers-1]
272 err_vec = -(out_vec-teacher_vec)
273 if tf_type==TF.sigmoid:
274 err_vec *= derivative_sigmoid(act_vec)
275 elif tf_type==TF.identity:
276 err_vec *= derivative_identity(act_vec)
277 elif tf_type==TF.relu:
278 err_vec *= derivative_relu(act_vec)
279 elif tf_type==TF.squared:
280 err_vec *= derivative_squared(act_vec)
281 self.neuron_err_vecs[self.nr_layers-1] = err_vec
282
283 # 3. now go from layer N-1 to layer 2 and
284 # compute for each hidden layer the
285 # error signals for each neuron
286
287 # going layer for layer backwards ...
288 for layer_nr in range(self.nr_layers-2, 0, -1):
289
290 nr_neurons_this_layer = \
291 self.nr_neurons_per_layer[layer_nr]
292 nr_neurons_next_layer = \
293 self.nr_neurons_per_layer[layer_nr+1]
294 W = self.weight_matrices[layer_nr]
295 act_vec = self.neuron_act_vecs[layer_nr]
296 tf_type = self.tf_per_layer[layer_nr]
297
298 # run over all neurons in this layer ...
299 for neuron_nr in range(0,nr_neurons_this_layer):
300
301 # compute the sum of weighted error signals from
302 # neurons in the next layer
303 sum_of_weighted_error_signals = 0.0
304
305 # run over all neurons in next layer ...
306 for neuron_nr2 in range (0,nr_neurons_next_layer):
307
308 # get error signal for neuron_nr2 in next layer
309 err_vec = self.neuron_err_vecs[layer_nr+1]
310 err_signal = err_vec[neuron_nr2]
311
312 # get weight from
313 # neuron_nr in layer_nr to
314 # neuron_nr2 in layer_nr+1
315 #
316 # Important:
317 # at W[0][neuron_nr2] is the bias
318 # weight to neuron_nr2
319 # at W[1][neuron_nr2] is the first
320 # "real" weight to neuron_nr2
321 weight = W[neuron_nr+1][neuron_nr2]
322
323 # update sum
324 sum_of_weighted_error_signals +=\
325 err_signal * weight
326
327 # compute and store error signal for
328 # neuron with id neuron_nr in this layer
329 err_signal = sum_of_weighted_error_signals
330 if tf_type == TF.sigmoid:
331 err_signal *= \
332 derivative_sigmoid(act_vec[neuron_nr])
333 elif tf_type == TF.identity:
334 err_signal *= \
335 derivative_identity(act_vec[neuron_nr])
336 elif tf_type == TF.relu:
337 err_signal *= \
338 derivative_relu(act_vec[neuron_nr])
339 elif tf_type == TF.squared:
340 err_signal *= \
341 derivative_squared(act_vec[neuron_nr])
342 self.neuron_err_vecs[layer_nr][neuron_nr] =\
343 err_signal
344
345
346 # 4. now that we have the error signals for all
347 # neurons (hidden and output neurons) in the net
348 # computed, let’s change the weights according to
349 # the weight update formulas
350 for layer_nr in range(self.nr_layers - 1, 0, -1):
351
352 nr_neurons_this_layer = \
353 self.nr_neurons_per_layer[layer_nr]
354 nr_neurons_prev_layer = \
355 self.nr_neurons_per_layer[layer_nr-1]
356
357 for neuron_nr in range(0, nr_neurons_this_layer):
358
359 # get error signal for this neuron
360 err_signal = \
361 self.neuron_err_vecs[layer_nr][neuron_nr]
362
363 for weight_nr in range(0, nr_neurons_prev_layer+1):
364
365 # get output value of sending neuron
366 out_val_sending_neuron = 1
367 if weight_nr>0:
368 out_val_sending_neuron = \
369 self.neuron_out_vecs[layer_nr-1][weight_nr-1]
370
371 # compute weight change
372 weight_change = \
373 self.learn_rate * \
374 err_signal * \
375 out_val_sending_neuron
376
377 self.weight_matrices[layer_nr-1][weight_nr][neuron_nr] += \
378 weight_change
11
12 np.random.seed(4)
13 CLUSTER_RADIUS = 0.2
14
15 # 1. generate random cluster coordinates
16 clusters = []
17 for i in range(nr_samples_to_generate):
18
19 # 1.1 generate random cluster center
20 center_x = np.random.rand()
21 center_y = np.random.rand()
22
23 # 1.2 store that center
24 clusters.append( np.array([center_x,center_y]) )
25
26
27 # 2. generate random samples
28 data_samples = []
29 for i in range(nr_samples_to_generate):
30
31 # 2.1 generate random coordinate
32 rnd_x = np.random.rand()
33 rnd_y = np.random.rand()
34 rnd_coord = np.array( [rnd_x,rnd_y] )
35
36 # 2.2 check whether that coordinate is
37 # near to a cluster
38 # if yes, we say it belongs to class 1
39 # if no, we say it belongs to class 0
40 class_label = 0
41 for j in range(nr_clusters):
42
43 # get cluster coordinates
44 cluster_coords = clusters[j]
45
46 # compute distance of sample (rnd_x,rnd_y) to
47 # cluster coordinates (center_x,center_y)
48 dist = np.linalg.norm( cluster_coords - rnd_coord )
49
50 # is the sample near to that cluster
51 if dist < CLUSTER_RADIUS:
52 class_label = 1
53 break
54
55 # 2.3 store the sample
56 input_vec = np.array([rnd_x, rnd_y])
57 output_vec = np.array([1-class_label, class_label])
58 data_samples.append( [input_vec,output_vec] )
59
60
Figure 7.3: These are the final weight update formulas for the Backpropagation learn-
ing algorithm.
Two further transfer functions. Fig. 7.4 shows the graphs of two further transfer
functions which are now often used. Note that they are not biologically plausible
since they are unbounded. However, the advantage of both is that there is no need to
compute the exponential function which is computationally costly.
Figure 7.5: Decision boundaries of a MLP after epoch 0-18. The training data used
is shown in the last image. The MLP is a 2id-10sig-6sig-2id network with 2 input
neurons with the identity as transfer function, 10 neurons in the first hidden layer
with sigmoid transfer functions, 6 neurons in the second hidden layer with sigmoid
transfer functions and 2 output neurons with identity as transfer function.
Figure 7.6: Decision boundaries of a MLP after epoch 0-18. The training data used is
shown in the last image. The MLP is a 2id-10relu-6relu-2id network with the ReLU
as transfer function in the hidden layers.
No advantage of using several layers. We have seen in section 5.4 that a single layer Perceptron can only separate the input space into two half-spaces. We could expect that this no longer holds for an MLP with several layers. However, we can easily show that even if we use several layers of neurons with identity transfer functions, we still end up with a classifier that can only divide the input space into two half-spaces. Fig. 7.8 illustrates why: each MLP whose neurons use just the identity as transfer function can be transformed successively into a single layer Perceptron which produces the same output. Since single layer Perceptrons are known to be just linear separators, the same must hold for an MLP with only linear transfer functions. This is the reason why we need non-linear transfer functions in the hidden layers.
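The following quick NumPy check illustrates this collapse: two stacked layers with identity transfer functions are equivalent to a single layer whose weight matrix is the product of the two (bias terms omitted for brevity, values random).

import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(1, 2)      # input vector
W1 = rng.rand(2, 10)    # "hidden" layer with identity transfer function
W2 = rng.rand(10, 2)    # output layer with identity transfer function

two_layer_output = (x @ W1) @ W2
single_layer_output = x @ (W1 @ W2)   # the equivalent single layer Perceptron

print(np.allclose(two_layer_output, single_layer_output))   # True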
Figure 7.7: Decision boundaries of a MLP after epoch 0-18. The training data used
is shown in the last image. The MLP is a 2id-10id-6id-2id network that uses only the
identity as transfer function in all layers.
Figure 7.8: An MLP with hidden neurons that use just the identity as transfer function does not give us any advantage over a single layer Perceptron, since for each such MLP we can find a single layer Perceptron (here: with weights w_1''', w_2''', b''') that produces the same output. Since single layer Perceptrons are known to be limited to separating the input space by a hyperplane into two half-spaces, the same has to hold for an MLP that uses just linear transfer functions.
8
TensorFlow
8.1 Introduction
Basic idea behind TensorFlow
The basic idea behind Google's Deep Learning library TensorFlow is to describe all computations as a graph. Nodes in this (directed) graph describe operations on the data, and the data itself is packed into n-dimensional arrays. In the TensorFlow jargon these n-dimensional arrays are called tensors. Since the data is passed from one node to the next, it is the data that "flows" through the graph - hence the name TensorFlow. Note that the tensors in a computation graph can have different shapes: e.g., there can be 0-dimensional tensors ("scalars"), 1-dimensional tensors ("vectors"), 2-dimensional tensors ("matrices"), 3-dimensional tensors ("cuboids of data"), etc.
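As a small illustration of tensors of different ranks (written against the TensorFlow 1.x API that is used throughout this chapter; the concrete values are arbitrary):

import tensorflow as tf

scalar = tf.constant(3.0)                       # 0-dimensional tensor ("scalar")
vector = tf.constant([1.0, 2.0, 3.0])           # 1-dimensional tensor ("vector")
matrix = tf.constant([[1.0, 2.0], [3.0, 4.0]])  # 2-dimensional tensor ("matrix")
cube   = tf.zeros([2, 3, 4])                    # 3-dimensional tensor

for t in [scalar, vector, matrix, cube]:
    print(t.shape)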
Version check
At the time of writing (November 2017) the newest version of TensorFlow was 1.4.0. To quickly check which version you have installed, type:
1 import tensorflow as tf
2 print("Your TF version is", tf.__version__)
You can also query which compute devices are available to TensorFlow (for example by collecting the device names returned by device_lib.list_local_devices() from tensorflow.python.client). On a machine without a GPU the result looks like:
1 ['/cpu:0']
But if you have a GPU available, the result could also be:
1 ['/cpu:0', '/gpu:0']
Types of tensors
In TensorFlow there are three different types of tensors.
tf.Variable is a tensor that is used to store a parameter value, e.g., a single weight, a vector of weights or a whole weight matrix that will be adapted during a training procedure:
1 a = tf.Variable(3, name='var1')
2 b = tf.Variable([1.0,2.0,3.0], name='var2')
3 c = tf.Variable([[1,0],[0,1]], name='var3')
4 print(a)
5 print(b)
6 print(c)
Output:
Note that the print() calls do not output the value of a tensor, but type information. The name which we defined for each variable is shown in the output, augmented by a ":0". What does this mean? The suffix ":0" indicates that the tensor is output number 0 of the operation that produces it (here: the variable op itself). Independently of this naming scheme, TensorFlow also allows us to construct several computation graphs.
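A minimal sketch (not from the original listing) of how a second, independent graph can be created and used:

import tensorflow as tf

g = tf.Graph()                        # a second, independent computation graph
with g.as_default():
    a = tf.Variable(3, name='var1')   # this variable lives in graph g
    print(a.name)                     # again 'var1:0' (output 0 of op 'var1')

print(tf.get_default_graph() is g)    # False: the default graph is unaffected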
tf.constant is another tensor type and is used for storing constant values:
1 d = tf.constant(3.14159, name='pi')
Output:
tf.placeholder is the third tensor type and is used as a placeholder for data that will be filled in later, e.g., for an input image during training.
Running nodes in a session. Defining tensors and operations only builds the computation graph. To actually compute values, we have to run the corresponding node in a tf.Session after initializing all variables:
1 a = tf.Variable(3.0)
2 b = tf.Variable(4.0)
3 c = tf.multiply(a, b)
4 print(c)
5 with tf.Session() as my_session:
6 my_session.run(tf.global_variables_initializer())
7 resulting_tensor = my_session.run(c)
8 print("resulting tensor=",resulting_tensor)
Feeding placeholders. Concrete values for placeholders are supplied at run time via the feed_dict argument of Session.run():
1 a = tf.placeholder(tf.float32)
2 b = tf.placeholder(tf.float32)
3 adder_node = a + b # shortcut for tf.add(a, b)
4 with tf.Session() as my_session:
5 my_session.run(tf.global_variables_initializer())
6 print(my_session.run(adder_node, {a:3, b:4} ))
7 print(my_session.run(adder_node, {a: [1,2], b: [3,4]} ))
8 print(my_session.run(adder_node, {a: [[1, 1],[1,1]], b: [[1, 1],[0,1]]}
))
Output:
1 7.0
2 [ 4. 6.]
3 [[ 2. 2.]
4 [ 1. 2.]]
Note again that printing a node such as d only shows type information; the multiplication and the addition are executed only when the node is run in a session:
1 a = tf.Variable(2)
2 b = tf.Variable(5)
3 c = tf.multiply(a,b)
4 d = tf.add(a,c)
5 print(d)
Saving variables. With a tf.train.Saver object the current values of all variables can be written to a checkpoint file:
1 a = tf.Variable(3)
2 b = tf.Variable(15, name="variable-b")
3 saver = tf.train.Saver()
4 print("type of save is ",type(saver))
5 with tf.Session() as my_session:
6     my_session.run(tf.global_variables_initializer())
7     print("a:", my_session.run(a))
8     print("b:", my_session.run(b))
9     saver.save(my_session, "V:/tmp/my_model.ckpt")  # save call added here; same path as used for restoring below
Restoring variables. The stored values can later be loaded back from the checkpoint file into the variables:
1 a = tf.Variable(0)
2 b = tf.Variable(0, name="variable-b")
3 saver = tf.train.Saver()
4 print("type of save is ",type(saver))
5 with tf.Session() as my_session:
6 saver.restore(my_session, "V:/tmp/my_model.ckpt")
7 print("a:", my_session.run(a))
8 print("b:", my_session.run(b))
2 a: 3
3 b: 15
Note that it is important for restoring variables that we use the same computation
graph. Let’s see what happens if we just change the name of the second variable:
1 a = tf.Variable(0)
2 b = tf.Variable(0, name="variable-c")
3 saver = tf.train.Saver()
4 print("type of save is ",type(saver))
5 with tf.Session() as my_session:
6 saver.restore(my_session, "V:/tmp/my_model.ckpt")
7 print("a:", my_session.run(a))
8 print("b:", my_session.run(b))
1 NotFoundError (see above for traceback): Key variable-c not found in checkpoint
Visualizing the computation graph. If we give the variables and operations explicit names and write the graph to disk with a tf.summary.FileWriter, we can later inspect the graph with TensorBoard:
1 a = tf.Variable(3, name="var-a")
2 b = tf.Variable(4, name="var-b")
3 c = tf.Variable(5, name="var-c")
4 d = tf.multiply(a,b, name="op-multiply")
5 e = tf.add(c,d, name="op-add")
6 with tf.Session() as my_session:
7 my_session.run(tf.global_variables_initializer())
8 print (my_session.run(d))
9 fw = tf.summary.FileWriter("V:/tmp/summary", my_session.graph)
Tracking values with summaries. Scalar values, e.g., the current value of the variable a in the following example, can be tracked during a session with tf.summary.scalar() and written to disk with a FileWriter:
1 import random
2
3 # the value of a will be incremented by some
4 # placeholder value
5 a = tf.Variable(42, name="var-a")
6 rndnumber_placeholder = \
7 tf.placeholder(tf.int32, shape=[], name="rndnumber_placeholder")
8 update_node = tf.assign(a,tf.add(a, rndnumber_placeholder))
9
10 # create a summary to track value of a
11 tf.summary.scalar("Value-of-a", a)
12
13 # in case we want to track multiple summaries
14 # merge all summaries into a single operation
15 summary_op = tf.summary.merge_all()
16
17 with tf.Session() as my_session:
18 my_session.run(tf.global_variables_initializer())
19 fw = tf.summary.FileWriter("V:/tmp/summary", my_session.graph)
20
21 # generate random numbers that are used
22 # as values for the placeholder
23 for step in range(500):
24
25 rndnum = int(-10 + random.random() * 20)
26 new_value_of_a = \
27 my_session.run(update_node,
28 feed_dict={rndnumber_placeholder: rndnum})
29
30 print("new_value_of_a=", new_value_of_a)
31
32 # compute summary
33 summary = my_session.run(summary_op)
34
35 # add merged summaries to filewriter,
36 # this will save the data to the file
37 fw.add_summary(summary, step)
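The summary files written by the FileWriter can then be visualized with TensorBoard, which is installed together with TensorFlow. A typical invocation (details may vary with the TensorFlow version) is tensorboard --logdir V:/tmp/summary on the command line; the graph and the tracked scalar "Value-of-a" can then be inspected in a browser at http://localhost:6006.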
30
31 # 3.1 Now define what to optimize at all:
32 # here we want to minimize the mean squared error (MSE).
33 # This is our "loss" of a certain line model
34 # Note: this is just another node in the
35 # computation graph that computes something
36 loss_func = tf.reduce_mean(tf.square(y - y_data))
37
38 # 3.2 We can use different optimizers in TF for model learning
39 my_optimizer = tf.train.GradientDescentOptimizer(0.5)
40
41 # 3.3 Tell the optimizer object to minimize the loss function
42 train = my_optimizer.minimize(loss_func)
43
44 with tf.Session() as my_session:
45
46 my_session.run(tf.global_variables_initializer())
47
48 # 4. Print initial value of W and b
49 print("\n")
50 print("initial W", my_session.run(W))
51 print("initial b", my_session.run(b))
52
53 # 5. Do 201 gradient descent steps...
54 print("\n")
55 for step in range(201):
56
57 # Do another gradient descent step to come to a better
58 # W and b
59 my_session.run(train)
60
61 # From time to time, print the current value of W and b
62 if step % 10 == 0:
63 print(step, my_session.run(W), my_session.run(b))
1 initial W [ 0.28849196]
2 initial b [ 0.]
3
4
5 0 [ 0.87854469] [ 1.10748446]
6 10 [ 1.0283134] [ 0.78003669]
7 20 [ 1.14134669] [ 0.72459263]
8 30 [ 1.19241428] [ 0.69954348]
9 40 [ 1.21548605] [ 0.68822652]
10 50 [ 1.22590971] [ 0.68311363]
11 60 [ 1.23061907] [ 0.68080366]
12 70 [ 1.2327466] [ 0.67976004]
13 80 [ 1.2337079] [ 0.67928857]
14 90 [ 1.23414207] [ 0.67907554]
15 100 [ 1.23433828] [ 0.67897934]
1 ’’’
2 Minimalistic example of a MLP in TensorFlow
3 ’’’
4
5 import numpy as np
6 from data_generator import data_generator
7 import tensorflow as tf
8 import cv2
9 from timeit import default_timer as timer
10
11
12 # test data parameters
13 WINSIZE = 600
14 NR_CLUSTERS = 5
15 NR_SAMPLES_TO_GENERATE = 10000
16
17 # MLP parameters
18 NR_EPOCHS = 2000
19
20 # for RELU transfer function use smaller learn rate
21 # than for logistic transfer function
22 # Also use more hidden neurons! (e.g. 2-30-12-2)
23 #LEARN_RATE = 0.1
24
25 # for logistic transfer function
26 LEARN_RATE = 0.5
27
28 MINI_BATCH_SIZE = 100
29 NR_NEURONS_INPUT = 2
30 NR_NEURONS_HIDDEN1 = 10 # nr of neurons in 1st hidden layer
31 NR_NEURONS_HIDDEN2 = 6 # nr of neurons in 2nd hidden layer
32 NR_NEURONS_OUTPUT = 2
33
34 # store 2D weight matrices & 1D bias vectors for all
185
186 # now get the predicted class from the output
187 # vector
188 class_label = 0 if output_vec[0] > output_vec[1] else 1
189
190 # map class label to a color
191 color = COLOR_CLASS0 if class_label == 0 else COLOR_CLASS1
192
193 # draw circle
194 sample_coord = (int(input_vec[0] * WINSIZE),
195 int(input_vec[1] * WINSIZE))
196 cv2.circle(img, sample_coord, RADIUS_SAMPLE, color)
197
198
199
200 # 4. finally show the image
201 cv2.rectangle(img, (WINSIZE - 120, 0), (WINSIZE - 1, 20),
202 (255, 255, 255), -1)
203 font = cv2.FONT_HERSHEY_SIMPLEX
204 cv2.putText(img,
205 "epoch #" + str(epoch_nr).zfill(3),
206 (WINSIZE - 110, 15), font, 0.5, (0, 0, 0), 1,
207 cv2.LINE_AA)
208 cv2.imshow(’Decision boundaries of trained MLP’, img)
209 c = cv2.waitKey(1)
210
211 # 5. save that image?
212 if False:
213 filename = "V:/tmp/img_{0:0>4}".format(image_counter)
214 image_counter +=1
215 cv2.imwrite(filename + ".png", img)
216
217
218 def build_TF_graph():
219
220 # 1. prepare placeholders for the input and output values
221
222 # the input is a 2D matrix:
223 # in each row we store one input vector
224 x_in = tf.placeholder("float")
225
226 # the output is a 2D matrix:
227 # in each row we store one output vector
228 y_out = tf.placeholder("float")
229
230 # 2. now use the helper function defined before to
231 # generate a MLP
232 mlp_output_vec = multilayer_perceptron(x_in, weights, biases)
233
234 # 3. define a loss function
283
284 # a) generate list of indices
285 sample_indices = np.arange(0, NR_SAMPLES)
286
287 # b) shuffle the indices list (in place; np.random.shuffle returns None)
288 np.random.shuffle(sample_indices)
289
290 # c) now prepare a matrix
291 # with one sample input vector in each row and
292 # another matrix with the corresponding desired
293 # output vector in each row
294 input_matrix =\
295 np.zeros((MINI_BATCH_SIZE, NR_NEURONS_INPUT))
296 output_matrix =\
297 np.zeros((MINI_BATCH_SIZE, NR_NEURONS_OUTPUT))
298 startpos = mini_batch_nr * MINI_BATCH_SIZE
299 row_counter = 0
300 for next_sample_id in \
301 range(startpos, startpos + MINI_BATCH_SIZE):
302 # get next training sample from dataset class
303 # the dataset is a list of lists
304 # in each list entry there are two vectors:
305 # the input vector and the output vector
306 next_sample = data_samples[next_sample_id]
307
308 # get input and output vector
309 # (which are both NumPy arrays)
310 input_vec = next_sample[0]
311 output_vec = next_sample[1]
312
313 # copy input vector to respective
314 # row in input matrix
315 input_matrix[row_counter, :] = input_vec
316
317 # copy output vector to the respective
318 # row in output matrix
319 output_matrix[row_counter, :] = output_vec
320
321 row_counter += 1
322
323 # d) run the optimizer node --> training will happen
324 # now the actual feed-forward step and the
325 # computations will happen!
326 _, curr_loss = my_session.run(
327 [optimizer, loss],
328 feed_dict={x_in: input_matrix,
329 y_out: output_matrix})
330
331 # print("current loss for mini-batch=", curr_loss)
332
9
Convolutional Neural Networks
9.1 Introduction
We are now ready to talk about the most important Deep Learning model: the Con-
volutional Neural Network (CNN).
Some recent history: AlexNet. The breakthrough of the CNN model started with a paper in 2012. In their paper "ImageNet Classification with Deep Convolutional Neural Networks" [24], Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton used a CNN and reached a performance never seen before on the ImageNet Large Scale Visual Recognition Challenge of the year 2012 (ILSVRC 2012). They trained their 8 layer model, which has 60 million parameters and consists of 650,000 neurons, on two GPUs, which took six days for the 1.2 million training images of the competition. The test data set consisted of 150,000 photographs which had to be classified into one of 1000 possible classes. The authors achieved a top-5 error rate of 15.3%, compared to an error rate of 26.2% for a "classical approach" which used SIFT features described by Fisher Vectors (FV). For this reason, the new model was considered a quantum leap for image classification tasks, and the model is nowadays simply called "AlexNet".
Structure of a CNN: the input side. Let's walk through a CNN in order to get to know its structure (see Fig. 9.1). On the input side there is the input image. If it is a color image, it is typically represented as a 3 channel matrix; in TensorFlow's terminology it is just a 3D tensor of dimension width × height × nr-of-channels. A gray-value image would be a 3D tensor as well, with dimension width × height × 1.
Convolutional Layers. In section 4.5 we have argued that neurons can be considered as filters, since the activity of a neuron is just the inner product of the input vector and its weights. In the context of CNNs we can therefore talk about neurons or directly call them filters. In a convolutional layer a lot of these filters are used. But compared to a MLP there are two important differences. First, the filters do not get their input from all the output values of the previous layer, but only from a small region, which is called the receptive field (RF). This idea was stolen from the biological role model! It is also important to know that in a convolutional layer many different filters are used for the same receptive field; in the illustration, e.g., 16 different filters are used for each RF. Second, the filters within one depth slice, also called a feature map, share their weights! And this idea is of high importance.
The concept of weight sharing. The reason for introducing weight sharing is that for learning local feature detectors it is not important where the features are in the image. We want to be able to detect vertical lines in the bottom right corner even if we have seen such an image structure only in the top left corner during training. So weight sharing reduces the amount of training data we need to see: we do not need a training data set where each image structure is shown in each receptive field. And this in turn reduces training time. Another important reason is that weight sharing dramatically reduces the number of parameters of the model (= of the CNN) that we have to adapt.
How RFs and weight sharing help: A sample calculation. Imagine we used a fully connected approach without weight sharing. Each of the 200x200x16 = 640,000 neurons in the first conv layer in Fig. 9.1 would get input from 200x200x3 = 120,000 input values, i.e., there would be 76,800,000,000 weights already in the first layer! Now compare this to the approach of using RFs and weight sharing. There are 16 filters in the first conv layer. Each filter gets its input from a small RF of size 5x5x3 = 75. So there are only 75*16 = 1200 weights.
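The same calculation as a small Python sketch (an illustration only; bias parameters are ignored, as in the text):

# fully connected, no weight sharing:
neurons_conv1 = 200 * 200 * 16            # 640,000 neurons in the first conv layer
inputs_per_neuron = 200 * 200 * 3         # 120,000 input values for each of them
print(neurons_conv1 * inputs_per_neuron)  # 76,800,000,000 weights

# small receptive fields of size 5x5x3 plus weight sharing:
weights_per_filter = 5 * 5 * 3            # 75 weights per filter
nr_filters = 16
print(weights_per_filter * nr_filters)    # 1200 weights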
Intuition: What does a conv layer do? As we said in section 4.2, neurons cannot only be considered as filters, but can also be regarded as feature detectors. So the neurons in a conv layer analyze the spatial input in a small region of the image, and their activation value (= filter response value) represents a measure of similarity between the feature the neuron represents and the input values that are currently present.
Figure 9.1: Structure of a Convolutional Neural Network (CNN). The CNN takes an input tensor, e.g., an input image of dimension 200x200 with 3 color channels, and detects local features using filters in convolutional layers (CONV). The ReLU layer maps the filter responses (neuron activities) non-linearly to neuron output values. Pooling layers are inserted into the processing pipeline in order to reduce the dimensionality of the tensors and help to detect patterns invariant to translations and rotations. Finally a classifier, e.g., a MLP, is used to map high-level features to output values.
ReLU layer. In section 7.10 we have seen that a MLP with just linear transfer functions in all hidden layers cannot be used to classify non-linearly separable datasets. The same problem would occur in a CNN. For this reason it is important to use a non-linear transfer function in a CNN as well. This is the task of the next layer, the ReLU layer. Very good results have been reported for CNNs with the ReLU transfer function, and since it is also fast to compute, the ReLU is most often used as transfer function for the neuron activities computed in the conv layer. Note that often the (ReLU) transfer function computation is considered as a part of the conv layer as well. Also note that the ReLU layer does not change the dimensions of the input tensor: if the input tensor has dimension W × H × D, then the output tensor produced by the ReLU layer has the same dimensions.
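A small sketch (not from the original text) that confirms this shape-preserving behavior of tf.nn.relu:

import numpy as np
import tensorflow as tf

# a random activation tensor of shape (batch, W, H, D) = (1, 8, 8, 16)
acts = np.random.randn(1, 8, 8, 16).astype(np.float32)

relu_node = tf.nn.relu(acts)              # element-wise max(0, x)
with tf.Session() as my_session:
    out = my_session.run(relu_node)

print(acts.shape, out.shape)              # (1, 8, 8, 16) (1, 8, 8, 16)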
Pooling layers. While the ReLU layers do not change the dimensions of a tensor, the task of the pooling layer is to reduce its spatial dimensions. An operation called max pooling is applied to small receptive fields (normally of size 2x2) on every depth slice of the input tensor. This does not only reduce compute time, since the input tensor for the next conv layers is smaller, but also allows the CNN to analyze larger and larger image areas while the RF size in the conv layers is kept constant. The effect of the max pooling operation is depicted in the top right corner of Fig. 9.1. Since a single max pooling operation means that we do not care where a certain strong feature response was within a 2x2 area, and since normally several pooling layers are inserted from time to time within the processing pipeline of a CNN, we achieve a pattern recognition that is to some extent invariant to translations and rotations.
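A minimal sketch (assuming 2x2 pooling windows with stride 2 and "VALID" padding) of how max pooling halves the spatial dimensions of a tensor in TensorFlow:

import numpy as np
import tensorflow as tf

# input tensor of shape (batch, height, width, depth) = (1, 8, 8, 16)
x = np.random.randn(1, 8, 8, 16).astype(np.float32)

pool_node = tf.nn.max_pool(x, ksize=[1, 2, 2, 1],
                           strides=[1, 2, 2, 1], padding='VALID')
with tf.Session() as my_session:
    out = my_session.run(pool_node)

print(out.shape)    # (1, 4, 4, 16): spatial dimensions halved, depth unchanged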
Feature hierarchy vs. classifier. The sequence of conv, ReLU and pool layers establishes a feature hierarchy. At the end of the feature hierarchy a classifier can be used to exploit the nice high-level features, which are invariant to many variances present in the input data. Note that we are not forced to use a MLP as classifier here. And indeed, other classifiers have been used successfully in combination with the feature hierarchy, e.g., Support Vector Machines (SVM) [41].
Figure 9.2: How a CNN realizes rotational invariance. Imagine the strongest feature
response in a 4x4 area is A (and F>E). Then it does not matter where the A is on
the shaded fields. After two max pooling steps it will always be chosen as the final
value. In this example we assume that all values depicted in the 2x2 areas are larger
than the values not depicted in each 2x2 area (in the white non-shaded fields). Note,
that A will always be chosen as the final value, even if it stood in one of the white
non-shaded fields! So this 4x4 translational invariance goes hand in hand with the
rotational invariance depicted.
1975: The Cognitron model. A first step towards the CNN model was the work of the Japanese researcher Kunihiko Fukushima. Fukushima experimented with neural network models that were trained using unsupervised learning rules, i.e., the synaptic weights were changed based only on the input data. E.g., in his 1975 paper "Cognitron: A self-organizing multilayered neural network" [13] Fukushima tried out a simple idea for the adaptation of synapses between neurons:
In this study Fukushima came to the result that the neurons organize themselves such
that they will selectively respond to different input patterns after training:
However, Fukushima already understood in 1975 that another part was missing,
namely a classifier that works on top of the features learned by the cognitron:
1980: The Neocognitron model. Some years later in 1980 Fukushima published
a new model, which was called "Neocognitron" since it was a new ("neo") version of the Cognitron model. The title of the paper [14] already tells us a lot about the difference compared to the Cognitron model: the Cognitron had the ability to respond with different neurons specifically to certain input patterns, but unfortunately its response was dependent upon the location of the pattern in the input image:
”That is, the same patterns which were presented at different positions
were taken as different patterns by the conventional cognitron. In the
Neocognitron proposed here, however, the response of the network is lit-
tle affected by the position of the stimulus patterns.”
How did Fukushima achieve this translation invariant pattern recognition? His Neocog-
nitron architecture was inspired by the neuroscientific findings of Hubel and Wiesel:
Hubel and Wiesel did a famous experiment in 1959. They inserted a microelectrode
into the primary visual cortex of an anesthetized cat. Then rectangular patterns of
light were projected onto a screen in front of the cat. By recording the firing rates of
some neurons it could be observed that on the one hand some neurons only responded
when the pattern was presented at a certain angle in a certain region in the visual
field of the cat. These neurons were called ”simple cells”. On the other hand, neurons
were observed which could detect lines independently of where they were presented
in the visual field. These cells were named "complex cells". Fukushima directly built
his work on top of these findings by using a cascade of alternating layers of S- and
C-cells, where S-cells were used to detect features and C-cells were used to introduce
translation invariance. Note, that the concept of weight sharing was already used in
the Neocognitron model (!): neurons in a ”cell plane” (today: ”feature map”) shared
their weights.
1990s: The LeNet model. Fukushima's Cognitron and Neocognitron models used a feature hierarchy just as the CNN models do today, but both were trained unsupervised. However, there was also another model that established a hierarchy of features within a training phase and that was trained in a supervised fashion using Backpropagation: the LeNet model. The first model, LeNet1, was developed between 1988 and 1993 in the Adaptive Systems Research Department at Bell Labs and was published in a paper by Yann LeCun ("LeCun" spoken with a nasalized French "un") in 1989.
A demo of the LeNet1 model in action with a young Yann LeCun can be watched in
the ”Convolutional Network Demo from 1993” video at YouTube:
https://www.youtube.com/watch?v=FwFduRA_L6Q
This new model was very successful as Yann LeCun writes on YouTube:
Shortly after this demo was put together, we started working with a devel-
opment group and a product group at NCR (then a subsidiary of AT&T).
NCR soon deployed ATM machines that could read the numerical amounts
on checks, initially in Europe and then in the US. The ConvNet was run-
ning on the DSP32C card sitting in a PC inside the ATM. Later, NCR
deployed a similar system in large check reading machines that banks use
in their back offices. At some point in the late 90’s these machines were
processing 10 to 20% of all the checks in the US.
Later, in 1998, LeNet5 was published in another paper by LeCun et al. [28]: A 7-layer
Convolutional Neural Network that consists of convolutional and subsampling layers
(which are today called: ”pooling layers”). The authors write in their introduction:
The main message of this paper is that better pattern recognition systems
can be built by relying more on automatic learning, and less on hand-
designed heuristics. [...] Using character recognition as a case study, we
show that hand-crafted feature extraction can be advantageously replaced by
carefully designed learning machines that operate directly on pixel images.
Using document understanding as a case study, we show that the traditional
way of building recognition systems by manually integrating individually
designed modules can be replaced by a unified and well-principled design
paradigm, called Graph Transformer Networks, that allows training all the
modules to optimize a global performance criterion.
With this they underlined a very important change in the field of pattern recognition that is now really happening: the approach of hand-crafting features is vanishing more and more, while learning good features (for the classification task, or whatever the task is) from large amounts of data is now the dominating approach. And the authors postulated this in 1998, while the change only really started in 2012! Note that TensorFlow
with its computation graph approach could be called a Graph Transformer Network:
We define a model or a processing pipeline where each transformation step is trainable
in order to fulfill a global performance criterion.
• These two test images are then put into one 4D tensor, since both the convolution and the pooling operators in TensorFlow expect 4D tensors as input! Note that each color image is already a 3D tensor, so we need one more (array-)dimension to stack the images into a mini-batch (see the compact sketch after this list).
• In the conv2d demo() function the complete mini-batch will be convolved with
a set of filters. For this, a 4D filter tensor filters has to be generated of shape
(filter-height, filter-width, filter-depth, nr-of-filters). Then a TensorFlow graph
is generated where X is a placeholder for a 4D tensor and will later be filled in
a computation session with a mini-batch. The actual convolution operator is
generated with
convop = tf.nn.conv2d(X, filters, strides=[1, 2, 2, 1], padding="SAME")  # call reconstructed here; "VALID" is the other padding option
where X is the placeholder for the input tensor, filters is the 4D tensor of
filters and strides describes the filter step size in each dimension. Here a new
filter position is set for the two 5x5x3 filters by striding two pixels to the right
or two pixels to the bottom respectively. The padding string has to be set to "SAME" or "VALID" and determines whether the border of the image shall be padded with zeros ("SAME"), so that the output tensor has the same spatial dimensions, or whether no zero padding shall be used and a smaller output tensor is acceptable ("VALID"). We will explain this later
in detail.
The actual computation of the convolutions can be started by running the con-
volution operator node convop using
convresult = my_session.run(convop,
feed_dict={X:minibatch})
Here we feed the placeholder 4D tensor X with the 4D minibatch tensor and
thereby let TensorFlow convolve each of the two images with each of the two
filters defined before in the 4D filters tensor. The resulting tensor is a 4D
tensor as well, where the first dimension addresses the test image, the second and
third dimension are the spatial dimensions and the fourth dimension is for the
filter-nr ("feature map nr"). In the example code we retrieve the resulting two feature maps for the first image, normalize the resulting values to the interval [0,1] and then display the results using OpenCV's imshow() function. Why the normalization? The reason is that OpenCV's imshow() function expects the values to be in [0,1] if the NumPy array values are float values.
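A compact sketch of the steps just described (an illustration with assumed image sizes and random filter values, not the book's original conv2d_demo() listing):

import numpy as np
import tensorflow as tf

# two color test images (here simply random data of size 100x100x3)
img1 = np.random.rand(100, 100, 3).astype(np.float32)
img2 = np.random.rand(100, 100, 3).astype(np.float32)

# stack the two 3D image tensors into one 4D mini-batch tensor
minibatch = np.stack([img1, img2])             # shape (2, 100, 100, 3)

# 4D filter tensor: (filter-height, filter-width, filter-depth, nr-of-filters)
filters = np.random.randn(5, 5, 3, 2).astype(np.float32)

X = tf.placeholder(tf.float32, shape=[None, None, None, 3])
convop = tf.nn.conv2d(X, filters, strides=[1, 2, 2, 1], padding="SAME")

with tf.Session() as my_session:
    convresult = my_session.run(convop, feed_dict={X: minibatch})

# dimensions: (image-nr, height, width, feature-map-nr)
print(convresult.shape)                        # (2, 50, 50, 2)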
Figure 9.3: Resulting feature maps after convolving a test image with two different
pre-defined filters. Note that the filter responses in the left feature map are high
(white pixels), if there is a vertical image structure while they are high in the right
feature map if there are horizontal image structures.
When generating a convolution layer, we have to specify three hyperparameters.
Figure 9.4: There are three hyperparameters to be defined for a convolution layer. 1.)
Do we want to use padding? 2.) How to stride the receptive fields? 3.) How many
feature maps do we want to learn?
1.) Padding. The first important question that we have to answer is whether we
want to process the image information on the borders of the image as well if the next
RF position does not fit completely into the image dimensions. The usual way to deal
with such a case is to fill or to ”pad” the missing entries for these border cases with
zeros. In Fig. 9.5 I have illustrated the two different options ”VALID” (do not use
zero padding) and ”SAME” (use zero padding), that can be specified when generating
a convolution operator with conv2d()
2.) Receptive field (RF) stride. We also have to define where we want to put the receptive fields in the input tensor. Some authors use overlapping receptive fields, where the filter stride is smaller than the filter width. If we set the stride equal to the filter width, we get non-overlapping receptive fields. However, we could also follow the approach of "sparse sampling", where we skip some of the input values and set the filter stride larger than the filter width. Note that we could even use different filter widths and filter strides for the two spatial dimensions. In certain special applications, where we want to encode, e.g., the horizontal information in more detail than the vertical information, it might be helpful to make use of this freedom. However, up to now I have never seen a CNN that makes use of it. What does nature do? We can find many studies which show that biological neurons often have overlapping receptive fields; e.g., retinal ganglion cells (RGCs) show overlapping receptive fields.
3.) Number of filters. Another important question is how many filters we want to use. Note that the number of filters is the same as the depth of the output tensor. Remember that for each filter the convolution layer will compute the response at each RF location. So the more filters we use, the more computation time we will need for a feedforward step, but also for the training step, since we need to adapt more filter weights.
A thought experiment: Using only one feature map. Currently there is no real
theory that could guide the choice for the number of filters. However, let us assume we
would use only 1 filter in a convolution layer and that the receptive field size is 5x5x3
(5 pixels in width, 5 pixels in height, 3 = number of color channels). There could be
a lot of different spatial colored image structures in a 5x5x3 receptive field if there are
256 different values for each of the 5x5x3 entries. But with only 1 filter we can only
analyze whether one typical colored image structure is present or not. Let’s say the
filter weights are set such that the filter "likes" vertical white structures (with "likes" I mean that it produces high filter response values for vertical white structures). In this case it produces a high output if such a vertical white structure is present in the RF, and a low output if there is no vertical structure. So the feature map encodes the image from another viewpoint: how much "verticalness" is there in each RF? But that is all. With only one filter we could not distinguish other important (colored) image structures.
The other extreme would be to use a very large number of filters, one for each possible image structure. But then the representation already changes completely if you write a "1" with a slightly rotated vertical line! For this reason it makes sense not to use such a high number of filters, but a substantially smaller number, in order to put different inputs into the same equivalence class. Only by this means does our output tensor become a new image representation that is more robust against slight changes in the input data. For these reasons people typically choose the number of filters in a convolution layer between 10 and 1000. Note that when using a small set of features we assume that the features are not "similar": it will not help us either if we use only 100 filters but all 100 filters describe minimally different variants of vertical structures. In this respect it is interesting to see that a CNN learns a set of different features in each conv layer, even though we do not directly foster this in the learning process by using Backpropagation.
The question of how many features to use is an old one. It already emerged in computer vision many decades earlier, in the era when hand-crafted feature detectors and descriptor vectors like SIFT and SURF were used. These 64- and 128-dimensional descriptor vectors, computed on the basis of local keypoints sampled from natural images, were often clustered to build up a codebook of N different feature prototypes (codewords), and during a classification step local image features found in a test image were mapped onto one of these codewords. You could then learn how many codewords of each type you find if a certain object class is present (bag of features, bag of words approach). Again the question was: how large should my codebook be, i.e., how to choose N? A small codebook maps very different image structures to the same few codewords. It is very robust against changes in the input image structures, but shows only a poor discriminative performance. A large codebook maps different image structures to mostly different codewords. It shows a high discriminative performance, but only a poor robustness against minor changes of these image structures. So choosing a "good" codebook size N was always a trade-off between keeping discriminative image structure information on the one side and establishing invariance against changes in the input image structures on the other. Similar to the question of how many feature maps to choose nowadays in convolution layers, authors therefore chose codebook sizes of some tens to hundreds of codewords as a trade-off between these two goals (invariance vs. discriminative performance).
Figure 9.5: There are two possible choices for dealing with border cases in TensorFlow.
The first one is not to use zero padding at all (”VALID”). The second one is to use
zero padding (”SAME”). In this case TensorFlow tries to pad evenly left and right.
But if the amount of columns to be added is odd, TensorFlow adds the extra column
to the right.
Computation of the spatial dimension. Have a look at Fig. 9.6. Here we assume
that our input tensor has a depth of 1 and a spatial dimension of W × W . The filter
is assumed to have size F × F and the stride is set to S. We further assume a padding
width of P at the borders of the input tensor. Then we have W + 2P possible posi-
tions in one row where we could put the origin of the filter, right? However, the last positions will not allow the filter to fit completely into the image, since the filter has a certain extension as well. For this reason there are only (W + 2P − F) + 1 positions such that the filter of size F completely fits into the image padded on both sides with P zeros. Now this computation assumes a stride of S = 1. For a larger filter stride we have to divide the number of possible filter origins by the stride size: X = (W + 2P − F)/S + 1.
How to keep the spatial dimensions of input and output tensors the ”SAME”.
Imagine we want the spatial dimensions of our output tensor after the convolution op-
eration to be the same as the spatial dimensions of the input tensor, i.e., for one dimension we want X (the spatial dimension of the output tensor) and W (the spatial dimension of the input tensor) to be equal. In the following computation we further assume that S = 1 (the usual choice):
X = W        (9.1)
⇔ (W + 2P − F)/S + 1 = W        (9.2)
⇔ W + 2P − F + 1 = W        (9.3)
⇔ 2P − F + 1 = 0        (9.4)
⇔ 2P = F − 1        (9.5)
⇔ P = (F − 1)/2        (9.6)
So we need to choose the padding size P = (F − 1)/2 in order to ensure that the output tensor has the same spatial dimension as the input tensor.
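A small helper function (a sketch, not from the book's code) that evaluates X = (W + 2P − F)/S + 1 for some typical settings:

def conv_output_size(W, F, S, P):
    """Spatial output size for input width W, filter width F, stride S, zero padding P."""
    return (W + 2 * P - F) // S + 1

# first conv layer of the cars-vs-bikes example below: 227x227 input,
# 11x11 filter, stride 4, no padding ("VALID")
print(conv_output_size(227, 11, 4, 0))            # 55

# "SAME"-style padding with S=1: choose P=(F-1)/2, the spatial size is preserved
F = 5
print(conv_output_size(200, F, 1, (F - 1) // 2))  # 200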
After all mini-batches have been trained, the final model will be saved to a file with the filename save_filename. This model can then be restored for testing (inference) using the file test_cnn.py.
The file dataset_reader.py is used for easier access to the training data. Its function nextBatch() can be called for retrieving the next mini-batch.
19 # <dataset_root>/test/cars
20 #
21 # Then let the script run.
22 # The final trained model will be saved.
23
24 from dataset_reader import dataset
25 import tensorflow as tf
26
27 # Experiments
28 # Exp-Nr Comment
29 # 01 10 hidden layers:
30 # INPUT->C1->P1->C2->P2->C3->C4->C5->P3->FC1->FC2->OUT
31 # trained 100 mini-batches of size 32
32 # feature maps: 10-15-20-25-30
33 #
34 # 02 10 hidden layers:
35 # INPUT->C1->P1->C2->P2->C3->C4->C5->P3->FC1->FC2->OUT
36 # trained 1000 mini-batches of size 32
37 # feature maps: 10-15-20-25-30
38 #
39 # 03 4 hidden layers: INPUT->C1->P1->FC1->FC2->OUT
40 # trained 1000 mini-batches of size 32
41 # feature maps: 10
42
43 exp_nr = 1
44 dataset_root =\
45 "V:/01_job/12_datasets/imagenet/cars_vs_bikes_prepared/"
46
47
48 # helper function to build 1st conv layer with filter size 11x11
49 # and stride 4 (in both directions) and no padding
50 def conv1st(name, l_input, filter, b):
51 cov = tf.nn.conv2d(l_input, filter,
52 strides=[1, 4, 4, 1], padding=’VALID’)
53 return tf.nn.relu(tf.nn.bias_add(cov, b), name=name)
54
55
56 # in all other layers we use a stride of 1 (in both directions)
57 # and a padding such that the spatial dimension (width,height)
58 # of the output volume is the same as the spatial dimension
59 # of the input volume
60 def conv2d(name, l_input, w, b):
61 cov = tf.nn.conv2d(l_input, w,
62 strides=[1, 1, 1, 1], padding=’SAME’)
63 return tf.nn.relu(tf.nn.bias_add(cov, b), name=name)
64
65 # generates a max pooling layer
66 def max_pool(name, l_input, k, s):
67 return tf.nn.max_pool(l_input,
68 ksize=[1, k, k, 1],
69 strides=[1, s, s, 1],
70 padding=’VALID’, name=name)
71
72
73 # helper function to generate a CNN
74 def build_cnn_model(_X, keep_prob, n_classes, imagesize, img_channel):
75 # prepare matrices for weights
76 _weights = {
77 ’wc1’: tf.Variable(tf.random_normal([11, 11, img_channel, 10])),
78 ’wc2’: tf.Variable(tf.random_normal([5, 5, 10, 15])),
79 ’wc3’: tf.Variable(tf.random_normal([3, 3, 15, 20])),
80 ’wc4’: tf.Variable(tf.random_normal([3, 3, 20, 25])),
81 ’wc5’: tf.Variable(tf.random_normal([3, 3, 25, 30])),
82 ’wd1’: tf.Variable(tf.random_normal([6 * 6 * 30, 40])),
83 ’wd2’: tf.Variable(tf.random_normal([40, 40])),
84 ’out’: tf.Variable(tf.random_normal([40, n_classes])),
85 ’exp3_wd1’: tf.Variable(tf.random_normal([27 * 27 * 10, 40]))
86 }
87
88 # prepare vectors for biases
89 _biases = {
90 ’bc1’: tf.Variable(tf.random_normal([10])),
91 ’bc2’: tf.Variable(tf.random_normal([15])),
92 ’bc3’: tf.Variable(tf.random_normal([20])),
93 ’bc4’: tf.Variable(tf.random_normal([25])),
94 ’bc5’: tf.Variable(tf.random_normal([30])),
95 ’bd1’: tf.Variable(tf.random_normal([40])),
96 ’bd2’: tf.Variable(tf.random_normal([40])),
97 ’out’: tf.Variable(tf.random_normal([n_classes]))
98 }
99
100 # reshape input picture
101 _X = tf.reshape(_X, shape=[-1, imagesize, imagesize, img_channel])
102
103 if (exp_nr == 1 or exp_nr==2):
104
105 # feature hierarchy:
106 # topology:
107 # INPUT->C1->P1->C2->P2->C3->C4->C5->P3->FC1->FC2->OUT
108 conv1 =\
109 conv1st(’conv1’, _X, _weights[’wc1’], _biases[’bc1’])
110 pool1 =\
111 max_pool(’pool1’, conv1, k=3, s=2)
112 conv2 =\
113 conv2d(’conv2’, pool1, _weights[’wc2’], _biases[’bc2’])
114 pool2 =\
115 max_pool(’pool2’, conv2, k=3, s=2)
116 conv3 =\
117 conv2d(’conv3’, pool2, _weights[’wc3’], _biases[’bc3’])
118 conv4 =\
169
170 out =\
171 tf.matmul(dense2, _weights[’out’]) + _biases[’out’]
172 print("out shape: ", out.get_shape())
173
174 return [out, _weights[’wc1’]]
175
176
177 # 1. create a training and testing Dataset object that stores
178 # the training / testing images
179 training = dataset(dataset_root + "train", ".jpeg")
180 testing = dataset(dataset_root + "validation", ".jpeg")
181
182 # 2. set training parameters
183 learn_rate = 0.001
184 batch_size = 32
185 display_step = 1
186 if exp_nr==1:
187 nr_mini_batches_to_train = 100
188 elif exp_nr==2:
189 nr_mini_batches_to_train = 1000
190 elif exp_nr==3:
191 nr_mini_batches_to_train = 1000
192
193 save_filename = ’save/model.ckpt’
194 logs_path = ’./logfiles’
195
196 n_classes = training.num_labels
197 dropout = 0.8 # dropout (probability to keep units)
198 imagesize = 227
199 img_channel = 3
200
201 x = tf.placeholder(tf.float32, [None,
202 imagesize,
203 imagesize,
204 img_channel])
205 y = tf.placeholder(tf.float32, [None, n_classes])
206 keep_prob = tf.placeholder(tf.float32) # dropout (keep probability)
207
208 [pred, filter1st] = build_cnn_model(x,
209 keep_prob,
210 n_classes,
211 imagesize,
212 img_channel)
213 cost = tf.reduce_mean(
214 tf.nn.softmax_cross_entropy_with_logits(logits=pred,
215 labels=y))
216 # cost = tf.reduce_mean(tf.squared_difference(pred, y))
217
218 global_step = tf.Variable(0, trainable=False)
219
220 optimizer = tf.train.AdamOptimizer(
221 learning_rate=learn_rate).minimize(
222 cost, global_step=global_step)
223 # optimizer = tf.train.GradientDescentOptimizer(lr).
224 # minimize(cost, global_step=global_step)
225
226 correct_pred = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
227 accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
228
229 saver = tf.train.Saver()
230 tf.add_to_collection("x", x)
231 tf.add_to_collection("y", y)
232 tf.add_to_collection("keep_prob", keep_prob)
233 tf.add_to_collection("pred", pred)
234 tf.add_to_collection("accuracy", accuracy)
235
236 print("\n\n")
237 print("----------------------------------------")
238 print("I am ready to start the training...")
239 print("So I will train a CNN, starting with a"+
240 "learn rate of", learn_rate)
241 print("I will train ", nr_mini_batches_to_train,
242 "mini batches of ", batch_size, "images")
243 print("Your input images will be resized to ",
244 imagesize, "x", imagesize, "pixels")
245 print("----------------------------------------")
246
247 with tf.Session() as my_session:
248 my_session.run(tf.global_variables_initializer())
249
250 step = 1
251 while step < nr_mini_batches_to_train:
252
253 batch_ys, batch_xs = training.nextBatch(batch_size)
254 # note: batch_ys and batch_xs are tuples each
255 # batch_ys a tuple of e.g. 32 one-hot NumPy arrays
256 # batch_xs a tuple of e.g. 32 NumPy arrays of shape
257 # (width, height, 3)
258
259
260 _ = my_session.run([optimizer],
261 feed_dict={x: batch_xs,
262 y: batch_ys,
263 keep_prob: dropout})
264
265 if step % display_step == 0:
266 acc = my_session.run(accuracy,
267 feed_dict={x: batch_xs,
268 y: batch_ys,
31
32 sess = tf.Session()
33 saver.restore(sess, ckpt.model_checkpoint_path)
34
35 # test
36 step_test = 0
37 correct=0
38 while step_test * batch_size < len(testing):
39
40 # testing_ys and testing_xs are tuples
41 testing_ys, testing_xs = testing.nextBatch(batch_size)
42
43 # get first image and first ground truth vector
44 # from the tuples
45 first_img = testing_xs[0]
46 first_groundtruth_vec = testing_ys[0]
47
48 # at first iteration:
49 # show shape of image and ground truth vector
50 if step_test == 0:
51 print("Shape of testing_xs is :",
52 first_img.shape)
53 print("Shape of testing_ys is :",
54 first_groundtruth_vec.shape)
55
56 # given the input image,
57 # let the CNN predict the category!
58 predict = sess.run(pred,
59 feed_dict={x: testing_xs,
60 keep_prob: 1.})
61
62 # get ground truth label and
63 # predicted label from output vector
64 groundtruth_label = np.argmax(first_groundtruth_vec)
65 predicted_label = np.argmax(predict, 1)[0]
66
67 print("\nImage test: ", step_test)
68 print("Ground truth label :",
69 testing.label2category[groundtruth_label])
70 print("Label predicted by CNN:",
71 testing.label2category[predicted_label])
72
73 if predicted_label == groundtruth_label:
74 correct+=1
75 step_test += 1
76
77 print("\n---")
78 print("Classified", correct, "images correct of",
79 step_test,"in total.")
80 print("Classification rate in" +
81 "percent = {0:.2f}".format(correct/step_test*100.0))
97 self.label2category =\
98 {l: c for c, l in self.category2label.items()}
99
100 # 9. prepare list of ground truth labels
101 # where we can find the ground truth label for
102 # image i at the i-th position in the list
103 self.labels = [self.category2label[l] for l in self.labels]
104
105
106 ’’’
107 Returns the number of images
108 available by this dataset object
109 ’’’
110 def __len__(self):
111 return self.num_images
112
113 ’’’
114 Returns a onehot NumPy array,
115 where all entries are set to 0
116 but to 1 for the right category
117 ’’’
118 def onehot(self, label):
119 v = np.zeros(self.num_labels)
120 v[label] = 1
121 return v
122
123
124 ’’’
125 Are there further images available?
126 ’’’
127 def hasNextRecord(self):
128 return self.next_image_nr < self.num_images
129
130
131 ’’’
132 Resizes the specified OpenCV image to a fixed size
133 Converts it to a NumPy array
134 Converts the values from [0,255] to [0,1]
135 ’’’
136 def preprocess(self, img):
137
138 # preprocess image by resizing it to 227x227
139 pp = cv2.resize(img, (RESIZE_WIDTH, RESIZE_HEIGHT))
140
141 # and convert OpenCV representation to Numpy array
142 # note: asarray does not copy data!
143 # see
144 pp = np.asarray(pp, dtype=np.float32)
145
146 # map values from [0,255] to [0,1]
147 pp /= 255
148
149 # prepare array of shape width x height x 3 array
150 pp = pp.reshape((pp.shape[0], pp.shape[1], 3))
151 return pp
152
153
154 ’’’
155 Returns a (label, image) tuple
156 where label is a one-hot teacher vector (list)
157 e.g. [0,1] if there are two categories
158 and
159 image is a NumPy array
160 of shape (width, height, 3)
161 ’’’
162 def get_next_record(self):
163
164 # will return the next training pair
165 # consisting of the input image and a
166 # one-hot/teacher label vector
167 if not self.hasNextRecord():
168
169 # Ups! We are at the end of the image list!
170
171 # So generate new random order of images
172
173 # randomly shuffle the data again
174 np.random.shuffle(self.data)
175 self.next_image_nr = 0
176 self.labels, self.filenames = zip(*self.data)
177 category = np.unique(self.labels)
178 self.num_labels = len(category)
179 self.category2label =\
180 dict(zip(category, range(len(category))))
181 self.label2category =\
182 {l: c for c,
183 l in self.category2label.items()}
184
185 # prepare ground-truth label information for all images
186 # according to the newly shuffled order of the images
187 self.labels = [self.category2label[l] for l in self.labels]
188
189 # prepare one-hot teacher vector for the output neurons
190 label = self.onehot(self.labels[self.next_image_nr])
191
192 # read in the image using OpenCVs imread()
193 # function and then preprocess it
194 # (i.e., resize it, convert it to a NumPy array,
195 # convert values from [0,255] to [0,1])
196 img_filename = self.filenames[self.next_image_nr]
247 if category:
248 walkPath = os.path.join(imagePath, category)
249 else:
250 walkPath = imagePath
251 category = os.path.split(imagePath)[1]
252
253 # create a generator
254 w = _walk(walkPath)
255
256 # step through all directories and subdirectories
257 while True:
258
259 # get names of dirs and filenames of current dir
260 try:
261 dirpath, dirnames, filenames = next(w)
262 except StopIteration:
263 break
264
265 # don’t enter directories that begin with ’.’
266 for d in dirnames[:]:
267 if d.startswith(’.’):
268 dirnames.remove(d)
269
270 dirnames.sort()
271
272 # ignore files that begin with ’.’
273 filenames =\
274 [f for f in filenames if not f.startswith(’.’)]
275 # only load images with the right extension
276 filenames =\
277 [f for f in filenames
278 if os.path.splitext(f)[1].lower()
279 in extensions]
280 filenames.sort()
281
282 for f in filenames:
283 labels_and_filenames.append(
284 [category, os.path.join(dirpath, f)])
285
286 # labels_and_filenames will be a list of
287 # two-tuples [category, filename]
288 return labels_and_filenames
289
290
291 def _walk(top):
292 """
293 This is a (recursive) directory tree generator.
294 What is a generator?
295 See:
296 http://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do
297 In short:
298 - generators are iterables that can be iterated only once
299 - their values are not stored in contrast e.g. to a list
300 - ’yield’ is ’like’ return
301 """
302
303 # 1. collect directory names in dirs and
304 # non-directory names (filenames) in nondirs
305 names = os.listdir(top)
306 dirs, nondirs = [], []
307 for name in names:
308 if os.path.isdir(os.path.join(top, name)):
309 dirs.append(name)
310 else:
311 nondirs.append(name)
312
313 # 2. "return" information about directory names and filenames
314 yield top, dirs, nondirs
315
316 # 3. recursively process each directory found in current top
317 # directory
318 for name in dirs:
319 path = os.path.join(top, name)
320 for x in _walk(path):
321 yield x
Figure 9.6: How to compute the spatial dimension X of an output tensor that results
from a convolution operation.
10
Deep Learning Tricks
The weight changes

∆wkj = −α ∂E/∂wkj        (10.1)
typically get smaller and smaller as we go from the higher layers of a deep neural network down to the lower layers. This is called the vanishing gradient problem. The consequence is that, although the Backpropagation algorithm propagates the error signals down to the lower layers, the weights in the lower layers of a deep neural network remain nearly unchanged.
Reason for vanishing gradients. The reason for the vanishing gradient problem can be traced back to the use of squashing activation functions such as the logistic and the tanh function: they map the whole real axis onto a small output interval, so for inputs of large magnitude their derivative is close to zero.
And the problem becomes even worse if we use several layers of neurons with squashing transfer functions. The first layer in such a network will map a large region of the input space to a small interval, the second layer will map the resulting output intervals to even smaller regions, and so on. As a result, a change in the weights of the first layer will not change the final output much, or in other words: the gradient of the error with respect to a weight from one of the lower layers is small.
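A tiny numerical sketch (not from the book) of this effect: the derivative of the logistic function is at most 0.25, so the factor it contributes per layer shrinks the gradient quickly, while the ReLU contributes a factor of 1 for positive inputs:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_deriv(0.0))   # 0.25, the best case for the logistic function
print(0.25 ** 5)            # ~0.00098 after passing through 5 such layers
print(1.0 ** 5)             # 1.0: the ReLU derivative (for x > 0) does not shrink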
Avoiding vanishing gradients. So how can we tackle this problem? We can simply
avoid using squashing functions which show this saturating behavior to the left and
right side of the input space! One popular example is the ReLU activation function
which maps an input value x to f (x) = max(0, x).
Seeing the vanishing gradients. You will find it easier to believe the above argument about vanishing gradients once you have seen them. How can we visualize them? We can compute the average gradient length or average weight change per neuron layer (remember: the weight change for a weight w is the scaled negative gradient of the error function differentiated with respect to this weight w) in a deep MLP and display these average weight change values. This is exactly what you shall do in exercise 12 (a solution can be found in the GitHub repository). Here are the average weight change values of a 2-5-5-5-5-5-2 MLP, i.e., a deep neural network with 2 input neurons, 2 output neurons and 5 hidden layers. If we use the logistic transfer function for the neurons in the hidden layers
1 my_mlp.add_layer(2, TF.identity)
2 my_mlp.add_layer(5, TF.sigmoid)
3 my_mlp.add_layer(5, TF.sigmoid)
4 my_mlp.add_layer(5, TF.sigmoid)
5 my_mlp.add_layer(5, TF.sigmoid)
6 my_mlp.add_layer(5, TF.sigmoid)
7 my_mlp.add_layer(2, TF.identity)
we get these average weight changes for the weights into layers 1-6:
1 Layer # 1 : 0.000016467941743
2 Layer # 2 : 0.000069949781101
3 Layer # 3 : 0.000252670886581
4 Layer # 4 : 0.000760924226725
5 Layer # 5 : 0.002352968443448
6 Layer # 6 : 0.015992000875581
And if we use the ReLU transfer function for the neurons in the hidden layers
1 my_mlp.add_layer(2, TF.identity)
2 my_mlp.add_layer(5, TF.relu)
3 my_mlp.add_layer(5, TF.relu)
4 my_mlp.add_layer(5, TF.relu)
5 my_mlp.add_layer(5, TF.relu)
6 my_mlp.add_layer(5, TF.relu)
7 my_mlp.add_layer(2, TF.identity)
we get these average weight changes for the weights into layers 1-6:
1 Layer # 1 : 0.000073625247850
2 Layer # 2 : 0.000064233540185
3 Layer # 3 : 0.000047251698986
4 Layer # 4 : 0.000073345206668
5 Layer # 5 : 0.000066667856138
6 Layer # 6 : 0.008097748184853
Can you see the systematics? In the first case (where we use a squashing function) the weight changes become smaller and smaller as we go towards the lower layers, due to the fact that the gradients get smaller and smaller. In the second case (where we use the ReLU) no such systematic decrease can be detected.
Plateaus are rather flat regions of the error surface with only a small slope in the direction of the next local minimum. Standard Backpropagation with a fixed small learning rate will do a lot of small steps on such plateaus.
∆wkj = −α ∂E(w)/∂wkj + β ∆wkj        (10.2)
     = −α δj yk + β ∆wkj        (10.3)
So at each weight update step we do not only use the local gradient information ∂E/∂wkj, but also make use of older gradient information that has been integrated into ∆wkj in earlier steps. E(w) is the error function evaluated for the current weight vector w, which is a vector of all the weights in the neural network.
v ← β v + α ∇E(w)        (10.4)
w ← w − v        (10.5)
where you can think of v as a velocity vector which is subtracted from the current weight vector w in each update step.
Resulting behavior. Now what will happen if we use Backpropagation with a momentum term on a plateau, i.e., using the new weight change update formula 10.2? On a plateau with a slight slope the local gradients will point in the same direction, so when applying the update formula several times we sum up gradients pointing in the same direction. This will make the weight change value ∆wkj larger and larger. Or in other words: we speed up our steps on the error surface at locations where the gradients consistently point in the same direction.
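A minimal NumPy sketch (an illustration, not the book's code) of the update rule in equations 10.4 and 10.5, showing how the step size grows on a plateau with a roughly constant gradient:

import numpy as np

def momentum_step(w, v, grad, alpha=0.01, beta=0.9):
    """One momentum update: v <- beta*v + alpha*grad, w <- w - v."""
    v = beta * v + alpha * grad
    w = w - v
    return w, v

w = np.array([1.0, -2.0])
v = np.zeros_like(w)
grad = np.array([0.1, 0.1])    # on a plateau the gradient stays roughly constant

for step in range(5):
    w, v = momentum_step(w, v, grad)
    print(step, v)             # the step vector v grows from update to update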
Figure 10.1: Backpropagation with a momentum term can overcome plateaus much faster than standard Backpropagation. Imagine also a perfect plateau: how could Backpropagation with a momentum term be superior to standard Backpropagation here? Having still some momentum from previous weight change updates, it can run over the plateau even if the plateau is perfect (i.e., without any slope), while standard Backpropagation would not be able to overcome it since the gradient is zero on a perfect plateau!
10.4 AdaGrad
Idea 1: A learning rate per weight. All three optimizers we know so far (standard
gradient descent, momentum optimizer, Nesterov momentum optimizer) use the same
learning rate α for all weights in the weight vector w. However, probably not all the
weights in a network will have the same impact onto the output values. There will be
typically some weights in the network that will have a larger impact onto the output values than others, in the sense that if you turn them up or down the output values will change much faster than for other weights. Perhaps it makes sense to use a
smaller learning rate for such weights that have a high impact onto the output values
(and in consequence also onto the errors the network makes). The current impact of
a weight onto the error, i.e., how fast the error changes if we turn the weight up or
down, is represented by the gradient of the error evaluated at the current location in
weight space.
Idea 2: Decrease learning rates during learning. And perhaps it makes sense
first to make large steps on the error surface and then as learning proceeds to make
smaller and smaller steps on the error surface. The idea behind this is to roughly
adjust the weights at the start and to do the fine-tuning at the end of the training.
The formulas. These two ideas are implemented by the AdaGrad algorithm. Instead of using an explicit learning rate for each weight which is decreased over time, the algorithm iteratively updates a sum vector s of the squared gradients ∇E(w) ⊗ ∇E(w) (where ⊗ means the element-wise multiplication of two vectors) and scales the current gradient vector by dividing it by the root of this sum, √(s + ε). A small value ε (e.g. ε = 10⁻⁸) is used to make sure that the square root argument is different from zero, i.e., always valid:

s ← s + ∇E(w) ⊗ ∇E(w)
w ← w − α ∇E(w) ⊘ √(s + ε)

Here ⊘ means the element-wise division of two vectors. Since the gradients are scaled, the algorithm is called Adaptive Gradients, or in short AdaGrad. The algorithm was introduced in 2011 in a paper by John Duchi et al. [9]. However, it is not the easiest paper to read, due to the mathematical derivation of the algorithm presented in the paper.
The vector s keeps growing during training, since it accumulates all historical gradients. An advantage of this method is that the algorithm is not very sensitive to the overall learning rate α; typically it is set to a value of α = 0.01. However, a disadvantage of the method is that the learning rates are often decreased too quickly, resulting in a stop of the learning before the local minimum has been reached.
In TensorFlow an AdaGrad optimizer can be generated using the following code line:
1 optimizer = tf.train.AdagradOptimizer(learning_rate=0.01)
10.5 RMSProp
Similar to AdaGrad, but avoiding a major problem of it. RMSProp or Root Mean Square Propagation is an algorithm by Tieleman and Hinton from 2012 which avoids the disadvantage of AdaGrad by accumulating only the gradients from the last steps. This avoids that the vector s becomes larger and larger, that the gradient ∇E(w) is scaled by a larger and larger denominator √(s + ε), and that the training therefore stops because the weights are effectively not changed any more. RMSProp achieves this by multiplying the vector s, i.e., the sum of gradients accumulated so far, with a factor β (typically set to 0.9) in each step. Doing this iteratively results in an exponential decay of the sum of accumulated gradients. By contrast, the current squared gradient ∇E(w) ⊗ ∇E(w) is multiplied with 1 − β, resulting in slightly different weight update formulas compared to AdaGrad:

s ← β s + (1 − β) ∇E(w) ⊗ ∇E(w)
w ← w − α ∇E(w) ⊘ √(s + ε)
The pattern where we compute the new value of s by multiplying the old value with
β and adding a new value multiplied with 1 − β is a special case of a moving average
which is called an exponential moving average (EMA) and sometimes also called an
exponentially weighted moving average (EWMA).
In TensorFlow a RMSProp optimizer can be generated using the following code line:
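1 optimizer = tf.train.RMSPropOptimizer(learning_rate=0.01, decay=0.9)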
10.6 Adam
Combining RMSProp and Momentum optimization. Adam or Adaptive Mo-
ment Estimation was published in 2015 by Kingma and Ba [21]. It can be seen as a
combination of both RMSProp and the idea of momentum optimization. Have a look
at the formulas:

v ← β₁ v + (1 − β₁) ∇E(w)                         (10.12)
s ← β₂ s + (1 − β₂) ∇E(w) ⊗ ∇E(w)                 (10.13)
v̂ = v / (1 − β₁^t)                                (10.14)
ŝ = s / (1 − β₂^t)                                (10.15)
w ← w − α v̂ ⊘ √(ŝ + ε)                            (10.16)
Here t is the update step number. Equation 10.12 reminds us of momentum opti-
mization, while equation 10.13 is stolen from RMSProp. So Adam does both: it
accumulates the last gradients by computing an EWMA of the past gradients (similar
to momentum optimization) and it accumulates the squared gradients by computing
an EWMA of the squared past gradients, as RMSProp does.
Adam: A method that uses first and second moments of gradients. With
the help of the first equation we estimate the first moment (the mean) and with the
help of the second equation we estimate the second moment (the uncentered variance)
of the gradients respectively.
Bias correction. But what are equations 10.14 and 10.15 good for? Normally, v
and s are initialized with zero vectors 0. But thereby they will each be biased towards
zero, especially during the initial time steps. In order to tackle this problem,
these two equations correct for the biases. In their paper [21] (see page 2) the authors
propose to set β₁ = 0.9, β₂ = 0.999 and ε = 10⁻⁸.
Final weight update rule for Adam. Equation 10.16 finally tells us how to update
the weights according to Adam: we do a (learning rate scaled) update step into the
direction of the EWMA mean of the past gradients, which is further divided element-wise by
the square root of the EWMA of the squared past gradients.
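As for the other optimizers, a minimal NumPy sketch of one Adam step may help to see how equations 10.12-10.16 work together; the function and variable names are my own placeholders, the default values follow the recommendations cited above:

import numpy as np

def adam_step(w, grad, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # EWMA of the past gradients (first moment), eq. 10.12
    v = beta1 * v + (1.0 - beta1) * grad
    # EWMA of the squared past gradients (second moment), eq. 10.13
    s = beta2 * s + (1.0 - beta2) * grad * grad
    # bias corrections, eqs. 10.14 and 10.15 (t is the update step, starting at 1)
    v_hat = v / (1.0 - beta1 ** t)
    s_hat = s / (1.0 - beta2 ** t)
    # final weight update, eq. 10.16
    w = w - alpha * v_hat / np.sqrt(s_hat + eps)
    return w, v, s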
In TensorFlow an Adam optimizer can be generated using the following code line:
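1 optimizer = tf.train.AdamOptimizer(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=10e-8)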
For a fair comparison of the different optimizers, we need the MLP to start with the same random
start weights in each experiment. This can be achieved in TensorFlow using the following command:
1 tf.set_random_seed(12345)
Place this command before you generate the variables for the MLP. It initializes the
pseudo-random number generator with the same seed for each new start of the Python
script and will allow us to start each experiment with the same weights and bias values:
1 weights = {
2 ’h1’: tf.Variable(tf.random_normal(
3 [NR_NEURONS_INPUT, NR_NEURONS_HIDDEN1])),
4 ’h2’: tf.Variable(tf.random_normal(
5 [NR_NEURONS_HIDDEN1, NR_NEURONS_HIDDEN2])),
6 ’out’: tf.Variable(tf.random_normal(
7 [NR_NEURONS_HIDDEN2, NR_NEURONS_OUTPUT]))
8 }
9 biases = {
10 ’b1’: tf.Variable(tf.random_normal(
11 [NR_NEURONS_HIDDEN1])),
12 ’b2’: tf.Variable(tf.random_normal(
13 [NR_NEURONS_HIDDEN2])),
14 ’out’: tf.Variable(tf.random_normal(
15 [NR_NEURONS_OUTPUT]))
16 }
1 use_opt_nr = 5
2
3 if use_opt_nr==1:
4 optimizer =\
5 tf.train.GradientDescentOptimizer(LEARN_RATE)
6 elif use_opt_nr==2:
7 optimizer = \
8 tf.train.MomentumOptimizer(learning_rate=LEARN_RATE,
9 momentum=0.9)
10 elif use_opt_nr==3:
11 optimizer = \
12 tf.train.AdagradOptimizer(learning_rate=0.01)
13 elif use_opt_nr==4:
14 optimizer = \
15 tf.train.RMSPropOptimizer(learning_rate=0.01,
16 decay=0.9)
17 elif use_opt_nr==5:
18 optimizer = \
19 tf.train.AdamOptimizer(learning_rate=0.001,
20 beta1=0.9,
21 beta2=0.999,
22 epsilon=10e-8)
23
24 optimizer = optimizer.minimize(loss)
In the code you can set the variable use_opt_nr to 1-5 in order to compare the
different optimizers. The result of this comparison can be seen in Fig. 10.3.
After 200 epochs the momentum optimizer has reached a decision boundary that
better represents the underlying training data than the decision boundary learned by
the gradient descent optimizer.
Figure 10.3: Comparison of decision boundaries learned after epochs 0, 50, 100, 200 for an
MLP with the same start weights using gradient descent (row 1), momentum optimizer
(row 2), AdaGrad (row 3), RMSProp (row 4), and Adam (row 5). Last row: training data
used. You can see that especially RMSProp and Adam have found quite good decision
boundaries. At first glance it is therefore not surprising that RMSProp and Adam
are often used to train neural networks.
The remedy. The proposed remedy for this problem is to reduce the amount of
change in the distribution of the activation values of the neurons in a layer by introducing
a normalization step for these values per layer. In order to normalize the activation
values of all neurons in a layer, some statistical information about them is computed,
namely the mean and the variance of the values. The mean is then used to shift the values
such that they always have a new mean of zero. The variance in turn is used to rescale
the values such that the new normalized values have a variance of 1. The shifting and
rescaling of the activation values is controlled by two new parameters per
layer (one for the shifting, one for the scaling). In short: Batch normalization lets the
model learn the best scale and mean of the activation values in each layer l, such that
the overall error over the training dataset is reduced.
Recall that a random variable X can be standardized as

X̂ = (X − µ) / σ                                    (10.17)

if E(X) = µ and Var(X) = σ². The new standardized random variable X̂ has a mean
of E(X̂) = 0 and a variance Var(X̂) = 1.
For a mini-batch of m training samples, Batch Normalization computes for the layer considered:

µB = (1/m) Σᵢ act(i)                               (10.18)

σB² = (1/m) Σᵢ (act(i) − µB)²                      (10.19)

x̂(i) = (act(i) − µB) / √(σB² + ε)                  (10.20)

z(i) = γ x̂(i) + β                                  (10.21)

where the sums run over the samples i = 1, …, m of the mini-batch.
Note that the terms µB, σB², x̂(i) and z(i) are all vectors!
• µB ∈ ℝ^N is the vector that contains for each of the N neurons in the layer
considered its mean activation value, evaluated on the mini-batch
• σB² ∈ ℝ^N is also a vector and contains for each of the N neurons in the layer
considered the variance of the activation values observed while evaluating the
mini-batch
• x̂(i) ∈ ℝ^N is the vector of normalized activation values for the N neurons in the
layer for the i-th sample from the mini-batch. Instead of act(i) we use x̂(i) as an
intermediate step since it has mean 0 and variance 1 as a standardized random
vector.
• z(i) ∈ ℝ^N is what we finally use. Instead of using the actual neuron activation
values in layer l, which are stored in the vector act(i), Batch normalization suggests
to use the vector z(i), which stores a rescaled and shifted version of the
original activation values. These normalized activation values are then the input
for the corresponding transfer functions of the neurons.
• γ is the re-scaling parameter for the layer considered. The value of γ is learned
during training.
• β is the shifting parameter for the layer considered. The value of β is learned
during training.
• ε, as usual, denotes a very small number. Here it is used in order to make sure
that the denominator is not zero and thus the fraction is well-defined.
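To make equations 10.18-10.21 concrete, here is a minimal NumPy sketch of the normalization of one mini-batch of activation values for a single layer; the array shapes and names are my own assumptions, not the book's reference code:

import numpy as np

def batchnorm_forward(act, gamma, beta, eps=1e-8):
    # act: activation values of the mini-batch, shape (m, N)
    # gamma, beta: learned scale and shift parameters, shape (N,)
    mu_B = act.mean(axis=0)                       # eq. 10.18
    var_B = act.var(axis=0)                       # eq. 10.19
    x_hat = (act - mu_B) / np.sqrt(var_B + eps)   # eq. 10.20
    z = gamma * x_hat + beta                      # eq. 10.21
    return z

# usage: a mini-batch of m=4 samples for a layer with N=3 neurons
act = np.random.randn(4, 3)
z = batchnorm_forward(act, gamma=np.ones(3), beta=np.zeros(3))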
During training, the normalization uses the mean and the variance of the activation values
of the neurons in the layer, evaluated on the basis of the mini-batch. But during testing
(”inference”) there are no mini-batches. For this reason, we already compute during the
training a moving average of the mean and the variance of the activation values of the
neurons per layer, which can then later be used in the test phase. This means that
we actually do not only need to learn γ and β for each layer, but also to compute
(not learn!) two moving averages per layer: a moving average of the mean of the neuron
activation values and a moving average of the variance of the activation values.
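In TensorFlow (1.x, as used in this book) this bookkeeping is already built in. The following lines are only a sketch of how batch normalization could be inserted into one layer of an MLP; the layer sizes, the variable names and the is_training placeholder are my own assumptions, not the code used for the experiments in this book:

import tensorflow as tf

is_training = tf.placeholder(tf.bool, name="is_training")
x = tf.placeholder(tf.float32, [None, 100])
y = tf.placeholder(tf.float32, [None, 10])

h = tf.layers.dense(x, 50, activation=None)
# normalize the activations of the hidden layer: during training the
# mini-batch statistics are used (and the moving averages are updated),
# during inference the stored moving averages are used instead
h = tf.layers.batch_normalization(h, training=is_training)
h = tf.nn.relu(h)
logits = tf.layers.dense(h, 10)

loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=y, logits=logits))

# the moving-average updates are extra ops that have to be executed
# together with each training step
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_step = tf.train.AdamOptimizer(0.001).minimize(loss)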
11
Beyond Deep Learning
At Quora - a question-and-answer site - people ask ”What’s next after deep learning?”
and ”What is the next step beyond deep learning in AI?”. At the beginning of this
book I wanted to underline my opinion that learning a hierarchy of local feature
detectors is a good idea that can be borrowed from nature, but that there are many
more helpful principles for information processing that could be exploited. In this
last chapter I will present some possible further approaches, which are very promising
from my point of view.
A first idea is to compile the training batches in a more clever way, i.e., to select the
training patterns in each epoch based on the observed errors at the output neurons.
Training patterns that still produce large errors at the output side of the model should
be included in the next batch with a much higher probability than training patterns
that are mapped nearly flawlessly.
Of course, this approach bears a danger in itself: overfitting. If we first train on all
patterns, but then neglect some training patterns because they are already mapped
quite well, we practically reduce the training set towards the end of the training phase
to a proper subset. Thereby, we increase the probability of overfitting. Further, we run
the risk that we even forget things that we have learned before. However, if
these possible negative effects are taken into account and a solution can be found, compiling the
training batches in a more clever way could give us the possibility to reduce the
training time even further.
This raises a good question: How does nature solve the stability-plasticity problem?
The Complementary Learning System (CLS) theory seems to show a direction for
answering this question: The hippocampal system is able to adapt on a short-term
time scale and allows to rapidly learn new information (and does not represent
old knowledge), while the neocortical system operates on a long-term time scale and
learns an overlapping representation of different things learned before. The interplay
between both systems is crucial: The hippocampus shows a high learning rate and
can quickly adapt to new input data, but for storing the new information into the
neocortical system it has to be played back into the neocortex (probably during sleep)
to achieve its long-term retention.
In general, we learn things in an incremental fashion and thereby make use of previously
acquired motoric and non-motoric knowledge (facts). An important question
that arises here is: Which model or which architecture is needed for realizing incremental
learning? Many research papers, e.g. in robotics, show that a certain ability X
can be learned with a combination of certain machine learning techniques. The papers
are written in a fashion similar to ”We recorded N training examples of tennis returns,
then used clustering technique C to identify prototypes of tennis returns, mapped each
prototype to a lower-dimensional space using PCA, then modeled each tennis return
using a Hidden Markov Model (HMM) and a Markov Random Field (MRF), etc.”.
But it is completely unclear how such models can be reused - after being trained to
model a certain ability X - in order to model also capabilities Y and Z and - perhaps
even more importantly - how the learning of Y and Z can be accelerated based on
what we have learned for realizing ability X.
What do you think if suddenly a small ball rolls out from behind a car onto the street?
Probably: ”Attention! A child could run out from behind the car onto the street as
well, without looking for cars”. Unfortunately, your emergency braking assistant was
not trained to detect balls, because balls are not considered an important road user
class. For this reason, probably no emergency braking will be activated in such a situation.
The message of this example is that some general knowledge is probably necessary
for many machine learning applications. But then the question arises: How can we
encode and learn general knowledge? The approach of Symbolic Artificial Intelligence
(also called Good Old-Fashioned Artificial Intelligence (GOFAI)) was to try to manually
encode general knowledge in symbolic form, e.g., as predicates and rules. In my
opinion, history has shown that in practice this approach only works for a very limited
amount of general knowledge. Further, the symbol grounding problem arises, which is
the question of how abstract symbols such as ”child”, ”ball”, ”to roll”, ”small”, etc.
get their meanings.
A solution to the symbol grounding problem is to associate the symbols with sensory
and motor signals. So the symbols are grounded in sensory-motor patterns. But
that in turn means that your system has to have sensors and actuators which allow
the system to learn these symbol ↔ sensory-motor associations. A system which has
sensors and actuators could be a robot, and many robotics researchers think that for
realizing a general artificial intelligence (GAI) an AI needs a body. The belief that
a body is very helpful for an AI to learn general knowledge is called the embodiment
thesis. I think it is the only feasible way to give an AI general knowledge: Give it a
body and let it learn like a child which objects are out there and how they can be used
to manipulate the world. This learning process is called affordance learning, where
affordance is a term which robotics researchers borrowed from psychology to describe
the possibilities of what a given body can do with an object.
A new perspective on the brain is that the brain is more of a prediction machine. Work
[5] conducted in the context of the Human Brain Project (HBP) seems to underline
this perspective. Based on what you have learned since your birth and what you
have sensed in the last moments, your brain predicts what will happen next. And the
brain will already send signals to your muscles just based on these predictions. These
predictions are then compared with what really happens and will either confirm the
prediction and thereby the ongoing movement, or update the prediction, which in
turn will alter the ongoing movement.
The latter perspective seems to be more plausible, since acting quickly and pro-actively
in the world seems to give living beings an evolutionary advantage. And of course,
returning a tennis ball back to your opponent is much easier, or even only possible, if
you can predict where the tennis ball will land.
However, even for these models, which store some information from previous time steps,
it is unlikely that they can be used for realizing complex cognitive functions such as
reasoning about a scene in an image, or deciding which goal to achieve, developing an
idea (”plan”) of how to achieve it, and pursuing a selected goal. We need some larger,
probably more complex, model that is able to realize different tasks with the same
cognitive building blocks.
The idea of a cognitive architecture is not new. In a survey paper on cognitive architectures
called ”A Review of 40 Years of Cognitive Architecture Research: Focus on
Perception, Attention, Learning and Applications” [23], Kotseruba et al. start their
paper with a list of 84 (!) different cognitive architectures and depict in a table which
cognitive architectures have been presented in which of the 16 previous surveys on
cognitive architectures.
Thereby, many cognitive architectures borrow ideas from cognitive science and name
the cognitive modules in their architecture correspondingly. For example, the ”old” ACT-R
cognitive architecture (by John R. Anderson, a professor of psychology at CMU)
models the limited working memory with the help of ”buffers”. Elements of information
are stored as ”chunks” in these buffers, where chunks consist of a set of slots with
concrete values. A goal buffer, e.g., encodes which goal is to be achieved next, and a
retrieval buffer can store one chunk that was retrieved from the declarative memory.
The procedural memory stores a set of production rules which encode cognitive processes:
A single production rule tests for some conditions / contents of the buffers and,
if it can ”fire” (i.e., if all the buffer conditions are met), describes how to change the
buffer contents. By this, cognition is modeled as a sequence of firing production
rules.
Sounds good. However, to the best of my knowledge, none of these cognitive architectures
has been shown to produce a cognition that can solve different real-world problems.
I started this chapter by noting that many people ask ”What's next after deep learning?”.
My best guess is: A cognitive architecture that is able to learn to solve different
real-world problems. I think cognitive architectures are a promising way forward.
And perhaps in the year 2024 there will be a conversation between two students similar
to this one:
Christine: I updated my robot with the new CogFlow cognitive architecture that came
out last week! The robot can now do the dishes, wash and iron my clothes and play
tennis much better due to better predictions!
Christine: Coglets! They are called coglets! Come on! You really should know what a
coglet is if this is the evening before your oral exam in computer science! Coglets are
the building blocks of all newer cognitive architectures. They model all mental processes
in natural brains. After the seminal paper of Alexandrowitsch Krimkowski in
2022, in which he showed an 11% performance gain on the famous MultiRealWorldTaskNet
benchmark using coglets, they led to a renaissance of interest in the old field of deep learning.
Jürgen rapidly finishes his dinner, then goes to his room to read into cognitive archi-
tectures and coglets.
Solutions for all the exercises can be found in the corresponding GitHub repository
accompanying this book:
https://github.com/juebrauer/Book_Introduction_to_Deep_Learning
You will see that there are two different versions of Python that are available: Python
3.6.3 and Python 2.7.14. Which one should I download?
It is important to know that code written in Python 3.x is not compatible with code
written in Python 2.x. On the one hand, Python 2.7 still provides a larger set of packages,
which has the effect that some programmers choose to remain with Python 2.7.
Fortunately, we do not need to decide and can use a tool called conda to remain flexible.
conda allows us to create environments which can host different versions of Python
interpreters and different packages.
Step 1:
Go to https://www.anaconda.com/download/ and download (Windows) Ana-
conda (64 bit, Python 3.6 version). Anaconda contains the conda package and en-
vironment manager that will allow us to create environments with different Python
versions.
Step 2:
Start the Anaconda Navigator. Then create two environments: an environment
env_python27
where you choose Python 2.7 as the interpreter, and another environment
env_python36
where you choose Python 3.6 as the interpreter.
Step 3:
Install PyCharm, a nice Python IDE. Start PyCharm and create a project
a_python2_project. Open the
File → Settings → Project Interpreter
dialog and add the directory where conda stores the env_python27 environment to
the Project Interpreter selection box (”Add local”) by choosing the Python interpreter
in that environment directory. Then choose this newly added interpreter as the project
interpreter.
Step 4:
Now add a new file python2_code.py to the project, enter the following code and
run it:
1 import sys
2 print "Your Python version is: " + sys.version
Step 5:
Now prepare a second project a_python3_project, choose the env_python36 environment as
the project interpreter and add a python3_code.py source code file to this project with
the following code:
1 import sys
2 print("Your Python version is: " + sys.version)
Step 6:
Now search for articles on the internet describing some differences between Python 2
and Python 3 and augment the code in both project source files python2_code.py
and python3_code.py to show and try out these differences. Try to find as many
differences as possible, at least 5.
1 10 13 16 19 22 25 28
2 10 13 16 19 22 25 28
b) Format the following code correctly such that it runs and produces the output
specified:
1 counter=0
2 while(counter<10):
3 print(counter, end=' ')
4 counter +=1
5 else:
6 print("counter=" + str(counter))
7
8 for i in range(1, 10):
9 if(i%5==0):
10 break
11 print(i, end=' ')
12 else:
13 print("i=" + str(i))
Desired output:
1 0 1 2 3 4 5 6 7 8 9 counter=10
2 1 2 3 4
Dynamic typing
a) How can you get the information which data type Python uses internally for each
of the variables a-f? Print the data type for each of the variables.
1 a = 2
2 b = 3.1
3 c = 'd'
4 d = "a string"
5 e = [1,22,333]
6 f = (4,55,666)
1 b = 21
2 b = b+b
3 print(b)
4 b = "3"
5 b = b+b
6 print(b)
7 print(type(int(b)))
8 print(type(str(type(int(b)))))
9 print(str(type(str(type(int(b)))))[3])
Selections
a) Let the user enter a number. If the number is between 1-3 output ”A”, if it is
between 4-6 output ”B”, if it is between 7-9 output ”C”. If it is not between 1-9
output ”Invalid number!”. The user shall be able to enter numbers till he enters the
word ”exit”. If he enters a string that is not a number and different from the word
”exit”, output ”Invalid command!”
1 1
2 A
3 1.384
4 A
5 5.4
6 B
7 9
8 C
9 9.0
10 C
11 9.1
12 Invalid number!
13 10
14 Invalid number!
15 -1.45
16 Invalid number!
17 test
18 Invalid command!
19 exit
Functions
a) Define a function f1 that accepts two parameters value1 and value2, computes the
sum and the product and returns both. What is the type of the ”thing” that you
return?
b) Now define a function f2 that does the same but provides default values for both
arguments (e.g. default value1 = 2, default value2 = 3). So the behavior of your
function f2 should look like this without specifying any of the values:
1 print(f2())
2 (5,6)
Is it possible now to call f2 without specifying value1, but only value2? How?
Lists
a) Define a list which stores the strings ”cheese”, ”milk”, ”water”. Output the list.
Then append the string ”apples”. Output the list. Remove the string ”milk”. Output
the list. Iterate over the list and output each element.
Classes
a) Define a class ”car” that accepts the car's name and its maximum speed as parameters
in the class constructor. Three attributes shall be stored for a car: its name,
its maximum speed and its mileage. Define a method ”set_speed” which allows to
set the current speed. Also define a method ”drive” which accepts as parameter the
number of hours to drive and increases the mileage accordingly. Finally, also provide
a method ”show_status” which outputs the current speed and mileage. Test your class
by creating two object instances.
b) Derive a class ”convertible” that uses ”car” as base class. Pass the time for letting
the roof of the convertible down as an argument to the class constructor. Override the
”show_status” method of the base class and also output the time to open the roof.
https://chrisconlan.com/installing-python-opencv-3-windows/
1 import sys
2 import cv2
3
4 print("Your Python version is: " + sys.version)
5 print("Your OpenCV version is: " + cv2.__version__)
6
7 # Load a color image as a grayscale image
8 img = cv2.imread('coins.jpg',0)
9
10 # Show the image
11 cv2.imshow('image',img)
12
13 # Wait for user to press a key
14 cv2.waitKey(0)
1 Your Python version is: 3.5.4 |Continuum Analytics, Inc.| (default, Aug 14
2017, 13:41:13) [MSC v.1900 64 bit (AMD64)]
2 Your OpenCV version is: 3.1.0
http://www.juergenbrauer.org/teaching/deep_learning/exercises_book/
test_data/video_testpattern.mp4
and use the following code to read in the video frame by frame:
1 import numpy as np
2 import cv2
3
4 cap = cv2.VideoCapture('video_testpattern.mp4')
5 #cap = cv2.VideoCapture(0)
6
7 while(cap.isOpened()):
8
9 ret, frame = cap.read()
10
11 if (ret == False):
12 break
13
14 gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
15
16 cv2.imshow('frame',gray)
17
18 c = cv2.waitKey(1)
19 # ’q’ pressed?
20 if (c==113):
21 break
22
23 cap.release()
Try out what happens on your computer if you comment out (deactivate) line 4 and
uncomment (activate) line 5.
If we consider a grayscale image with values from [0,255] and use the filter2D()
function with the above kernel: Which is the largest and which is the smallest possible
value that we could in principle observe in the filter result matrix?
Note: Make sure that you normalize the resulting filter values such that imshow()
can be used to display all values of the filter result matrix!
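As an illustration of this note, the following sketch uses a simple 3x3 example kernel (not the kernel from the exercise) and shows one possible way to normalize the filter result for display with imshow(); it is only a sketch, not a solution to the exercise:

import cv2
import numpy as np

img = cv2.imread('coins.jpg', 0)                      # grayscale image
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=np.float32)     # placeholder kernel

# compute the filter result with float precision (values may be negative)
result = cv2.filter2D(img.astype(np.float32), -1, kernel)

# map the result linearly to [0,255] so that imshow() can display it
result_norm = cv2.normalize(result, None, 0, 255, cv2.NORM_MINMAX)
cv2.imshow('filter result', result_norm.astype(np.uint8))
cv2.waitKey(0)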
You can install NumPy directly with the help of the Anaconda Navigator. However,
I recommend installing TensorFlow right away, which depends on NumPy, so NumPy
will be installed automatically as well; we will need TensorFlow already in the next
exercise.
Version
a) How can you output the version of your NumPy installation?
Generating arrays
a) Generate a 1D array a1 that consists of 5 integers. Then output the data type
used to store the entries of this array.
1 a1= [ 2 4 6 8 10]
2 Numbers in a1 are stored using data type int32
b) Change the data type that is used to store the numbers in a1 to float with 32
bit and make sure you did it right by outputting the data type used for storing the
numbers of a1 again:
c) Generate a 2D array a2 of size 3x2 (rows x columns) with the following numbers
and output it. Also output the number of dimensions of this array, the number of
rows and number of columns.
1 a2= [[1 2]
2 [3 4]
3 [5 6]]
4 number of dimensions of a2: 2
5 number of rows of a2: 3
6 number of columns of a2: 2
d) Output how many bytes are used to store a single array element of a2 and how
many bytes are used in total:
e) Generate the following 3D array a3. Also output the number of dimensions of a3
and the size of each dimension
1 a3= [[[ 1 2 3 4]
2 [ 5 6 7 8]
3 [ 9 10 11 12]]
4
5 [[13 14 15 16]
6 [17 18 19 20]
7 [21 22 23 24]]]
8 number of dimensions of a3: 3
9 number of slices of a3: 2
10 number of rows of a3: 3
11 number of columns of a3: 4
b) Change the value of that element to 42 and output the value again by retrieving
the value at that position from array a3:
c) Store the first slice from the 3D array a3 in a new 2D array a4 and output it:
1 a4= [[ 1 2 3 4]
2 [ 5 6 7 8]
3 [ 9 10 11 12]]
d) Now retrieve from a4 the third column as a 1D array a5 and output it:
1 a5= [ 3 7 11]
1 a6= [5 6 7 8]
1 a7= [[ 6 7]
2 [10 11]]
Reshaping arrays
a) Generate the following 1D array A. Then reshape it to a 2D array B with 2 rows
and 5 columns.
1 A= [ 1 2 3 4 5 6 7 8 9 10]
2 B= [[ 1 2 3 4 5]
3 [ 6 7 8 9 10]]
1 C= [[ 1 2]
2 [ 3 4]
3 [ 5 6]
4 [ 7 8]
5 [ 9 10]]
1 A=
2 [[1 1]
3 [0 1]]
4 B=
5 [[2 0]
6 [3 4]]
7 A+B=
8 [[3 1]
9 [3 5]]
10 element wise multiplication A*B=
11 [[2 0]
12 [0 4]]
13 matrix multiplication A*B=
14 [[5 4]
15 [3 4]]
b) Define the following matrix A. Then compute its inverse A_inv. Check whether
the matrix product of both matrices A and A_inv really gives you the 2x2 identity
matrix.
1 A=
2 [[ 1. 2.]
3 [ 3. 4.]]
4 A_inv=
5 [[-2. 1. ]
6 [ 1.5 -0.5]]
7 A * A_inv=
8 [[ 1.00000000e+00 1.11022302e-16]
9 [ 0.00000000e+00 1.00000000e+00]]
d) How can we automatically check whether A*A_inv gives us the 2x2 identity matrix?
Random arrays
a) How can you generate a random matrix of 5 rows and 3 columns with random
float values drawn from a uniform distribution with values in the range -1 and +1?
Generate such a matrix rndA, then output it:
1 rndA=
2 [[ 0.67630242 -0.49098576 -0.18706128]
3 [ 0.61222022 0.38307423 0.74869381]
4 [ 0.16949814 0.16301043 -0.77961425]
5 [-0.99861878 0.9521788 -0.55123554]
6 [ 0.92839569 0.45590548 0.09234368]]
1 rndB=
2 [[-1 0 -1]
3 [ 0 -1 0]
4 [-1 0 -1]
5 [ 0 -1 0]
6 [ 0 -1 0]]
Which classification rate do you get for the Perceptron after 25.000 training steps?
Compare the classification rate of your simple classifier with the classification rate of
the Perceptron that has learned for 25.000 steps. Which one has a larger classification
rate?
Download URL:
http://www.juergenbrauer.org/teaching/deep_learning/exercises_book/test_data/
10x10_audio_dataset.zip
More precisely: Write a Python class MLP where new layers can be added flexibly.
For each new layer of neurons the add() method of this class shall accept the number
of neurons and the type of transfer function to be used for neurons in this layer as
parameters. Also write a method feedforward() which accepts an input vector
and then computes the output values of the neurons, starting in the first layer and
going to the last (output) layer step by step.
After implementing the MLP, think about a test which makes it plausible that
your implementation works correctly.
Then generate a 100-500-5000-500-10 MLP (100 input neurons, 500 neurons in layer
1, etc.) and measure how long a single feedforward step takes.
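One possible way to structure such a class is sketched below; this is only an illustration of the interface described above (an add() method and a feedforward() method), not the reference solution from the GitHub repository:

import numpy as np
import time

class MLP:
    def __init__(self, nr_inputs):
        self.layer_sizes = [nr_inputs]
        self.weights, self.biases, self.transfer = [], [], []

    def add(self, nr_neurons, transfer_func):
        # new weight matrix from the previous layer to the new layer
        prev = self.layer_sizes[-1]
        self.weights.append(np.random.randn(prev, nr_neurons) * 0.1)
        self.biases.append(np.zeros(nr_neurons))
        self.transfer.append(transfer_func)
        self.layer_sizes.append(nr_neurons)

    def feedforward(self, x):
        out = np.asarray(x, dtype=np.float64)
        # propagate the input vector layer by layer to the output layer
        for W, b, f in zip(self.weights, self.biases, self.transfer):
            out = f(out @ W + b)
        return out

relu = lambda a: np.maximum(a, 0.0)
logistic = lambda a: 1.0 / (1.0 + np.exp(-a))

# a 100-500-5000-500-10 MLP as described in the exercise
mlp = MLP(100)
for n in (500, 5000, 500):
    mlp.add(n, relu)
mlp.add(10, logistic)

start = time.time()
mlp.feedforward(np.random.randn(100))
print("one feedforward step took", time.time() - start, "seconds")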
Option 1
Take your own implementation of your MLP from Exercise 7 and augment it by a
method backprop() that accepts a teacher vector (vector of desired output values
of the output neurons) and uses the Backpropagation formulas presented in sections 7.4
- 7.6 for changing the weights to output neurons and hidden neurons. Then generate
some training data and test your implementation: can your MLP map the input
vectors to the desired output vectors after training?
Option 2
Take my implementation of a MLP with the Backpropagation formulas. You can find
it here:
https://github.com/juebrauer/Book_Introduction_to_Deep_Learning/tree/master/
code_examples_in_book/03_mlp
Read the code and try to understand how the Backpropagation formulas are implemented
in train(). Also try to understand what happens in test_mlp.py. Then
prepare to explain the code in the form of a code walk-through to your fellow students
in the class.
Option 3
Search for an easy-to-understand implementation of the Backpropagation algorithm in
Python. Understand the code at least at a level that allows you to show your fellow
students where and how the Backpropagation formulas are implemented in the code.
Prepare to explain the code in the form of a code walk-through to your fellow students
in the class. Use this implementation to train an MLP using some training data: can
the MLP map the input vectors to the desired output vectors after training?
https://github.com/juebrauer/Book_Introduction_to_Deep_Learning/tree/master/
code_examples_in_book/03_mlp
with the time needed by your TensorFlow implementation to train one epoch as well.
Use the class data_generator.py of my implementation to generate the 10.000 2D
training samples (= one batch) and visualize the decision boundaries learned by your
TensorFlow MLP after each epoch.
Experiment with i) different numbers of convolutional layers and ii) different numbers
of features per convolution layer. For each experiment compute the classification rate
of your model on the test dataset. Stop the training of your model a) if you have
presented a fixed number of training images or b) if the classification error on your
evaluation dataset seems not to shrink any longer. Choose a) or b) as a stop criterion.
In the exercise you should first present your implementation, then which experiments
you did and finally the results of your experiments.
Actually, the LSTM model and not the CNN model is currently the dominant model
for speech processing. However, a CNN can be used for speech recognition as well, and
it works quite well!
In this post the author shows how to convert an audio signal into MFCC (Mel Frequency
Cepstral Coefficients) feature vectors with the help of the librosa
library. It further shows how to use Keras to build up and train a simple CNN using
just a few lines of code. With Keras, using the model for prediction (”inference”)
similarly needs only a few lines of code.
https://github.com/manashmndl/DeadSimpleSpeechRecognizer
In the GitHub repository Manash has already uploaded 3 folders with ca. 1700 audio
files each for the words ’cat’, ’bed’, ’happy’ spoken by different speakers. These audio
files are part of a larger dataset available at Kaggle:
https://www.kaggle.com/c/tensorflow-speech-recognition-challenge
The dataset is part of the ”TensorFlow Speech Recognition Challenge” which takes
place in January 2018.
Write a Python program that uses Manash’s code snippets in order to classify the
three words ’cat’, ’bed’, ’happy’ with a CNN implemented in Keras. From the audio
folder take 100 audio streams each in order to test the accuracy of your learned model.
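To illustrate the ”few lines of code” that Keras needs for such a model, a minimal CNN could be sketched as follows; the input shape (here 20x11 MFCC coefficients with one channel) and all layer sizes are assumptions chosen only for illustration, not Manash's exact model:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()
# two small convolution blocks on the MFCC "image"
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(20, 11, 1)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, (2, 2), activation='relu'))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(3, activation='softmax'))   # 3 classes: 'cat', 'bed', 'happy'

model.compile(loss='categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])

# X_train: MFCC arrays of shape (nr_samples, 20, 11, 1)
# y_train: one-hot encoded labels of shape (nr_samples, 3)
# model.fit(X_train, y_train, batch_size=32, epochs=20)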
https://github.com/juebrauer/Book_Introduction_to_Deep_Learning/tree/master/
code_examples_in_book/03_mlp
Then generate a 2-5-5-5-5-5-2 MLP, i.e., a deep neural network with 2 input neurons,
2 output neurons and 5 hidden layers. Let the NN learn some 2D to 2D mapping
using examples of (2D input vector, 2D output vector). Then augment the MLP
implementation such that you also compute the average weight change per layer. Also
add a method for the MLP class such that you can output the average weight changes
per layer that are applied in the Backpropagation step. Conduct two experiments:
one with the logistic transfer function in the hidden neurons, one with the ReLU
transfer function in the hidden neurons. Can you see a systematic pattern in
the average weight change values per layer when using the ReLU transfer function?
How do the average weight change values of the first experiment compare to the
values in the second experiment?
http://ruishu.io/2016/12/27/batchnorm/
Bibliography
[10] Luke Durant et al. Inside Volta: The World's Most Advanced Data Center GPU. 2017. url: https://devblogs.nvidia.com/parallelforall/inside-volta/.
[11] Jon Fingas. Apple's AI acquisition could help Siri make sense of your data. 2017. url: https://www.engadget.com/2017/05/13/apple-acquires-lattice-data/.
[12] Ina Fried. Intel is paying more than 400 million dollar to buy deep-learning startup Nervana Systems. 2016. url: https://www.recode.net/2016/8/9/12413600/intel-buys-nervana--350-million.
[13] Kunihiko Fukushima. “Cognitron: A self-organizing multilayered neural network”. In: Biological Cybernetics 20.3 (1975), pp. 121–136. issn: 1432-0770. doi: 10.1007/BF00342633. url: http://dx.doi.org/10.1007/BF00342633.
[14] Kunihiko Fukushima. “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position”. In: Biological Cybernetics 36.4 (1980), pp. 193–202. issn: 1432-0770. doi: 10.1007/BF00344251. url: http://dx.doi.org/10.1007/BF00344251.
[15] Gordon D. Goldstein. “Perceptron Mark I”. In: Digital Computer Newsletter of the Office of Naval Research (1960).
[16] Donald O. Hebb. The Organization of Behavior: A Neuropsychological Theory. New Ed. New York: Wiley, June 1949. isbn: 0805843000. url: http://www.amazon.com/exec/obidos/redirect?tag=citeulike07-20&path=ASIN/0805843000.
[17] J.C. Horton and DL Adams. “The cortical column: a structure without a function.” In: Philosophical Transactions of the Royal Society (2005), pp. 837–862.
[18] Norm Jouppi. Google supercharges machine learning tasks with TPU custom chip. 2016. url: https://cloudplatform.googleblog.com/2016/05/Google-supercharges-machine-learning-tasks-with-custom-chip.html.
[19] Norman P. Jouppi et al. “In-Datacenter Performance Analysis of a Tensor Processing Unit”. In: 2017. url: https://arxiv.org/pdf/1704.04760.pdf.
[20] KhanAcademy, ed. Neuron action potentials: The creation of a brain signal. url: https://www.khanacademy.org/test-prep/mcat/organ-systems/neuron-membrane-potentials/a/neuron-action-potentials-the-creation-of-a-brain-signal.
[21] D.P. Kingma and J. Ba. “Adam: A Method for Stochastic Optimization”. In: Proc. of International Conference on Learning Representations (ICLR) (2015).
[34] German I. Parisi et al. “Continual Lifelong Learning with Neural Networks: A Review”. In: arXiv:1802.07569 (2018). https://arxiv.org/abs/1802.07569. url: https://www2.informatik.uni-hamburg.de/wtm/publications/2018/PKPKW18/.
[35] James Randerson. How many neurons make a human brain? Billions fewer than we thought. 2012. url: https://www.theguardian.com/science/blog/2012/feb/28/how-many-neurons-human-brain.
[36] Joseph Redmon and Ali Farhadi. “YOLO9000: Better, Faster, Stronger”. In: CoRR abs/1612.08242 (2016). arXiv: 1612.08242. url: http://arxiv.org/abs/1612.08242.
[37] Joseph Redmon et al. “You Only Look Once: Unified, Real-Time Object Detection”. In: CoRR abs/1506.02640 (2015). arXiv: 1506.02640. url: http://arxiv.org/abs/1506.02640.
[38] Shaoqing Ren et al. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. In: CoRR abs/1506.01497 (2015). arXiv: 1506.01497. url: http://arxiv.org/abs/1506.01497.
[39] Gerard Rinkus. “A cortical sparse distributed coding model linking mini- and macrocolumn-scale functionality”. In: Frontiers in Neuroanatomy 4 (2010), p. 17. issn: 1662-5129. doi: 10.3389/fnana.2010.00017. url: https://www.frontiersin.org/article/10.3389/fnana.2010.00017.
[40] Frank Rösler. Psychophysiologie der Kognition - Eine Einführung in die Kognitive Neurowissenschaft. Spektrum Akademischer Verlag, 2011.
[41] Yichuan Tang. “Deep Learning using Linear Support Vector Machines”. In: ICML Challenges in Representation Learning Workshop (2013).
[42] The Economic Times, ed. The 11 most important Google acquisitions ever. 2014. url: http://economictimes.indiatimes.com/slideshows/tech-life/the-11-most-important-google-acquisitions-ever/dnnresearch-inc-neural-networks/slideshow/40253843.cms.
[43] Wikipedia, ed. List of mergers and acquisitions by Alphabet. url: https://en.wikipedia.org/wiki/List_of_mergers_and_acquisitions_by_Alphabet.