Introduction To Deep Learning
Undergraduate Topics in Computer Science
Series Editor
Ian Mackie
Advisory Editors
Samson Abramsky
University of Oxford, Oxford, UK
Chris Hankin
Imperial College London, London, UK
Mike Hinchey
University of Limerick, Limerick, Ireland
Dexter C. Kozen
Cornell University, Ithaca, USA
Andrew Pitts
University of Cambridge, Cambridge, UK
Steven S. Skiena
Stony Brook University, Stony Brook, USA
Iain Stewart
University of Durham, Durham, UK
Undergraduate Topics in Computer Science (UTiCS) delivers high-
quality instructional content for undergraduates studying in all areas of
computing and information science. From core foundational and
theoretical material to final-year topics and applications, UTiCS books
take a fresh, concise, and modern approach and are ideal for self-study
or for a one- or two-semester course. The texts are all authored by
established experts in their fields, reviewed by an international
advisory board, and contain numerous examples and problems. Many
include fully worked solutions.
More information about this series at http://www.springer.com/
series/7592
Sandro Skansi
The publisher, the authors and the editors are safe to assume that the
advice and information in this book are believed to be true and accurate
at the date of publication. Neither the publisher nor the authors or the
editors give a warranty, express or implied, with respect to the material
contained herein or for any errors or omissions that may have been
made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer
International Publishing AG part of Springer Nature
The registered company address is: Gewerbestrasse 11, 6330 Cham,
Switzerland
Preface
This textbook contains no new scientific results, and my only
contribution was to compile existing knowledge and explain it with my
examples and intuition. I have made a great effort to cover everything
with citations while maintaining a fluent exposition, but in the modern
world of the ‘electron and the switch’ it is very hard to properly
attribute all ideas, since there is an abundance of quality material
online (and the online world has become very dynamic thanks to social
media). I will do my best to correct any mistakes and omissions for the
second edition, and all corrections and suggestions will be greatly
appreciated.
This book uses the feminine pronoun to refer to the reader
regardless of the actual gender identity. Today, we have a highly
imbalanced environment when it comes to artificial intelligence, and
the use of the feminine pronoun will hopefully serve to alleviate the
alienation and make the female reader feel more at home while reading
this book.
Throughout this book, I give historical notes on when a given idea
was first discovered. I do this to credit the idea, but also to give the
reader an intuitive timeline. Bear in mind that this timeline can be
deceiving, since the time an idea or technique was first invented is not
necessarily the time it was adopted as a technique for machine
learning. This is often the case, but not always.
This book is intended to be a first introduction to deep learning.
Deep learning is a special kind of learning with deep artificial neural
networks, although today deep learning and artificial neural networks
are considered to be the same field. Artificial neural networks are a
subfield of machine learning which is in turn a subfield of both
statistics and artificial intelligence (AI). Artificial neural networks are
vastly more popular in artificial intelligence than in statistics. Deep
learning today is not happy with just addressing a subfield of a subfield,
but tries to make a run for the whole of AI. An increasing number of AI
fields like reasoning and planning, which were once the bastions of
logical AI (also called the Good Old-Fashioned AI , or GOFAI), are now
being tackled successfully by deep learning. In this sense, one might say
that deep learning is an approach in AI, and not just a subfield of a
subfield of AI.
There is an old idea from Kendo 1 which seems to find its way to the
new world of cutting-edge technology. The idea is that you learn a
martial art in four stages: big, strong, fast, light. ‘Big’ is the phase where
all movements have to be big and correct. One here focuses on correct
techniques, and one’s muscles adapt to the new movements. While doing big movements, the movements unconsciously start becoming strong. ‘Strong’ is the next phase, when one focuses on strong movements. We have learned how to perform the movements correctly, and now we add strength, and subconsciously they become faster and faster. While learning ‘Fast’, we
start ‘cutting corners’, and adopt a certain ‘parsimony’. This parsimony
builds ‘Light’, which means ‘just enough’. In this phase, the practitioner
is a master, who does everything correctly, and movements can shift
from strong to fast and back to strong, and yet they seem effortless and
light. This is the road to mastery of the given martial art, and to an art
in general. Deep learning can be thought of as an art in this metaphorical
sense, since there is an element of continuous improvement. The
present volume is intended not to be an all-encompassing reference,
but it is intended to be the textbook for the “big” phase in deep
learning. For the strong phase, we recommend [1], for the fast we
recommend [2] and for the light phase, we recommend [3]. These are
important works in deep learning, and a well-rounded researcher
should read them all.
After this, the ‘fellow’ becomes a ‘master’ (and mastery is not the
end of the road, but the true beginning), and she should be ready to
tackle research papers, which are best found on arxiv.org under
‘Learning’. Most deep learning researchers are very active on
arxiv.org, and regularly publish their preprints. Be sure to also check
out the ‘Computation and Language’, ‘Sound’ and ‘Computer Vision’
categories depending on your desired specialization direction. A good
practice is just to put the desired category on your web browser home
screen and check it daily. Surprisingly, the arxiv.org ‘Neural and
Evolutionary Computation’ is not the best place for finding deep
learning papers, since it is a rather young category, and some
researchers in deep learning do not tag their work with this category,
but it will probably become more important as it matures.
The code in this book is Python 3, and most of the code using the library Keras is a modified version of the code presented in [2]. Their book 2 offers a lot of code and some explanations with it, whereas we give a modest amount of code, rewritten to be intuitive, and comment on it abundantly. The code we offer has been extensively tested, and we hope it is in working condition. But since this book is an introduction and we cannot assume the reader is very familiar with coding deep architectures, I will help the reader troubleshoot all the code from this book. A complete list of bug fixes and updated code, as well as contact details for submitting new bugs, are available at the book’s repository github.com/skansi/dl_book, so please check the list and the updated version of the code before submitting a new bug report.
Artificial intelligence as a discipline can be considered to be a sort of
‘philosophical engineering’. What I mean by this is that AI is a process of
taking philosophical ideas and making algorithms that implement
them. The term ‘philosophical’ is taken broadly as a term which also
encompasses the sciences which recently 3 became independent
sciences (psychology, cognitive science and structural linguistics), as
well as sciences that are hoping to become independent (logic and
ontology 4 ).
Why is philosophy in this broad sense so interesting to replicate? If
you consider what topics are interesting in AI, you will discover that AI,
at the most basic level, wishes to replicate philosophical concepts, e.g.
to build machines that can think, know stuff, understand meaning, act
rationally, cope with uncertainty, collaborate to achieve a goal, handle
and talk about objects. You will rarely see a definition of an AI agent
using non-philosophical terms such as ‘a machine that can route
internet traffic’, or ‘a program that will predict the optimal load for a
robotic arm’ or ‘a program that identifies computer malware’ or ‘an
application that generates a formal proof for a theorem’ or ‘a machine
that can win in chess’ or ‘a subroutine which can recognize letters from
a scanned page’. The weird thing is, all of these are actual historical AI
applications, and machines such as these always made the headlines.
But the problem is, once we got it to work, it was no longer
considered ‘intelligent’, but merely an elaborate computation. AI history
is full of such examples. 5 The systematic solution of a certain problem
requires a full formal specification of the given problem, and after a full
specification is made, and a known tool is applied to it, 6 it stops being
considered a mystical human-like machine and starts being considered
‘mere computation’. Philosophy deals with concepts that are inherently
tricky to define such as knowledge, meaning, reference, reasoning, and
all of them are considered to be essential for intelligent behaviour. This
is why, in a broad sense, AI is the engineering of philosophical concepts.
But do not underestimate the engineering part. While philosophy is
very prone to reexamining ideas, engineering is very progressive, and
once a problem is solved, it is considered done. AI has the tendency to
revisit old tasks and old problems (and this makes it very similar to
philosophy), but it does require measurable progress, in the sense that
new techniques have to bring something new (and this is its
engineering side). This novelty can be better results than the last result
on that problem, 7 the formulation of a new problem 8 or results below
the benchmark but which can be generalized to other problems as well.
Engineering is progressive, and once something is made, it is used
and built upon. This means that we do not have to re-implement
everything anew—there is no use in reinventing the wheel. But there is
value to be gained in understanding the idea behind the invention of
the wheel and in trying to make a wheel by yourself. In this sense, you should try to recreate the code we will be exploring, to see how it works, and even try to re-implement a completed Keras layer in plain Python. It is quite probable that if you manage, your solution will be
considerably slower, but you will have gained insight. When you feel
you understand it as much as you would like, you should just use Keras
or any other framework as building bricks to go on and build more
elaborate things.
In today’s world, everything worth doing is a team effort and every
job is then divided into parts. My part of the job is to get the reader
started in deep learning. I would be proud if a reader would digest this
volume, put it on a shelf, become an active deep learning researcher
and never consult this book again. To me, this would mean that she has
learned everything there was in this book and this would entail that my
part of the job of getting one started 9 in deep learning was done well.
In philosophy, this idea is known as Wittgenstein’s ladder, and it is an
important practical idea that will supposedly help you in your personal
exploration–exploitation balance.
I have also placed a few Easter eggs in this volume, mainly as
unusual names in examples. I hope that they will make reading more
lively and enjoyable. For all who wish to know, the name of the dog in
Chap. 3 is Gabi, and at the time of publishing, she will be 4 years old.
This book is written in plural, following the old academic custom of
using pluralis modestiae , and hence after this preface I will no longer
use the singular personal pronoun, until the very last section of the
book.
I wish to thank everyone who has participated in any way
and made this book possible. In particular, I would like to thank Siniša
Urošev, who provided valuable comments and corrections of the
mathematical aspects of the book, and Antonio Šajatović, who
provided valuable comments and suggestions regarding memory-based
models. Special thanks go to my wife Ivana for all the support she gave
me. I hold myself (and myself alone) responsible for any omissions or
mistakes, and I would greatly appreciate all feedback from readers.
References
1. I. Goodfellow, Y. Bengio, A. Courville, Deep Learning (MIT press,
Cambridge, 2016)
Sandro Skansi
Zagreb, Croatia
Contents
1 From Logic to Cognitive Science
1.1 The Beginnings of Artificial Neural Networks
1.2 The XOR Problem
1.3 From Cognitive Science to Deep Learning
1.4 Neural Networks in the General AI Landscape
1.5 Philosophical and Cognitive Aspects
References
2 Mathematical and Computational Prerequisites
2.1 Derivations and Function Minimization
2.2 Vectors, Matrices and Linear Programming
2.3 Probability Distributions
2.4 Logic and Turing Machines
2.5 Writing Python Code
2.6 A Brief Overview of Python Programming
References
3 Machine Learning Basics
3.1 Elementary Classification Problem
3.2 Evaluating Classification Results
3.3 A Simple Classifier: Naive Bayes
3.4 A Simple Neural Network: Logistic Regression
3.5 Introducing the MNIST Dataset
3.6 Learning Without Labels: K-Means
3.7 Learning Different Representations: PCA
3.8 Learning Language: The Bag of Words Representation
References
4 Feedforward Neural Networks
4.1 Basic Concepts and Terminology for Neural Networks
4.2 Representing Network Components with Vectors and
Matrices
4.3 The Perceptron Rule
4.4 The Delta Rule
4.5 From the Logistic Neuron to Backpropagation
4.6 Backpropagation
4.7 A Complete Feedforward Neural Network
References
5 Modifications and Extensions to a Feed-Forward Neural Network
5.1 The Idea of Regularization
5.2 L1 and L2 Regularization
2 This is the only book that I own two copies of, one eBook on my computer and one hard copy
—it is simply that good and useful.
3 Philosophy is an old discipline, dating back at least 2300 years, and ‘recently’ here means ‘in
the last 100 years’.
5 John McCarthy was amused by this phenomenon and called it the ‘look ma, no hands’ period
of AI history, but the same theme keeps recurring.
6 Since new tools are presented as new tools for existing problems, it is not very common to
tackle a new problem with newly invented tools.
7 This is called the benchmark for a given problem; it is something you must surpass.
8 Usually in the form of a new dataset constructed from a controlled version of a philosophical
problem or set of problems. We will have an example of this in the later chapters when we address the bAbI dataset.
9 Or, perhaps, ‘getting initiated’ would be a better term—it depends on how fond you will become of deep learning.
1 From Logic to Cognitive Science
References
1. A.M. Turing, On computable numbers, with an application to the Entscheidungsproblem. Proc. Lond. Math. Soc. 42(2), 230–265 (1936)
2. V. Peckhaus, Leibniz’s influence on 19th century logic, in The Stanford Encyclopedia of Philosophy, ed. by E.N. Zalta (2014)
3. J.S. Mill, A System of Logic Ratiocinative and Inductive: Being a Connected View of the Principles of Evidence and the Methods of Scientific Investigation (1843)
4. G. Boole, An Investigation of the Laws of Thought (1854)
5. A.M. Turing, Computing machinery and intelligence. Mind 59(236), 433–460 (1950)
6. R. Carnap, Logical Syntax of Language (Open Court Publishing, 1937)
7. A.N. Whitehead, B. Russell, Principia Mathematica (Cambridge University Press, Cambridge, 1913)
8. J.Y. Lettvin, H.R. Maturana, W.S. McCulloch, W.H. Pitts, What the frog’s eye tells the frog’s brain. Proc. IRE 47(11), 1940–1959 (1959)
9. N.R. Smalheiser, Walter Pitts. Perspect. Biol. Med. 43(1), 217–226 (2000)
10. A. Gefter, The man who tried to redeem the world with logic. Nautilus 21 (2015)
11. F. Rosenblatt, Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms (Spartan Books, Washington, 1962)
12. F. Rosenblatt, Recent work on theoretical models of biological memory, in Computer and Information Sciences II, ed. by J.T. Tou (Academic Press, 1967)
13. S. Russell, P. Norvig, Artificial Intelligence: A Modern Approach, 3rd edn. (Pearson, London, 2010)
14. H. Moravec, Mind Children: The Future of Robot and Human Intelligence (Harvard University Press, Cambridge, 1988)
15. M. Minsky, S. Papert, Perceptrons: An Introduction to Computational Geometry (MIT Press, Cambridge, 1969)
16. L.R. Graham, Science in Russia and the Soviet Union: A Short History (Cambridge University Press, Cambridge, 2004)
17. S. Pinker, The Blank Slate (Penguin, London, 2003)
18. B.F. Skinner, The Possibility of a Science of Human Behavior (The Free House, New York, 1953)
19. E.L. Gettier, Is justified true belief knowledge? Analysis 23, 121–123 (1963)
20. T.S. Kuhn, The Structure of Scientific Revolutions (University of Chicago Press, Chicago, 1962)
21. N. Chomsky, Aspects of the Theory of Syntax (MIT Press, Cambridge, 1965)
22. N. Chomsky, A review of B.F. Skinner’s Verbal Behavior. Language 35(1), 26–58 (1959)
23. A. Newell, J.C. Shaw, H.A. Simon, Elements of a theory of human problem solving. Psychol. Rev. 65(3), 151–166 (1958)
24. J. Lighthill, Artificial intelligence: a general survey, in Artificial Intelligence: A Paper Symposium (Science Research Council, 1973)
25. P.J. Werbos, Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences (Harvard University, Cambridge, 1975)
26. D.B. Parker, Learning-logic. Technical Report No. 47 (MIT Center for Computational Research in Economics and Management Science, Cambridge, 1985)
27. Y. LeCun, Une procédure d’apprentissage pour réseau à seuil asymétrique. Proc. Cogn. 85, 599–604 (1985)
28. D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning internal representations by error propagation. Parallel Distrib. Process. 1, 318–362 (1986)
29. J.J. Hopfield, Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA 79(8), 2554–2558 (1982)
30. N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods (Cambridge University Press, Cambridge, 2000)
31. S. Hochreiter, J. Schmidhuber, Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
32. Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
33. G.E. Hinton, S. Osindero, Y.-W. Teh, A fast learning algorithm for deep belief nets. Neural Comput. 18(7), 1527–1554 (2006)
34. J. Schmidhuber, Deep learning in neural networks: an overview. Neural Netw. 61, 85–117 (2015)
35. P. Domingos, The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World (2015)
36. M.S. Gazzaniga, R.B. Ivry, G.R. Mangun, Cognitive Neuroscience: The Biology of the Mind, 4th edn. (W.W. Norton and Company, New York, 2013)
37. A. Santos, Limitations of prompt-based training. J. Appl. Companion Anim. Behav. 3(1), 51–55 (2009)
38. J. Fodor, Z. Pylyshyn, Connectionism and cognitive architecture: a critical analysis. Cognition 28, 3–71 (1988)
39. T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, in ICLR Workshop (2013), arXiv:1301.3781
Footnotes
1 Today, this field of research can be found under a refreshing but very unusual name: ‘logic in
the wild’.
3 This was 15 years before artificial intelligence was defined as a scientific field.
4 The author has a fond memory of this book, but beware: here be dragons. The book is highly
complex due to archaic notation and a system quite different from today’s logic, but it is a
worthwhile read if you manage to survive the first 20 pages.
6 An additional point here is the great influence of Russell and Carnap on Pitts. It is a great
shame that many logicians today do not know of Pitts, and we hope the present volume will
help bring the story about this amazing man back to the community from which he arose, and
that he will receive the place he deserves.
7 And any other scientific discipline which might be interested in studying or using deep
neural networks.
9 Even today people consider playing chess or proving theorems as a higher form of
intelligence than for example gossiping, since they point to the rarity of such forms of
intelligence. The rarity of an aspect of intelligence does not directly correlate with its
computational properties, since problems that are computationally easy to describe are easier
to solve regardless of the cognitive rarity in humans (or machines for that matter).
10 The view is further dimmed by the fact that the perceptron could process an image (at least rudimentarily), which intuitively seems to be considerably harder than simple logical operations.
12 If you wish to try the equivalence instead of XOR, you should do the same but with (0,0) and (1,1) labelled 1, and (0,1) and (1,0) labelled 0, keeping the Os for 0 and Xs for 1. You will see it is literally the same thing as XOR in the context of our problem.
14 It must be acknowledged that Skinner, by insisting on focusing only on the objective and
measurable parts of the behaviour, brought scientific rigor into the study of behaviour, which
was previously mainly a speculative area of research.
16 The full story about Hinton and his struggles can be found at http://www.chronicle.com/
article/The-Believers/190147.
17 See http://www.ams.org/msc/.
18 See http://www.acm.org/about/class/class/2012.
19 Knowledge representation and reasoning for GOFAI, machine learning for deep learning.
20 Whether this is true or not, is irrelevant for our discussion. The literature on animal
cognitive abilities is notoriously hard to find as there are simply not enough academic studies
connecting animal cognition and ethology. We have isolated a single paper dealing with
limitations of dog learning [37], and therefore we would not dare to claim anything categorical
—just hypothetical.
21 Plato defined thinking (in his Sophist) as the soul’s conversation with itself, and this is what
we want to model, whereas the rule-based approach was championed by Aristotle in his
Organon. More succinctly, we are trying to reframe reasoning in platonic terms instead of using
the dominant Aristotelian paradigm.
22 At this point, we deliberately avoid talking of ‘valid inference’ and use the term ‘valid
thinking’.
23 Note that this interchangeability is dependent on the big picture. If I need to move a piano, I
could not do it with a car, but if I need to fetch groceries, I can do it with either the car or the
van.
2 Mathematical and Computational Prerequisites
The logarithmic function has the properties: $\log_a(xy) = \log_a x + \log_a y$, $\log_a\frac{x}{y} = \log_a x - \log_a y$, $\log_a x^n = n\log_a x$, $\log_a 1 = 0$, $\log_a a = 1$, $a^{\log_a x} = x$ and $\log_a x = \frac{\log_b x}{\log_b a}$ (the change of base rule).
The last concept we will need before continuing to derivations is the concept of a limit. An intuitive definition would be that the limit of a function is a value which the outputs of the function approach but never reach.6 The trick is that the limit of the function is considered in relation to a change in inputs and it must be a concrete value, i.e. if the limit is $\infty$ or $-\infty$, we do not call it a limit. Note that this means that for the limit to exist it must be a finite value. For example, $\lim_{x\to\infty}\frac{1}{x} = 0$. A function f is continuous at a point a if:
1. $f(a)$ is defined
2. $\lim_{x\to a} f(x)$ exists
3. $\lim_{x\to a} f(x) = f(a)$.
slope is not the same in every point and by the above calculation we will not be able to get much out of it, and we will have to use differentiation. But differentiation is still just an elaboration of the slope idea. Let us start with the slope formula and see where it takes us when we try to formalize it a bit. So we start with $\frac{f(x_2) - f(x_1)}{x_2 - x_1}$. We can denote with h the change in x with which we get from $x_1$ to $x_2$, i.e. $x_2 = x_1 + h$. This means that the numerator can be written as $f(x + h) - f(x)$, and the denominator is just h by definition of h. The derivative is then defined as the limit of that as h approaches 0, or

$$f'(x) = \lim_{h\to 0}\frac{f(x + h) - f(x)}{h} \qquad (2.1)$$
and using these rules we would quickly find that $(x^2)' = 2x$, but let us see now how we can get this by using only the definition of the derivative:
1. $f(x) = x^2$ [initial function]
2. $f'(x) = \lim_{h\to 0}\frac{(x + h)^2 - x^2}{h}$ [definition of the derivative]
3. $f'(x) = \lim_{h\to 0}\frac{x^2 + 2xh + h^2 - x^2}{h}$ [expanding the square in the numerator]
4. $f'(x) = \lim_{h\to 0}\frac{2xh + h^2}{h} = \lim_{h\to 0}(2x + h) = 2x$ [cancelling h and taking the limit]
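To make Eq. 2.1 concrete, here is a minimal Python sketch that approximates the derivative with the difference quotient; the function $x^2$ and the point $x = 3$ are our arbitrary choices for illustration.

    def numeric_derivative(f, x, h=1e-6):
        # Approximates f'(x) with the difference quotient from Eq. 2.1
        return (f(x + h) - f(x)) / h

    f = lambda x: x ** 2
    print(numeric_derivative(f, 3))  # approximately 6, since f'(x) = 2x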
placed the a that shows how a possible factor behaves. The rules for addition and subtraction are rather straightforward: $(f(x) + g(x))' = f'(x) + g'(x)$ and $(f(x) - g(x))' = f'(x) - g'(x)$. The rules for constants are equally simple: $(a \cdot f(x))' = a \cdot f'(x)$, and if $f(x) = a$ then $f'(x) = 0$.
The last rule we need is the so-called chain rule (not to be confused with the chain rule for exponents). The chain rule says $\frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}$, for a composite function $f(g(x))$. We can look at this function as if it were two functions: the first is g(u), which gives some number, and the derivative of the composition is the product of the derivatives of the two.13
To see the chain rule in action, take the function $f(x) = (x^2 + 1)^3$. With $u = g(x) = x^2 + 1$ and $f(u) = u^3$, we get $\frac{df}{dx} = 3u^2 \cdot 2x = 6x(x^2 + 1)^2$.
The reader may notice that we have been talking about the standard basis without defining what a basis is. Let V be a vector space and $B \subseteq V$. Then, B is called a basis if and only if all vectors in B are linearly independent (i.e. are not linear combinations of each other) and B is a minimally generating subset of V (i.e. it must be a minimal17 subset which can produce, with the help of Eq. 2.2, every vector in V).
We turn our attention to defining the single most important operation with vectors we will need in this book, the dot product. The dot product of two vectors (which must have the same dimensions) is a scalar. It is defined as

$$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n \qquad (2.3)$$

The norm of a vector is defined with the help of the dot product as

$$\|\mathbf{a}\| = \sqrt{\mathbf{a} \cdot \mathbf{a}} = \sqrt{\sum_{i=1}^{n} a_i^2} \qquad (2.4)$$

Bear in mind not to confuse the notation for norms with the notation for the absolute value. We will see more about the norm in the later chapters. We can convert any vector to a so-called normalized vector by dividing it with its norm:

$$\hat{\mathbf{a}} = \frac{\mathbf{a}}{\|\mathbf{a}\|} \qquad (2.5)$$
Right away we see a couple of things. First, the entries in the matrix are denoted by $a_{jk}$, where j denotes the row and k denotes the column of the given entry. A matrix has dimensions similar to a vector, but it has to have two of them. A matrix A with m rows and n columns is an $m \times n$-dimensional matrix. Note that this is not the same as an $n \times m$-dimensional matrix. We can look at a matrix as a vector of vectors (this idea has a couple of formal problems that need to be ironed out, but it is a good intuition). Here, we have two options: it could be viewed as m row vectors $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_m$ stacked in a new vector, or it could be seen as n column vectors $\mathbf{a}^1, \mathbf{a}^2, \ldots, \mathbf{a}^n$ which are then bundled together.
Either way we look at it, something is off, since we have to keep track of what is vertical and what is horizontal. It is clear that we now need to distinguish a standard, horizontal vector, called a row vector (a row of the matrix taken out, which is now just a vector), which is a $1 \times n$-dimensional matrix, from a vertical or column vector, which is an $n \times 1$-dimensional matrix
rule Exp on the first term, the differentiation variable rule DerDifVar on the second term and the constant rule Const on the third term, we get $2x + a \cdot 1 + 0$, which simplifies to $2x + a$. Let us see what we did: we took the (full) derivative of $f(x, a)$ (with a constant a in place of y), which is the same as taking the partial derivative of f(x, y). In symbols, we calculated $\frac{d f(x, a)}{dx}$, and the corresponding partial derivative is $\frac{\partial f(x, y)}{\partial x} = 2x + y$.
We need to find the value of x which results in the minimal f(x).19 From basic calculus, we know that this point will be (0, 1). The gradient of f will have a single component, $f'(x)$, corresponding with the single variable x.20 We will take steps along the negative gradient, with an additional scaling factor of 0.3. This will make us take only 30% of the step along the gradient we would normally take, and it will in turn enable us to be more precise in our quest for minimization. Later, we will call this factor the learning rate, and it will be an important part of our models.
We will be making a series of steps towards the x which will produce a minimal f(x) (or more precisely, a good approximation of the actual minimal point21). We will denote the initial x by $x_0$, and we will denote all the other xs on the road towards the minimum in a similar fashion. So to get $x_1$ we calculate $x_1 = x_0 - 0.3 \cdot f'(x_0)$. By the same procedure we get $x_2$ from $x_1$, and so on, until the changes become negligible and we stop and call it a day.22 We could continue to get better and better approximations, but we would have to stop eventually. Gradient descent will take us closer and closer to the value of x for which the function f has the minimal value, which is in our case $x = 0$.
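As a quick illustration, here is a minimal Python sketch of the procedure on the example function $f(x) = x^2 + 1$ (assumed from the minimum at (0, 1)), with the 0.3 scaling factor as the learning rate and an arbitrary starting point.

    def f(x):
        return x ** 2 + 1          # assumed example function with minimum at (0, 1)

    def grad_f(x):
        return 2 * x               # its gradient (derivative)

    x = 2.0                        # arbitrary starting point
    for step in range(20):
        x = x - 0.3 * grad_f(x)    # take only 30% of the step along the negative gradient
    print(x, f(x))                 # x is now very close to 0, and f(x) very close to 1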
$$\bar{h} = \frac{1}{n}\sum_{i=1}^{n} h_i \qquad (2.6)$$

The average height is also called the mean of the height, and we can get a
mean for any feature which has numerical values such as weight, body
mass index, etc. Features that take numerical values are called
numerical features. So the mean is a ‘numerical middle value’, but what can we do when we need a ‘middle value’ for, for example, the population’s occupation? Then, we can use the mode, which is a function which returns simply the value which occurs most often, e.g. ‘analyst’ or ‘baker’. Note that the mode can be used for numerical features, but it will treat the values 19.01, 19.02 and 19000034 as ‘equally different’. This means that if we want to take a meaningful mode of, e.g. ‘monthly salary’, we should round the salary to the nearest thousand, so
that 2345 becomes 2000 and 3987 becomes 4000. This process creates
the so-called bins of data (it aggregates the data), and this kind of data
preprocessing is called binning . This is a very useful technique since it
drastically reduces the complexity of non-numerical problems and
often gives a much clearer view of what is happening in the data.
Aside from the mean and the mode, there is a third way to look at centrality. Imagine we have a sequence 1, 2, 5, 6, 10000. With this sequence, the mode is quite useless, since no two values repeat and there is no obvious way to do binning. It is possible to take the mean, but the mean is 2002.8, which is lousy information, since it tells us nothing about any part of the sequence.25 But the reason the mean failed is due to the atypical value of 10000 in the sequence. Such atypical values are called outliers. We will be in a position to define outliers more rigorously later, but the simple intuition on outliers we have built here will be very useful for all machine learning endeavors.
Remember just that the outlier is an atypical value, not necessarily a
large value: instead of 10000, we could have had 0.0001, and this would
equally be an outlier.
When given the sequence 1, 2, 5, 6, 10000, we would like a good
measure of centrality which is not sensitive to outliers. The best-known
method is called the median . Provided that the sequence we analyse
has an odd number of elements, the median of the sequence is the value
of the middle element of the sorted sequence.26 In our case, the median
is 5. If we have the sequence 2, 1, 6, 3, 7, the median would be the
middle element of the sorted sequence 1, 2, 3, 6, 7 which is 3. We have
noted that we need an odd number of elements in the sequence, but we
can easily modify the median a bit to take care of the case when we have an even number of elements: we then sort the sequence, take the two ‘middlemost’ elements, and define the median to be the mean of those two elements. Suppose we have 4, 5, 6, 2, 1, 3; then the two elements we
need are 3 and 4, and their mean (and the median of the whole
sequence) is 3.5. Note that in this case, unlike the case with an odd
number of elements, the median is not also a member of the sequence,
but this is inconsequential for most machine learning applications.
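All three measures of central tendency are one-liners in Python; the following minimal sketch uses the standard statistics module on the sequences from this section.

    import statistics

    seq = [1, 2, 5, 6, 10000]
    print(statistics.mean(seq))         # 2002.8, dragged up by the outlier 10000
    print(statistics.median(seq))       # 5, insensitive to the outlier

    even_seq = [4, 5, 6, 2, 1, 3]
    print(statistics.median(even_seq))  # 3.5, the mean of the two middlemost elements

    jobs = ['analyst', 'baker', 'analyst']
    print(statistics.mode(jobs))        # 'analyst', the most frequent value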
Now that we have covered the measures of central tendency,27 we
turn our attention to the concepts of expected value, bias, variance and
standard deviation. But before that, we will need to address basic
probability calculations and probability distributions. Let us take a step
back and consider what probability is. Imagine we have the simplest
case, a coin toss. This process is actually a simple experiment: we have a
well-defined idea, we know all possible outcomes, but we are waiting to
see the outcome of the current coin toss. We have two possible
outcomes, heads and tails. The number of all possible outcomes will be
important for calculating basic probabilities. The second component we
need is how many times the desired outcome happens (out of all
times). In a simple coin toss, there are two possibilities, and only one of
them is heads, so $\mathbb{P}(\text{heads}) = \frac{1}{2}$, which means that the
probability of heads is 0.5. This may seem peculiar, but let us take a
more elaborate example to make it clear. Usually, probability of x is
denoted as P(x) or p(x), but we prefer the notation $\mathbb{P}(x)$ in this book,
since probability is quite a special property and should not be easily
confused with other predicates, and this notation avoids confusion.
Suppose we have a pair of D6 dice, and we want to know what is the probability of getting a five28 on them. As before, we will need to calculate $\frac{A}{B}$, where B is the total number of outcomes and A is the number of times the desired outcome happens. Let us calculate A. We can get five on two D6 dice in the following cases:
1. First die 4, second die 1
2. First die 3, second die 2
3. First die 2, second die 3
4. First die 1, second die 4
This means that A = 4, while B = 36 (all ordered pairs of values from the two dice), so the probability of getting a five on 2D629 is $\frac{4}{36} = \frac{1}{9} \approx 0.11$.
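Basic probabilities of this kind can be double-checked by brute-force enumeration; the following minimal Python sketch counts the outcomes that sum to five over all 36 outcomes.

    from itertools import product

    outcomes = list(product(range(1, 7), repeat=2))   # all 36 outcomes of two D6 dice
    fives = [o for o in outcomes if sum(o) == 5]      # (1,4), (2,3), (3,2), (4,1)
    print(len(fives), len(outcomes), len(fives) / len(outcomes))  # 4 36 0.111...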
The expected value of a probability distribution X is defined as

$$E[X] = \sum_{i} x_i \, \mathbb{P}(x_i) \qquad (2.8)$$

or, when all n outcomes are equally probable,

$$E[X] = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (2.9)$$
But let us see what is happening in the background when we talk about the expected value. We are actually producing an estimator,31 which is a function which tells us what to expect in the future. What the future will actually bring is another matter. The ‘reality’ (also known as probability distribution) is usually denoted by an uppercase letter from the back of the alphabet such as X, while an estimator for that probability distribution is usually denoted with a little hat over the letter, e.g. $\hat{X}$. The relationship between an estimator and the actual values we will be getting in the future32 is characterized by two main concepts, the bias and the variance. The bias of $\hat{X}$ relative to X is defined as

$$Bias(\hat{X}) = E[\hat{X}] - E[X] \qquad (2.10)$$

Intuitively, the bias shows by how much the estimator misses the target (on average). A related idea is the variance, which tells how much wider or narrower the estimates are compared to the actual future values:

$$Var(\hat{X}) = E[(\hat{X} - E[\hat{X}])^2] \qquad (2.11)$$
$$\sigma(\hat{X}) = \sqrt{Var(\hat{X})} \qquad (2.12)$$

For disjoint events, the probability of the union of the events is simply the sum of their probabilities. If the events are not necessarily disjoint,33 we can use the following equation:

$$\mathbb{P}(A \cup B) = \mathbb{P}(A) + \mathbb{P}(B) - \mathbb{P}(A \cap B) \qquad (2.15)$$

Finally, we can define the conditional probability of two events. The conditional probability of A given B is defined as

$$\mathbb{P}(A \mid B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)} \qquad (2.16)$$
Theorem 2.1 (Bayes’ theorem) $\mathbb{P}(A \mid B) = \frac{\mathbb{P}(B \mid A)\,\mathbb{P}(A)}{\mathbb{P}(B)}$

Proof By the definition of conditional probability, $\mathbb{P}(A \mid B)\,\mathbb{P}(B) = \mathbb{P}(A \cap B) = \mathbb{P}(B \mid A)\,\mathbb{P}(A)$. Dividing both sides by $\mathbb{P}(B)$ gives the theorem. $\square$

This is the first and only proof in this book,34 but we have included it since it is a very important piece of machine learning culture, and we believe that every reader should know how to produce it on a blank piece of paper. If we assume conditional independence of $B_1, B_2, \ldots, B_n$, then there is also a generalized form of the Bayes’ theorem to account for multiple conditions (B consists of $B_1 \cap B_2 \cap \ldots \cap B_n$):

$$\mathbb{P}(A \mid B) = \frac{\mathbb{P}(A) \prod_{i=1}^{n} \mathbb{P}(B_i \mid A)}{\mathbb{P}(B)} \qquad (2.17)$$
We see in the next chapter how this is useful for machine learning.
Bayes’ theorem is named after Thomas Bayes, who first proved it, but
the result was only published posthumously in 1763.35 The theorem
underwent formalization and the first rigorous formalization was given
by Pierre-Simon Laplace in his 1774 Memoir on Inverse probability and
later in his Théorie analytique des probabilités from 1812. A complete
treatment of Laplace’s contributions we have mentioned is available in
[9, 10].
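As a quick numerical illustration (with made-up probabilities), the following minimal Python sketch applies Bayes’ theorem.

    # Made-up probabilities for two events A and B
    p_a = 0.3          # P(A)
    p_b_given_a = 0.8  # P(B|A)
    p_b = 0.5          # P(B)

    # Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
    p_a_given_b = p_b_given_a * p_a / p_b
    print(p_a_given_b)  # 0.48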
Before leaving the green plains of probability for the desolate
mountains of logic and computability, we must address briefly another
probability distribution, the normal or Gaussian distribution. The
Gaussian distribution is characterized by the following formula:
$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}} \qquad (2.18)$$

where $\mu$ is the mean and $\sigma$ is the standard deviation.
It is quite a weird equation, but the main thing about the Gaussian
distribution is not the elegance of calculation, but rather the natural
and nice shape of the graph, which can be used in a number of ways.
You can see an illustration of what the Gaussian distribution with mean 0 and standard deviation 1 looks like (see Fig. 2.2a).
Fig. 2.2 Gaussian distribution and Gaussian cloud
The idea behind the Gaussian distribution is that many natural
phenomena seem to follow it, and in machine learning it is extremely
useful for initializing values that are random but at the same time are
centred around a value. This value is the mean, and it is usually set to 0,
but it can be anything. There is a related concept of a Gaussian cloud ,
which is made by sampling a Gaussian distribution with mean 0 for two
values at a time, adding the values to a point with coordinates (x, y)
(and drawing the results if one wishes to see it). Visually, it looks like a
‘dot’ made with the spray paint tool from an old graphical editing
program (see Fig. 2.2b).
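A Gaussian cloud is easy to produce in a few lines; this minimal sketch samples Numpy’s normal distribution two values at a time and adds them to a point, as described above (the centre point is our arbitrary choice).

    import numpy as np

    np.random.seed(0)
    center_x, center_y = 4.0, 2.0    # the point the cloud is centred around
    points = [(center_x + np.random.normal(0, 1),
               center_y + np.random.normal(0, 1)) for _ in range(500)]
    # 'points' now forms a spray-paint-like dot around (4, 2);
    # plotting it (e.g. with matplotlib) reproduces the look of Fig. 2.2b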
3. Read the next symbol and if it is a dot, remember it, go right until
you find a blank, write the dot there. Else, if the next symbol is a
separator return to the beginning and stop.
References
1. J.R. Hindley, J.P. Seldin, Lambda-Calculus and Combinators: An Introduction (Cambridge University Press, Cambridge, 2008)
2. G.S. Boolos, J.P. Burgess, R.C. Jeffrey, Computability and Logic (Cambridge University Press, Cambridge, 2007)
3. P. Renteln, Manifolds, Tensors, and Forms: An Introduction for Mathematicians and Physicists (Cambridge University Press, Cambridge, 2013)
4. R. Courant, F. John, Introduction to Calculus and Analysis, vol. 1 (Springer, New York, 1999)
5. S. Axler, Linear Algebra Done Right (Springer, New York, 2015)
6. P.N. Klein, Coding the Matrix (Newtonian Press, London, 2013)
7. H. Pishro-Nik, Introduction to Probability, Statistics, and Random Processes (Kappa Books Publishers, Blue Bell, 2014)
8. D.P. Bertsekas, J.N. Tsitsiklis, Introduction to Probability (Athena Scientific, Nashua, 2008)
9. S.M. Stigler, Laplace’s 1774 memoir on inverse probability. Stat. Sci. 1, 359–363 (1986)
10. A. Hald, Laplace’s Theory of Inverse Probability, 1774–1786 (Springer, New York, 2007), pp. 33–46
11. W. Rautenberg, A Concise Introduction to Mathematical Logic (Springer, New York, 2006)
12. D. van Dalen, Logic and Structure (Springer, New York, 2004)
13. A.M. Turing, On computable numbers, with an application to the Entscheidungsproblem. Proc. Lond. Math. Soc. 42(2), 230–265 (1936)
Footnotes
1 Notice that they also have the same number of members or cardinality, namely 2.
2 The counting starts with 0, and we will use this convention in the whole book.
3 The traditional definition uses sets to define tuples, tuples to define relations and relations to
define functions, but that is an overly logical approach for our needs in the present volume. This
definition provides a much wider class of entities to be considered functions.
6 This is why $\lim_{x\to\infty}\frac{1}{x} = 0$, even though $\frac{1}{x}$ never actually reaches 0.
8 With the exception of division where the divisor is 0. In this case, the division function is undefined, and therefore the notion of continuity does not have any meaning at that point.
9 Rational functions are of the form $\frac{f(x)}{g(x)}$ where f and g are polynomial functions.
12 The chain rule in Lagrange notation is more clumsy and void of the intuitive similarity with fractions: $(f(g(x)))' = f'(g(x)) \cdot g'(x)$.
14 These rules are not independent, since both ChainExp and Exp are a consequence of
CHAINRULE.
15 We deliberately avoid talking about fields here since we only use $\mathbb{R}$, and there is no reason to complicate the exposition.
17 A minimal subset such that a property P holds is a subset (of some larger set) of which we
can take no proper subset such that P would still hold.
18 Matrix subtraction works in exactly the same way, only with subtraction instead of addition.
19 To get the actual f(x) we just need to plug in the minimal x and calculate f(x).
20 In the case of multiple dimensions, we shall do the same calculation for every component of the gradient.
21 Note that a function can have many local minima or minimal points, but only one global
minimum. Gradient descent can get ‘stuck’ in a local minimum, but our example has only one
local minimum which is the actual global minimum.
24 Properties are called features in machine learning, while in statistics they are called
variables, which can be quite confusing, but it is standard terminology.
25 Note that the mean is equally useless for describing the first four and the last member taken
in isolation.
26 The sequence can be sorted in ascending or descending order, it does not matter.
27 This is the ‘official’ name for the mean, median and mode.
28 Not 5 on one die or the other, but 5 as in when you need to roll a 5 in to buy
29 In 2D6, the 6 denotes the number of values on each die, and the 2 denotes the number of dice used.
30 What we called here ‘basic probabilities’ are actually called priors in the literature, and we
will be referring to them as such in the later chapters.
32 Note that ideally we would like an estimator to be a perfect predictor of the future in all
cases, but this would be equal to having foresight. Scientifically speaking, we have models and
we try to make them as accurate as possible, but perfect prediction is simply not on the table.
33 ‘Disjoint’ means $A \cap B = \emptyset$.
36 This is not exactly how it behaves, but it is a simplification which is more than enough for
our needs.
37 Text editors are Notepad, Vim, Emacs, Sublime, Notepad++, Atom, Nano, cat and many
others. Feel free to experiment and find the one you like most (most are free). You might have
heard of the so-called IDEs or Integrated Development Environments. They are basically text
editors with additional functions. Some IDEs you might know of are Visual Studio, Eclipse and
PyCharm. Unlike text editors, most IDEs are not freely available, but there are free versions and
trial versions, so you may experiment with them before buying. Remember, there is nothing essential an IDE can do that a text editor cannot, but IDEs do offer additional conveniences. My personal preference is to use Vim.
39 In programming jargon, when we say ‘the syntax is the same’ or ‘you can use a similar syntax’, it means that you should try to reproduce the same style but with the new values or objects.
40 Note that even though the name we assign to a library is arbitrary, there are standard
abbreviations used in the Python community. Examples are np for Numpy, tf for TensorFlow,
pd for Pandas and so on. This is important to know since on StackOverflow you might find a
solution but without the import statements. So if the solution has np somewhere in it, it means
that you should have a line which imports Numpy with the name np.
42 In Python 3, this is no longer exactly that list, but this is a minor issue at this stage of
learning Python. What you need to know is that you can count on it to behave exactly like that
list.
43 Notice that the code, as it stands now, does not have this problem, but this is a bug since a
problem would arise if the room temperature turns out to be an odd number, and not an even
number as we have now.
44 JSON stands for JavaScript Object Notation, and JSONs (i.e. Python dictionaries) are referred
to as objects in JavaScript.
3 Machine Learning Basics
Let us see a very simple example to demonstrate how the naive Bayes
classifier works and how it draws its hyperplane. Imagine that we have
the following table detailing visits to a webpage:
Time Buy
morning no
afternoon yes
evening yes
morning yes
morning yes
afternoon yes
evening no
evening yes
morning no
afternoon no
afternoon yes
afternoon yes
morning yes
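To make the calculation concrete, here is a minimal Python sketch that computes, from the table above, the priors and likelihoods the naive Bayes classifier needs, and scores the label ‘yes’ against ‘no’ for morning visits.

    from collections import Counter

    data = [("morning", "no"), ("afternoon", "yes"), ("evening", "yes"),
            ("morning", "yes"), ("morning", "yes"), ("afternoon", "yes"),
            ("evening", "no"), ("evening", "yes"), ("morning", "no"),
            ("afternoon", "no"), ("afternoon", "yes"), ("afternoon", "yes"),
            ("morning", "yes")]

    labels = Counter(buy for _, buy in data)   # priors: 9 'yes', 4 'no'
    pairs = Counter(data)                      # counts of (time, label) pairs

    def score(time, label):
        # P(label) * P(time | label), proportional to P(label | time)
        prior = labels[label] / len(data)
        likelihood = pairs[(time, label)] / labels[label]
        return prior * likelihood

    for label in ("yes", "no"):
        print(label, score("morning", label))  # the larger score wins for 'morning'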
The logistic regression model consists of a single ‘workhorse’ neuron, which calculates the logit (also known as the weighted sum) and then applies the logistic or sigmoid function:

$$z = \mathbf{w} \cdot \mathbf{x} + b, \qquad y = \sigma(z) = \frac{1}{1 + e^{-z}}$$
We note the result 0.8163 and the actual label 1. Now we do the same
for the second input:
Noting again the result 0.7414 and label 0. And now we do it for the last
input row vector:
Noting again the result 0.8368 and the label 0. It seems quite clear that we did well on the first, but failed to classify the second and third input correctly. Now, we should update the weights somehow, but to do that we need to calculate how lousy we were at classifying. For measuring this, we will be needing an error function, and we will be using the sum of squared errors or SSE19:

$$E = \frac{1}{2}\sum_{n}(t^{(n)} - y^{(n)})^2$$
The ts are targets or labels, and the ys are the actual outputs of the model. The weird exponents ($^{(n)}$) are just indices which range across the training samples. We now update the weights $\mathbf{w}$ and b by using magic, and get a new $\mathbf{w}$ and a new b. Later (in Chap. 4), we will see it is
actually done by something called the general weight update rule. This
completes one cycle of weight adjustment. This is colloquially called an
epoch, but we will redefine this term later in Chap. 4 to make it more
precise. Let us recalculate the outputs and the new SSE to see whether
the new set of weights is better.
We can see clearly that the overall error has decreased. We can
continue this procedure a number of times, and the error will decrease,
until at one point it will stop decreasing and stabilize. On rare
occasions, it might even exhibit chaotic behaviour. This is the essence of
logistic regression, and the very core of deep learning—everything we
do will be an upgrade or modification of this.
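The whole cycle fits in a few lines of Numpy. The following is a minimal sketch of it; the inputs, targets and initial weights are made up, and merely stand in for the ones used in this section.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Made-up stand-ins for the three input row vectors and their labels
    X = np.array([[0.2, 1.0, 0.3], [1.1, 0.4, 0.9], [0.5, 0.8, 1.2]])
    t = np.array([1.0, 0.0, 0.0])
    w = np.array([0.5, -0.2, 0.3])
    b = 0.1
    eta = 0.5                            # learning rate

    for epoch in range(10):
        z = X @ w + b                    # logits (weighted sums)
        y = sigmoid(z)                   # outputs
        sse = 0.5 * np.sum((t - y) ** 2)
        grad_z = -(t - y) * y * (1 - y)  # dE/dz for each sample
        w -= eta * (X.T @ grad_z)        # weight update along the negative gradient
        b -= eta * np.sum(grad_z)
        print(epoch, sse)                # watch the SSE shrink over the epochs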
Let us turn our attention to data representation. So far we have used an expanded view of the process so that we may see everything clearly, but let us see how we can make the procedure more compact and computationally faster. Notice that even though a dataset is a set (and the order does not matter), it might make a bit of sense to put $\mathbf{x}^{(1)}$, $\mathbf{x}^{(2)}$ and $\mathbf{x}^{(3)}$ in a vector, since we will be using them one by one (the vector would then simulate a queue or stack). But since they also share the same structure (same features in the same place in each row vector), we might opt for a matrix to represent the whole training set. This is important in the computational sense as well, since most deep learning libraries have C somewhere in the background, and arrays (the programming equivalent of matrices) are a native data structure in C, and computation on them is incredibly fast.
So what we want to do is first turn the n d-dimensional input vectors into an input matrix of size $n \times d$. In our case, this is a $3 \times d$ matrix, with each input row vector becoming one row of the matrix.
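The computational gain is easy to see in code; this minimal sketch computes the same logits once with a Python loop over row vectors and once with a single matrix product.

    import numpy as np

    X = np.random.rand(3, 4)   # n = 3 input row vectors, d = 4 features
    w = np.random.rand(4)
    b = 0.2

    # One by one, as in the expanded view
    z_loop = [float(row @ w + b) for row in X]

    # All at once, as one matrix-vector product
    z_matrix = X @ w + b

    print(np.allclose(z_loop, z_matrix))  # True: same logits, computed in one call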
$$D(C) = \frac{\min_{C' \neq C} d(C, C')}{\mathrm{diam}(C)} \qquad (3.16)$$

where C is the cluster for which we calculate the Dunn coefficient, $d(C, C')$ is the distance between the clusters and $\mathrm{diam}(C)$ is the diameter of C. The Dunn coefficient is calculated for each cluster, and the quality of each cluster can be assessed by it. The Dunn coefficient can be used to evaluate different clusterings by taking the average of the Dunn coefficients for each cluster in both clusterings30 and then comparing them.
References
1. R. Tibshirani, T. Hastie, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edn. (Springer, New York, 2016)
2. F. van Harmelen, V. Lifschitz, B. Porter, Handbook of Knowledge Representation (Elsevier Science, New York, 2008)
3. R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction (MIT Press, Cambridge, 1998)
4. J.R. Quinlan, Induction of decision trees. Mach. Learn. 1, 81–106 (1986)
5. M.E. Maron, Automatic indexing: an experimental inquiry. J. ACM 8(3), 404–417 (1961)
6. D.R. Cox, The regression analysis of binary sequences (with discussion). J. Roy. Stat. Soc. B (Methodol.) 20(2), 215–242 (1958)
7. P.J. Grother, NIST special database 19: handprinted forms and characters database (1995)
8. Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
9. M.A. Nielsen, Neural Networks and Deep Learning (Determination Press, 2015)
10. P.N. Klein, Coding the Matrix (Newtonian Press, London, 2013)
11. I. Färber, S. Günnemann, H.P. Kriegel, P. Kröger, E. Müller, E. Schubert, T. Seidl, A. Zimek, On using class-labels in evaluation of clusterings, in MultiClust: Discovering, Summarizing, and Using Multiple Clusterings, ed. by X.Z. Fern, I. Davidson, J. Dy (ACM SIGKDD, 2010)
12. J. Dunn, Well separated clusters and optimal fuzzy partitions. J. Cybern. 4(1), 95–104 (1974)
13. K. Pearson, On lines and planes of closest fit to systems of points in space. Phil. Mag. 2(11), 559–572 (1901)
14. C. Manning, H. Schütze, Foundations of Statistical Natural Language Processing (MIT Press, Cambridge, 1999)
15. D. Jurafsky, J. Martin, Speech and Language Processing (Prentice Hall, New Jersey, 2008)
16. S.P. Lloyd, Least squares quantization in PCM. IEEE Trans. Inf. Theory 28(2), 129–137 (1982)
17. E.W. Forgy, Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics 21(3), 768–769 (1965)
Footnotes
1 You may wonder how a side gets a label, and this procedure is different for the various machine learning algorithms and has a number of peculiarities, but for now you may just think that the side will get the label which the majority of datapoints on that side have. This will usually be true, but it is not an elegant definition. One case where this is not true is the case where you have only one dog with two cats overlapping it (in 2D space) and four other cats. Most classifiers will place the dog and the two cats in the category ‘dog’. Cases like this are rare, but they may be quite meaningful.
5 Think about how one-hot encoding can boost the understanding of n-dimensional space.
7 Notice that to do one-hot encoding, it needs to make two passes over the data: the first
collects the names of the new columns, then we create the columns, and then we make another
pass over the data to fill them.
8 Strictly speaking, these vectors would not look exactly the same: the training sample would
be (54,17,1,0,0, Dog), which is a row vector of length 6, and the row vector for which we want to
predict the label would have to be of length 5 (without the last component which is the label),
e.g. (47,15,0,0,1).
9 If we need more precision, we will keep more decimals, but in this book we will usually round off to four.
10 It is mostly a matter of choice; there is no objective way of determining how much to split.
11 The prior probability is just a matter of counting. If you have a dataset with 20 datapoints and in some feature there are five values of ‘New Vegas’ while the others (15 of them) are ‘Core region’, the prior probability $\mathbb{P}(\text{New Vegas}) = \frac{5}{20} = 0.25$.
15 Afterwards, we may do a bit of feature engineering and use an altogether different model. This is important when we do not have an understanding of the data we use, which is often the case in industry.
16 We will see later that logistic regression has more than one neuron, since each component
of the input vector will have to have an input neuron, but it has ‘one’ neuron in the sense of
having a single ‘workhorse’ neuron.
17 If the training set consists of n-dimensional row vectors, then there are exactly $n - 1$ features—the last one is the target or label.
19 There are other error functions that can be used, but the SSE is one of the simplest.
22 See http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec1.pdf.
23 Available at https://www.kaggle.com/c/digit-recognizer/data.
26 K-means (also called the Lloyd-Forgy algorithm) was first proposed independently by S.P. Lloyd in [16] and E.W. Forgy in [17].
28 Imagine that a centroid is pinned down and connected to all its datapoints with rubber
bands, and then you unpin it from the surface. It will move so that the rubber bands are less
tense in total (even though individual rubber bands may become more tense).
29 Recall that a cluster in K-means is a region around a centroid separated by the hyperplane.
30 We have to use the same number of centroids in both clusterings for this to work.
32 One of the reasons for this is that we have not yet developed all the tools we need to write
out the details now.
33 See Chap. 2.
34 And if a feature is always the same, it has a variance of 0 and it carries no information useful
for drawing the hyperplane.
35 An example of an expansion of the basic bag of words model is a bag of n-grams. An n-gram is an n-tuple consisting of n words that occur next to each other. If we have a sentence ‘I will go now’, the set of its 2-grams will be {(I, will), (will, go), (go, now)}.
36 For most language processing tasks, especially tasks requiring the use of data collected from social media, it makes sense to convert all text to lowercase first and get rid of all commas, apostrophes and non-alphanumerics, which we have already done here.
4 Feedforward Neural Networks
data, and we will always choose the representation that makes it easier
and faster to compute the operations we might need. In our choice of
data representation, we are not constrained by anything else but
computational efficiency.
As we already noted, every neuron from the input layer is connected to every neuron from the hidden layer, but neurons of the same layer are not interconnected. Every connection between neuron j in layer k and neuron m in layer n has a weight denoted by $w_{jm}^{kn}$, and, since it is usually clear from the context which layers are concerned, we may omit the superscript and write simply $w_{jm}$. The weight regulates how much of the initial value will be forwarded to a given neuron, so if the input is 12 and the weight to the destination neuron is 0.25, the destination will receive the value 3. The weights can decrease the value, but they can also increase it, since they are not bound between 0 and 1.
Once again we return to Fig. 4.1 to explain the zoomed neuron on the right-hand side. The zoomed neuron (neuron 3 from layer 2) gets as its input the sum of the products of the inputs from the previous layer and the respective weights. In this case, the inputs are $x_1$, $x_2$ and $x_3$, and the weights are $w_{13}$, $w_{23}$ and $w_{33}$. Each neuron has a modifiable value in it, called the bias, which is represented here by $b_3$, and this bias is added to the previous sum. The result of this is called the logit and traditionally denoted by z (in our case, $z_3$). Some simpler models1 simply give the logit as the output, but most models apply a nonlinear function (also called a nonlinearity or activation function and represented by ‘S’ in Fig. 4.1) to the logit to produce the output. The output is traditionally denoted with y (in our case the output of the zoomed neuron is $y_3$).2 The nonlinearity can be generically referred to as S(x) or by the name of the given function. The
most common function used is the sigmoid or logistic function. We have
encountered this function before, when it was the main function in
logistic regression. The logistic function takes the logit z and returns as its output $\sigma(z) = \frac{1}{1 + e^{-z}}$. The logistic function ‘squashes’ all it receives to a value between 0 and 1, and the intuitive interpretation of its meaning is that it calculates the probability of the output given the input.
A couple of remarks. Different layers may have different nonlinearities, which we shall see in the later chapters, but all neurons of the same layer apply the same nonlinearity to their logits. Also, the output of a neuron is the same value in every direction it sends it. Returning to the zoomed neuron in Fig. 4.1, the neuron sends its output $y_3$ in two directions, and both of them receive the same value. As a final remark, following Fig. 4.1 again, note that the logits in the next layer will be calculated in the same manner: as the sum of the products of the outputs of the current layer and the respective weights, plus the bias. The same holds for all the layers that follow.
Let us call this matrix W (we can add subscripts or superscripts to its name). Using matrix multiplication we get an $m \times 1$ matrix, namely the column vector $\mathbf{z} = W\mathbf{x}$.
where $x_i$ are the inputs, $w_i$ the weights, b is the bias and z is the logit. The second equation defines the decision, which is usually done with the nonlinearity, but here a binary step function is used instead (hence the name). We take a digression to show that it is possible to absorb the bias as one of the weights, so that we only need a weight update rule. This is displayed in Fig. 4.3: to absorb the bias as a weight, one needs to add an input $x_{n+1}$ with value 1, and the bias is its weight. Note that this is exactly the same:
1. $z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b$,
2. $z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b \cdot 1$,
3. $z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + w_{n+1} x_{n+1}$, where $w_{n+1} = b$ and $x_{n+1} = 1$,
4. $z = \sum_{i=1}^{n+1} w_i x_i$.
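The absorption is easy to verify numerically; this minimal sketch (with made-up numbers) shows that appending an input fixed at 1, whose weight is the bias, yields exactly the same logit.

    import numpy as np

    x = np.array([0.5, -1.2, 2.0])   # made-up inputs
    w = np.array([0.4, 0.1, -0.3])   # made-up weights
    b = 0.7                          # made-up bias

    z_with_bias = x @ w + b          # explicit bias

    x_ext = np.append(x, 1.0)        # add an input with value 1
    w_ext = np.append(w, b)          # the bias becomes its weight
    z_absorbed = x_ext @ w_ext

    print(np.isclose(z_with_bias, z_absorbed))  # True: exactly the same logit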
Each meal obeys $total = quant_c \cdot ppk_c + quant_z \cdot ppk_z + quant_r \cdot ppk_r$,
where total is the total price, quant is the quantity and ppk is
the price per kilogram for each component (chicken, zucchini and rice). Each meal has a total price
we know, and the quantities we know. So each meal places a linear
constraint on the ppk-s. But with only this we cannot solve it. If we plug
into this formula our initial (or subsequently corrected) ‘guesstimate’9, we
will also get the predicted value, and by comparing it with the true
(target) total value we will get an error value which will tell us by
how much we missed. If after each meal we miss by less, we are doing a
great job.
Let us imagine that each component has a fixed true price per kilogram,
unknown to us. Let us start with a guesstimate of $ppk_c = 6$,
$ppk_z = 3$ and $ppk_r = 3$. We know we bought 0.23 kg of
chicken, 0.15 kg of zucchini and 0.27 kg of rice and that we paid
3 € in total. By multiplying our guessed prices with the quantities we
get 1.38, 0.45 and 0.81, which totals to 2.64, which is 0.36 less than the
true price. This value is called the residual error, and we want to
minimize it over the course of future iterations (meals), so we need to
distribute the residual error to the ppk-s. We do this simply by changing
the ppk-s by:
$$\Delta ppk_i = \eta \cdot quant^{(n)}_i \cdot (t^{(n)} - y^{(n)})$$
where $\eta$ is a small constant (the learning rate) and $t^{(n)}$ denotes the target for the training case n (same for $y^{(n)}$, the predicted total, and $quant^{(n)}_i$).
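The whole procedure fits in a few lines of code. The following is a minimal sketch, where the learning rate of 1.0 and the 40 repetitions are illustrative choices; with a single meal there are many price combinations that total 3 €, so the rule settles on just one of them, which is exactly why more meals (more linear constraints) are needed to recover the true prices:

quant = [0.23, 0.15, 0.27]   # kg of chicken, zucchini and rice
target = 3.0                 # total price paid in euro
ppk = [6.0, 3.0, 3.0]        # initial guesstimate of the prices per kilogram
eta = 1.0                    # illustrative learning rate

for meal in range(40):       # re-use the same meal over and over
    predicted = sum(q * p for q, p in zip(quant, ppk))
    residual = target - predicted
    # distribute the residual error to the ppk-s, proportionally
    # to the quantities (the delta rule)
    ppk = [p + eta * q * residual for p, q in zip(ppk, quant)]

print(ppk, sum(q * p for q, p in zip(quant, ppk)))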
Recall that z is the logit. Let us absorb the bias right away, so we do not
have to deal with it separately. We will calculate the derivative of the
logistic neuron with respect to the weights, and the reader can adapt
the procedure to the simpler linear neuron if she likes. As we noted
before, the chain rule is your best friend for obtaining derivatives, and
the ‘middle variable’ of the chain rule will be the logit. The first part
is $\frac{\partial z}{\partial w_i}$, which is equal to $x_i$ since $z = \sum_i w_i x_i$ (we absorbed the bias). By the
same argument, $\frac{\partial z}{\partial x_i} = w_i$.
We can now start deriving $\frac{dy}{dz}$. We start with the definition for y, i.e.
with
$$y = \frac{1}{1+e^{-z}}$$
Therefore,
$$\frac{dy}{dz} = \frac{e^{-z}}{(1+e^{-z})^2}$$
Let us factorize the right-hand side in two factors which we will call A
and B:
$$\frac{dy}{dz} = \frac{1}{1+e^{-z}} \cdot \frac{e^{-z}}{1+e^{-z}} = A \cdot B$$
It is obvious that $A = y$ from the definition of y. Let us turn our
attention to B:
$$B = \frac{e^{-z}}{1+e^{-z}} = \frac{(1+e^{-z}) - 1}{1+e^{-z}} = 1 - \frac{1}{1+e^{-z}} = 1 - y$$
so that
$$\frac{dy}{dz} = y(1-y)$$
The next thing we need is $\frac{\partial E}{\partial y}$.12 We will be using the same rules for this
derivative. By applying LD we get
$$\frac{\partial E}{\partial y} = y - t$$
Putting the pieces together with the chain rule, we get
$$\frac{\partial E}{\partial w_i} = \frac{\partial z}{\partial w_i}\cdot\frac{dy}{dz}\cdot\frac{\partial E}{\partial y} = x_i\,y(1-y)\,(y-t)$$
Note that this is very similar to the delta rule for the linear neuron, but
it also has something extra: the part $y(1-y)$ is the slope of the logistic
function.
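As a sanity check, the whole derivative can be computed in a few lines. The inputs, weights and target below are hypothetical:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical training case: inputs (with the bias absorbed as a
# trailing input of 1), weights and a target
x = np.array([0.5, -1.2, 1.0])
w = np.array([0.1, 0.4, -0.2])
t = 1.0

y = sigmoid(np.dot(w, x))              # forward pass
dE_dw = x * y * (1.0 - y) * (y - t)    # x_i times the sigmoid slope times (y - t)
print(y, dE_dw)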
4.6 Backpropagation
So far we have seen how to use derivatives to learn the weights of a
logistic neuron, and without knowing it we have already made excellent
progress with understanding backpropagation, since backpropagation
is actually the same thing but applied more than once to
‘backpropagate’ the errors through the layers. The logistic regression
(consisting of the input layer and a single logistic neuron), strictly
speaking, did not need to use backpropagation, but the weight learning
procedure described in the previous section actually is a simple
backpropagation. As we add layers, we will not have more complex
calculations, but just a large number of those calculations.
Nevertheless, there are some things to watch out for.
We will write out all the necessary details for backpropagation for
the feedforward neural networks, but first, we will build up the
intuition behind it. In Chap. 2 we have explained gradient descent, and
we will revisit some of the concepts here as needed. Backpropagation of
errors is basically just gradient descent. Mathematically speaking,
backpropagation is:
$$\Delta w = -\eta\frac{\partial E}{\partial w}$$
where $w$ is the weight, $\eta$ is the learning rate (for simplicity you can
think of it as just being 1 for now) and $E$ is the cost function measuring
overall performance. We could also write it in computer science
notation as a rule that assigns to $w$ a new value:
$$w \leftarrow w - \eta\frac{\partial E}{\partial w}$$
2. Change the weight back to its initial value, subtract $\varepsilon$ from it
and re-evaluate the error (this will be $E^{-}$); a code sketch of this finite-difference check follows below.
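Here is a minimal sketch of the finite-difference check for one weight of a logistic neuron; all values are hypothetical, and the estimate should agree closely with the analytic derivative $x_i\,y(1-y)(y-t)$ from the previous section:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def E(w, x, t):
    # squared error of a logistic neuron for one training case
    return 0.5 * (t - sigmoid(np.dot(w, x))) ** 2

x = np.array([0.5, -1.2, 1.0])   # hypothetical inputs (bias absorbed)
w = np.array([0.1, 0.4, -0.2])   # hypothetical weights
t, eps, i = 1.0, 1e-6, 0         # target, perturbation, index of the weight

w_plus, w_minus = w.copy(), w.copy()
w_plus[i] += eps                 # add eps and re-evaluate the error
w_minus[i] -= eps                # subtract eps and re-evaluate the error
estimate = (E(w_plus, x, t) - E(w_minus, x, t)) / (2 * eps)
print(estimate)                  # close to the analytic derivative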
the first thing we need to do is turn the difference between the output
and the target value into an error derivative. We have done this already
in the previous sections of this chapter:
$$\frac{\partial E}{\partial y} = y - t$$
From this we obtain the updates for the weights:
$$\Delta w_i = -\eta\frac{\partial E}{\partial w_i}$$
The $\eta$ is the learning rate, and the minus sign is here to make sure we go
towards minimizing E; otherwise we would be maximizing it. We can
also state it in vector notation17 to get rid of the indices:
$$\Delta \mathbf{w} = -\eta\,\nabla E$$
We will address these issues in more detail later, but before that, we
will show a detailed calculation for error backpropagation in a simple
neural network, and in the next section, we will code the network. The
remainder of this chapter is probably the most important part of the
whole book, so be sure to go through all the details.
Let us see a working example18 of a simple and shallow feedforward
neural network. The network is represented in Fig. 4.5. Using the
notation, the starting weights and the inputs specified in the image, we
will calculate all the intricacies of the forward pass and
backpropagation for this network. Notice the enlarged neuron D. We
have used this to illustrate where the logit $z_D$ is and how it becomes
the output of D ($y_D$) by applying to it the logistic function $\sigma$.
Fig. 4.5 Backpropagation in a complete simple neural network
We will assume (as we did previously) that all the neurons have a
logistic activation function. So we need to do a forward pass, a
backpropagation, and a second forward pass to see the decrease in the
error. Let us briefly comment on the network itself. Our network has
three layers, with the input and hidden layers consisting of two
neurons each, and the output layer, which consists of one neuron. We have
denoted the layers with capital letters, but we have skipped the letter E
to avoid confusing it with the error function, so we have neurons
named A, B, C, D and F. This is not usual. The usual procedure is to name
them by referring to the layer and neuron in the layer, e.g. ‘third neuron
in the first layer’ or ‘1, 3’. The input layer takes in two inputs: the
neuron A takes in $x_1 = 0.23$ and the neuron B takes in $x_2 = 0.82$. The
target for this training case (consisting of $x_1$ and $x_2$) will be 1. As we
noted earlier, the hidden and output layers have the logistic activation
function (also called logistic nonlinearity), which is defined as
$\sigma(z) = \frac{1}{1+e^{-z}}$.
And now we use $y_C$ and $y_D$ as inputs to the neuron F, which will give us
the final result.
Now, we need to calculate the output error. Recall that we are using the
mean squared error function, i.e. $E = \frac{1}{2}(t-y)^2$. So we plug in the target
and the output of the forward pass to get the error.
Now we are all set to calculate the derivatives. We will explain how to
calculate the update for one weight going into the output layer and for
one going into the hidden layer; all other weights are calculated with the same
procedure. As backpropagation proceeds in the opposite direction from
the forward pass, calculating the output-layer weight update is easier, and we will do that first. We
need to know how a change in that weight affects E, and we want to make those
changes which minimize E. As noted earlier, the chain rule for
derivatives will do most of the work for us. Let us rewrite what we need
to calculate:
$$\frac{\partial E}{\partial w} = \frac{\partial E}{\partial y}\cdot\frac{\partial y}{\partial z}\cdot\frac{\partial z}{\partial w}$$
We have found the derivatives for all of these factors in the previous sections,
so we will not repeat their derivations. Note that we need to use partial
derivatives because every derivative is taken with respect to an indexed
term. Also, note that the vector containing all partial derivatives (for all
indices i) is the gradient. Let us now address the last factor: as we have seen
earlier, $\frac{\partial z}{\partial w}$ is simply the input attached to the weight in question. All that is left
to do is to use these values in the general weight update rule20 (we use a
learning rate $\eta$):
Now we can continue to the next layer. But an important note first: to
find the derivatives for the hidden-layer weights we will be needing the values
of the weights above them, and we will be using the old values, not the updated ones.
We will update the whole network only when we have all the updated
weights. We proceed to the hidden layer. What we need to do now is to find
the update for a hidden-layer weight. Notice that to get from the output neuron F to this weight we
need to go across C, so we will be using:
We start with $\frac{\partial E}{\partial y_C}$; then we need $\frac{\partial y_C}{\partial z_C} = y_C(1-y_C)$; and finally $\frac{\partial z_C}{\partial w} = x$, where $x$ is the input attached to the weight in question.
We can now make another forward pass with the new weights and
recompute the error, which shows
that the error has decreased. Note that we have processed
only one training sample, i.e. the input vector (0.23, 0.82). It is possible
to use multiple training samples to generate the error and find the
gradients (mini-batch training21), and we can do this a number of times
and each repetition is called an iteration. Iterations are sometimes
erroneously called epochs. The two terms are very similar and we can
consider them synonyms for now, but quite soon we will need to
delineate the difference, and we will do this in the next chapter.
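The whole worked example can be condensed into a short script. The inputs (0.23, 0.82) and the target 1 are the ones used above, while the starting weights, biases and the learning rate below are illustrative stand-ins for the values in Fig. 4.5; running it prints a smaller error on the second forward pass:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.23, 0.82])      # inputs from the worked example
t = 1.0                         # target
W1 = np.array([[0.1, 0.3],      # illustrative weights into C and D
               [0.2, 0.4]])
b1 = np.array([0.1, 0.1])
W2 = np.array([0.5, 0.6])       # illustrative weights from C and D into F
b2 = 0.1
eta = 0.1                       # illustrative learning rate

for step in range(2):           # two iterations, so the error is printed twice
    h = sigmoid(W1 @ x + b1)    # outputs of C and D
    y = sigmoid(W2 @ h + b2)    # output of F
    E = 0.5 * (t - y) ** 2
    print("error:", E)

    # backpropagation (the chain rule), using the old weights throughout
    delta_out = (y - t) * y * (1 - y)           # dE/dz_F
    dW2 = delta_out * h
    delta_hid = delta_out * W2 * h * (1 - h)    # dE/dz_C and dE/dz_D
    dW1 = np.outer(delta_hid, x)

    # update the whole network at once
    W2 -= eta * dW2
    b2 -= eta * delta_out
    W1 -= eta * dW1
    b1 -= eta * delta_hid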
An alternative to this would be to update the weights after every
single training example.22 This is called online learning. In online
learning, we process a single input vector (training sample) per
iteration. We will discuss this in the next chapter in more detail.
In the remainder of this chapter, we will integrate all the ideas we
have presented so far in a fully functional feedforward neural network,
written in Python code. This example will be fully functional Python 3.x
code, but we will write out some things that could be better left for a
Python module to do.
Technically speaking, in anything but the most basic setting, we
shall not use the SSE, but its variant, the mean squared error (MSE).
This is because we need to be able to rewrite the cost function as the
average of the cost functions $E_x$ for individual training samples x,
and we therefore define $E := \frac{1}{n}\sum_x E_x$.
Footnotes
1 These models are called linear neurons.
2 For linear neurons we still want to use the same notation, but we set $y = z$.
3 Formally speaking, all units using the perceptron rule should be called perceptrons, not just
binary threshold units.
4 The target is also called expected value or true label, and it is usually denoted by t.
5 As a simple application, think of an image recognition system for security cameras, where
one needs to classify numbers seen regardless of their orientation.
7 For example, if we only buy chicken, then it would be easy to get the price of the chicken
analytically as $ppk_c = total / quant_c$, and we get the exact price.
8 In practical terms this might seem far more complicated than simply asking the person
serving you lunch the price per kilogram for components, but you can imagine that the person
is the soup vendor from the soup kitchen from the TV show Seinfeld (116th episode, or
S07E06).
9 A guessed estimate. We use this term just to note that, for now, we should keep things
intuitive and not guess an initial value of, e.g., 12000, 4533233456 or 0.0000123; not because it will
be impossible to solve, but because it will need many more steps to assume a form where we
could see the regularities appear.
10 Not in the sense that they are the same formula, but that they refer to the same process and
that one can be derived from the other.
11 For the sake of easy readability, we deliberately combine Newton and Leibniz notation in
the rules, since some of them are more intuitive in one, while some of them are more intuitive
in the second. We refer the reader back to Chap. 1 where all the formulations in both notations
were given.
12 Strictly speaking, we would need a derivative for each output, but this generalization is trivial and we chose the simpler notation.
13 A definition is circular if the same term occurs in both the definiendum (what is being
defined) and the definiens (with which it is defined), i.e. on both sides of $=$ (or, more precisely, of
$:=$), and in our case this term could be w. A recursive definition has the same term on both sides,
but on the defining side (definiens) it has to be ‘smaller’, so that one could resolve the definition
by going back to the starting point.
14 If you recall, the perceptron rule also qualifies as a ‘simpler’ way of learning weights, but it
had the major drawback that it cannot be generalized to multiple layers.
15 Although it must be said that the whole field of deep learning is centered around
overcoming the problems with gradient descent that arise when using it in deep networks.
19 The only difference is the step for , where there is a 0 now for and a 1 for .
But, sometimes it is not that easy to find such a property. Trying to find
such a property is what a supervised machine learning algorithm does.
So the problem might be rephrased as trying to find a complex property
which defines a type as well as possible (by trying to include the
biggest possible number of tokens while including only the relevant
tokens in the definition). Therefore, overfitting can be understood in
another way: our classifier is so good that we are not only capturing the
necessary properties from our training examples, but also the non-
necessary or accidental properties. So, we would like to capture all the
properties which we need, but we want something to help us stop when
we begin including the non-necessary properties.
Underfitting and overfitting are the two extremes. Empirically
speaking, we can really go from high bias and low variance to high
variance and low bias. We want to stop at a point in between, and we want
this point to have better-than-average generalization capabilities
(inherited from the higher bias), and a good fit to the data (inherited
from high variance). How to find this ‘sweet spot’ is the art of machine
learning, and the received wisdom in the machine learning community
will insist it is best to find this by hand. But it is not impossible to
automate, and deep learning, wanting to become a contender for
artificial intelligence, will automate as much as possible. There is one
approach which tries to automate our intuitions about overfitting, and
this idea is called regularization.
Why are we talking about overfitting and not underfitting?
Remember that if we have a very high bias we will end up with a linear
classifier, and linear classifiers cannot solve the XOR or similarly simple
problems. What we want then is to significantly lower the bias until we
have reached the point after which we are overfitting. In the context of
deep learning, after we have added a layer to logistic regression, we
have said farewell to high bias and sailed away towards the shores of
high variance. This sounds very nice, but how can we stop in time? How
can we prevent overfitting? The idea of regularization is to add a
regularization term to the error function E, so for $L_2$ regularization we will have
$$E_{new} = E + \frac{\lambda}{2}\sum_{w}w^2$$
One might wonder whether this would actually make the weights
converge to 0, but this is not the case, since the first component $E$
(this part controls the unregularized error) will increase the weights if
the reduction in error is significant.
We can now proceed to briefly sketch $L_1$ regularization.
$L_1$ regularization, also known as ‘lasso’ or ‘basis pursuit denoising’, was
first proposed by Robert Tibshirani in 1996 [4]. $L_1$ regularization uses
the absolute value instead of the squares:
$$E_{new} = E + \lambda\sum_{w}|w|$$
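In Keras, both kinds of regularization can be attached to a layer directly. A sketch under illustrative assumptions (the layer sizes, the input dimension and the rate 0.01 are all placeholder choices):

from keras.models import Sequential
from keras.layers import Dense
from keras import regularizers

model = Sequential()
model.add(Dense(64, activation='sigmoid', input_dim=20,
                kernel_regularizer=regularizers.l2(0.01)))  # L2 ('ridge')
model.add(Dense(1, activation='sigmoid',
                kernel_regularizer=regularizers.l1(0.01)))  # L1 ('lasso')
model.compile(optimizer='sgd', loss='mean_squared_error')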
Let us return a bit to our bowl. So far we had a round bowl, but
imagine we have a shallow bowl of the shape of an elongated ellipse
(Fig. 5.3). If we drop the marble near the narrow middle, we will have
almost the same situation as before. But if we drop the marble at
the top left portion, it will move along a very shallow curvature and it
will take a very large number of epochs to find its way towards the
bottom of the bowl. The learning rate can help here. If we take only a
fraction of the move, the direction of the curvature for the next move
will be considerably better than if we move from one edge of a shallow
and elongated bowl to the opposing edge. It will make smaller steps but
it will find a good direction much more quickly.
This leaves us with discussing the typical values for the learning
rate . The values most often used are 0.1, 0.01, 0.001, and so on.
Values like 0.03 will simply get lost and behave very similarly to the
closest order of magnitude, which is 0.01 in the case of 0.03.4 The learning rate is a
hyperparameter, and like all hyperparameters it has to be tuned on the
validation set. So, our suggestion is to try with some of the standard
values for a given hyperparameter and then see how it behaves and
modify it accordingly.
We turn our attention now to an idea similar to the learning rate
but different, called momentum, also called inertia. Informally speaking,
the learning rate controls how much of the move to keep in the present
step, while momentum controls how much of the move from the
previous step to keep in the current step. The problem which
momentum tries to solve is the problem of local minima. Let us return
to our idea with the bowl but now let us modify the bowl to have local
minima. You can see the lateral view in Fig. 5.4. Notice that the learning
rate was concerned with the ‘top’ view whereas the momentum
addresses problems with the ‘lateral’ view.
The marble falls down as usual (depicted as grey in the image) and
continues along the curvature, and stops when the curvature is 0
(depicted by black in the image). But the problem is that the curvature
0 is not necessarily the global minimum, it is only local. If it were a
physical system, the marble would have momentum and it would roll
over the local minimum towards the global minimum, where it would go back
and forth a bit and then settle. Momentum in neural networks
is just the formalization of this idea. Momentum, like the learning rate,
is added to the general weight update rule:
$$w^{new} = w^{old} - \eta\frac{\partial E}{\partial w} + \mu\,(w^{old} - w^{older})$$
where $w^{new}$ is the current weight to be computed, $w^{old}$ is the previous
value of the weight, $w^{older}$ was the value of the weight before that, and $\mu$ is the momentum rate.
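A minimal sketch of this update rule on a toy one-dimensional error surface, assuming illustrative values for $\eta$ and $\mu$:

eta, mu = 0.01, 0.9
w, w_old = 1.0, 1.0

def grad(w):
    # a toy error surface E(w) = (w - 3)^2, so dE/dw = 2(w - 3)
    return 2.0 * (w - 3.0)

for step in range(100):
    w_new = w - eta * grad(w) + mu * (w - w_old)
    w_old, w = w, w_new

print(w)   # approaches the minimum at w = 3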
Dropout was first explained in [10], but one can find more details
about it in [11] and especially [12]. Dropout is a surprisingly simple
technique. We add a dropout parameter $p$ ranging from 0 to 1 (to be
interpreted as a probability), and in each epoch every weight is set to
zero with a probability of $p$ (Fig. 5.5). Returning to the general weight
update rule (where we need a $\frac{\partial E}{\partial w}$ for calculating the weight updates),
a weight that has been set to zero simply takes no part in that epoch.
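In code, dropout for one epoch amounts to multiplying the weights by a random binary mask. A sketch, with an illustrative $p$ and weight matrix:

import numpy as np

p = 0.5
W = np.array([[0.2, -0.4, 0.1],
              [0.7,  0.3, -0.5]])

# each weight is zeroed with probability p for the current epoch
mask = (np.random.rand(*W.shape) >= p).astype(float)
W_dropped = W * mask    # the zeroed weights take no part in this epoch
print(W_dropped)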
Just by looking at the amount of the weight update you might notice
that two weights have been updated with a significantly larger amount
than the other weights. These two weights are the ones
connecting the output layer with the hidden layer; the rest of the
weights connect the input layer with the hidden layer. But why are they
larger? The reason is that we backpropagate through fewer layers to
reach them, so their updates remain larger: backpropagation is, structurally speaking,
just the chain rule, and the chain rule is just multiplication of derivatives.
And the derivatives of everything we needed10 have values between 0 and
1. So, by adding layers through which we have to backpropagate, we
need to multiply more and more numbers between 0 and 1, and such a product generally
becomes very small very quickly. And this is without
regularization; with regularization it would be even worse, since regularization
prefers small weights at all times (and since the weight updates would
be small because of the derivatives, there would be little chance for the
unregularized part to increase the weights). This phenomenon is called the
vanishing gradient.
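The effect is easy to see numerically. In the sketch below we multiply one sigmoid slope $y(1-y)$ per layer (the activations are illustrative); each factor is at most 0.25, so the product shrinks fast:

# one sigmoid slope y(1-y) per layer; the activations are illustrative
y_per_layer = [0.6, 0.7, 0.55, 0.65, 0.4, 0.5, 0.45, 0.6]  # 8 layers

grad = 1.0
for y in y_per_layer:
    grad *= y * (1.0 - y)    # one factor of the chain rule per layer
    print(grad)
# after 8 layers the product is already of the order of 1e-5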
We could try to circumvent this problem by initializing the weights
to very large values and hoping that backpropagation will just chip them down
to the correct values.11 In this case, we might get a very large gradient,
which would also hinder learning: a step in the direction of the
gradient would be a step in the right direction, but the magnitude of the step
could take us farther away from the solution than we were before the
step. The moral of the story is that usually the problem is the vanishing
gradient, but if we radically change our approach we will be blown in
the opposite direction (the exploding gradient), which is even worse. Gradient descent, as a
method, is simply too unstable if we use many layers through which we
need to backpropagate.
To put the importance of the vanishing gradient problem in perspective, we
must note that the vanishing gradient is the problem to which deep
learning is the solution. What truly defines deep learning are the
techniques which make it possible to stack many layers and yet avoid the
vanishing gradient problem. Some deep learning techniques deal with
the problem head on (LSTM), while some are trying to circumvent it
(convolutional neural networks), some are using different connections
than simple neural networks (Hopfield networks), some are hacking
the solution (residual connections), while some have been using weird
neural network phenomena to gain the upper hand (autoencoders).
The rest of this book is devoted to these techniques and architectures.
Historically speaking, the vanishing gradient was first identified by
Sepp Hochreiter in 1991 in his diploma thesis [15]. His thesis advisor
was Jürgen Schmidhuber, and the two would go on to develop one of the most
influential recurrent neural network architectures (the LSTM) in 1997 [16],
which we will explore in detail in the following chapters. An interesting
paper by the same authors which brings more detail to the discussion
of the vanishing gradient is [17].
We make a final remark before continuing to the second part of this
book. We have chosen what we believe to be the most popular and
influential neural architectures, but there are many more and many
more will be discovered. The aim of this book is not to provide a
comprehensive view of everything there is or will be, but to help the
reader acquire the knowledge and intuition needed to pursue research-
level deep learning papers and monographs. This is not a final tome
about deep learning, but a first introduction which is necessarily
incomplete. We made a serious effort to include a range of neural
architectures which will demonstrate to the reader the vast richness
and fulfilling diversity of this amazing field of cognitive science and
artificial intelligence.
References
1. A.N. Tikhonov, On the stability of inverse problems. Dokl. Akad. Nauk SSSR 39(5), 195–198
(1943)
[MathSciNet]
2.
A.N. Tikhonov, Solution of incorrectly formulated problems and the regularization method.
Sov. Math. 4, 1035–1038 (1963)
[zbMATH]
3.
M.A. Nielsen, Neural Networks and Deep Learning (Determination Press, 2015)
4.
R. Tibshirani, Regression shrinkage and selection via the lasso. J. Roy. Stat. Soc. Ser B
(Methodol.) 58(1), 267–288 (1996)
[MathSciNet][zbMATH]
5.
A. Ng, Feature selection, L1 versus L2 regularization, and rotational invariance, in
Proceedings of the International Conference on Machine Learning (2004)
6.
D.L. Donoho, Compressed sensing. IEEE Trans. Inf. Theory 52(4), 1289–1306 (2006)
[MathSciNet][Crossref]
7.
E.J. Candes, J. Romberg, T. Tao, Robust uncertainty principles: exact signal reconstruction
from highly incomplete frequency information. IEEE Trans. Inf. Theory 52(2), 489–509
(2006)
[MathSciNet][Crossref]
8.
J. Wen, J.L. Zhao, S.W. Luo, Z. Han, The improvements of BP neural network learning
algorithm, in Proceedings of 5th International Conference on Signal Processing (IEEE Press,
2000), pp. 1647–1649
9.
D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning internal representations by error
propagation. Parallel Distrib. Process. 1, 318–362 (1986)
10.
G.E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Improving neural
networks by preventing co-adaptation of feature detectors (2012)
11.
G.E. Dahl, T.N. Sainath, G.E. Hinton, Improving deep neural networks for LVCSR using
rectified linear units and dropout, in IEEE International Conference on Acoustic Speech and
Signal Processing (IEEE Press, 2013), pp. 8609–8613
12.
N. Srivastava, G.E. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple
way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958
(2014)
[MathSciNet][zbMATH]
13.
Y. Bengio, J. Louradour, R. Collobert, J. Weston, Curriculum learning, in Proceedings of the
26th Annual International Conference on Machine Learning, ICML 2009, New York, NY, USA,
(ACM, 2009), pp. 41–48
14.
R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction (MIT Press, Cambridge,
1998)
[zbMATH]
15.
S. Hochreiter, Untersuchungen zu dynamischen neuronalen Netzen, Diploma thesis,
Technische Universität Munich, 1991
16.
S. Hochreiter, J. Schmidhuber, Long short-term memory. Neural Comput. 9(8), 1735–1780
(1997)
[Crossref]
17.
S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber, Gradient flow in recurrent nets: the
difficulty of learning long-term dependencies, in A Field Guide to Dynamical Recurrent
Neural Networks, ed. by S.C. Kremer, J.F. Kolen (IEEE Press, 2001)
Footnotes
1 We will be using a modification of the explanation offered by [3]. Note that this book is
available online at http://neuralnetworksanddeeplearning.com.
2 We take the idea for this abstraction from Geoffrey Hinton’s courses.
3 This is actually also a technique used to prevent overfitting, called early stopping.
4 You can use the learning rate to force a gradient explosion, so if you want to see gradient
explosion for yourself, try an $\eta$ value of 5 or 10.
5 We have been clumsy around several things, and this section is intended to redefine them a
bit to make them more precise.
6 We could also use a non-random selection. One of the most interesting ideas here is that of
learning the simplest instances first and then proceeding to the trickier ones; this
approach is called curriculum learning. For more on this see [13].
7 This is similar to reinforcement learning, which is, along with supervised and unsupervised
learning, one of the three main areas of machine learning, but we have decided against including
it in this volume, since it falls outside of the idea of a first introduction to deep learning. If
the reader wishes to learn more, we refer her to [14].
8 Suppose for the sake of clarification it is non-randomly divided: the first batch contains
training samples 1 to 1000, the second 1001 to 2000, etc.
9 A single hidden layer with two neurons in it. If it was (3, 2, 4, 1) we would know it has two
hidden layers, the first one with two neurons and the second one with four.
10 Ok, we have adjusted the values to make this statement true. Not all of the
derivatives we need are bound between 0 and 1, but the sigmoid derivatives
are mathematically bound between 0 and 1, and if we have many layers (e.g. 8), the sigmoid
derivatives dominate backpropagation.
11 If the regular approach was something like making a clay statue (removing clay, but
sometimes adding), the intuition behind initializing the weights to large values would be taking
a block of stone or wood and start chipping away pieces.
Padding in 2D is simply a ‘frame’ of n pixels around the image. Note that
it does not make much sense to use a padding of say 3 (pixels) if we use
only a 3 by 3 local receptive field, since it will only go one pixel over the
image border.
Fig. 6.3 A convolutional neural network with a convolutional layer, a max-pooling layer, a
flattening layer and a fully connected layer with one neuron
a 0 0 1 0 0 1 0
b 0 0 0 0 0 0 1
c 0 0 0 1 0 0 0
d 1 1 0 0 0 0 0
S 0 0 0 0 1 0 0
a 0 1 0 0 0 0 0
b 0 0 1 0 0 0 0
c 0 0 0 0 0 0 0
d 1 0 0 0 0 0 0
S 0 0 0 0 0 0 0
But for convolutional neural networks, all input matrices must
have the same dimension, so we fix a maximal length $l$. All inputs whose
length is greater than $l$ are clipped to $l$, and all of the inputs whose length is
less than $l$ are padded by adding enough zeros to the right side to make
their length exactly $l$. This is why the authors used the reversing:
when clipping, we lose only the more remote information at the beginning,
and not the more recent information at the end.
We might ask how to make a Keras-friendly dataset from these. The
first task is to view them as a tensor. This just means to collect all of the
$M$ by $l$ matrices and add a third dimension along which they will
be ‘glued’. This simply means that if we have 1000 $M$ by $l$ matrices,
we will make one $M$ by $l$ by 1000 tensor. Depending on the
implementation you will use, it might make sense to make a 1000 by $M$
by $l$ tensor instead. Now initialize this tensor (a 3D Numpy array) with all
zeros, and devise a function which will put a 1 where it should be. Try
to write Keras code which implements this architecture. As always, if
you get stuck, StackOverflow it. If you have never done anything similar
before, it might take you even a week12 to get it to work, even though
the end result does not have many lines of code. This is a great exercise
in deep learning, so don’t skip it.
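As a starting point, here is a sketch of building such a tensor with Numpy; the alphabet, the maximal length and the sample strings are illustrative, and the reversing and clipping follow the description above:

import numpy as np

alphabet = ['a', 'b', 'c', 'd', 'S']          # 'S' for space, as in the text
char_to_row = {c: i for i, c in enumerate(alphabet)}
M, l = len(alphabet), 7
samples = ['dba cdS', 'abd']                  # illustrative strings to encode

tensor = np.zeros((len(samples), M, l))       # samples-by-M-by-l layout
for n, s in enumerate(samples):
    s = s[::-1][:l]                           # reverse, then clip to length l
    for j, c in enumerate(s):
        tensor[n, char_to_row.get(c, char_to_row['S']), j] = 1.0

print(tensor.shape)                           # shorter strings stay zero-padded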
Footnotes
1 Yann LeCun once said in an interview that he prefers the name ‘convolutional network’
rather than ‘convolutional neural network’.
2 An image in this sense is any 2D array with values between 0 and 255. In Fig. 6.1 we have
numbered the positions, and you may think of them as ‘cell numbers’, in the sense that they will
contain some value, but the number on the image denotes only their order. In addition, note
that if we have e.g. 100 by 100 RGB images, each image would be a 3D array (tensor) with
dimensions (100, 100, 3). The last dimension of the array would hold the three channels, red,
green and blue.
3 Here you might notice how important is weight initialization. We do have some techniques
that are better than random initialization, but to find a good weight initialization strategy is an
important open research problem.
4 If using padding we will keep the same size, but still expand the depth. Padding is useful
when there is possibly important information on the edges of the image.
5 You have everything you need in this book to get the array (tensor) with the feature maps,
and even to squash it to 2D, but you might have to search the Internet to find out how to
visualize the tensor as an image. Consider it a good (but advanced) Python exercise.
6 If it has 100 neurons per layer, with only one output neuron, that adds up to an enormous
number of parameters, and that is without the biases!
10 Trivially, every paper will have a ‘trickiest part’, and it is your job to learn how to decode
this part, since it is often the most important part of the paper.
11 Since the whole alphabet will not fit on a page, but you can easily imagine how it will
expand to the normal English alphabet.
We can make this more readable by condensing it to two equations:
$$h^{(t)} = f(W_x x^{(t)} + W_h h^{(t-1)} + b_h) \qquad (7.5)$$
$$y^{(t)} = g(W_y h^{(t)} + b_y) \qquad (7.6)$$
where $f$ is the nonlinearity of the hidden layer, and $g$ is the
nonlinearity of the output layer, which are not necessarily the same
function, but they can be the same if we want. This type of recurrent
neural network is called an Elman network [3], after the linguist and
cognitive scientist Jeffrey L. Elman.
If we change the $h^{(t-1)}$ for $y^{(t-1)}$ in Eq. 7.5, so that it becomes
$$h^{(t)} = f(W_x x^{(t)} + W_h y^{(t-1)} + b_h) \qquad (7.7)$$
we get a Jordan network [4].
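A single step of such a network is only a couple of lines of code. The sketch below implements Eqs. 7.5 and 7.6 with illustrative sizes and random weights, taking both $f$ and $g$ to be logistic functions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x_t = np.array([0.5, 0.1])          # current input
h_prev = np.zeros(3)                # previous hidden state
W_x = np.random.randn(3, 2) * 0.1   # input-to-hidden weights
W_h = np.random.randn(3, 3) * 0.1   # hidden-to-hidden weights
W_y = np.random.randn(1, 3) * 0.1   # hidden-to-output weights
b_h, b_y = np.zeros(3), np.zeros(1)

h_t = sigmoid(W_x @ x_t + W_h @ h_prev + b_h)   # Eq. 7.5
y_t = sigmoid(W_y @ h_t + b_y)                  # Eq. 7.6
print(h_t, y_t)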
Fig. 7.4 Cell state (a), forget gate (b), input gate (c) and output gate (d)
The first gate is the forget gate, which is emphasized in Fig. 7.4b.
The name ‘gate’ comes from analogies with the logic gates. The forget
gate at unit t is denoted by $f^{(t)}$, and is simply
$$f^{(t)} = \sigma(W_f[h^{(t-1)}, x^{(t)}] + b_f)$$
Intuitively, it controls how much of the
weighted raw input and weighted previous hidden state is to be
remembered. Note that $\sigma$ is the symbol for the logistic function.
Regarding weights, there are different approaches, but we consider
the most intuitive to be the one which breaks the weights up into several
different matrices, $W_f$, $W_i$, $W_C$ and $W_o$, one per gate.5 The point to remember is
that there are different ways to look at the weights and some of them
try to keep the same names as they had in simpler models, but the most
natural approach for deep learning is to think of an architecture as
composed of basic ‘building blocks’ to be assembled together like
LEGO® bricks, and then each block should have its own set of weights.
All of the weights in a complete neural network are trained together
with backpropagation, and the joint training is what actually makes a neural
network a connected whole (just as each LEGO brick has its own
studs to connect to other bricks to make a structure).
The next gate (emphasized in Fig. 7.4c), called the input gate, is a bit
more complex. It basically decides what to put in the cell state. It is
composed of another forget-style gate (which we unimaginatively denote
by $i^{(t)}$) with different weights, but it also has an additional
module which creates candidates to be added to the cell state. The
$i^{(t)}$ can be thought of as a saving mechanism, which controls how
much of the input we will save to the cell state. In symbols:
$$i^{(t)} = \sigma(W_i[h^{(t-1)}, x^{(t)}] + b_i) \qquad (7.8)$$
$$\tilde{C}^{(t)} = \tanh(W_C[h^{(t-1)}, x^{(t)}] + b_C) \qquad (7.9)$$
And now finally, we have the complete LSTM. Just a quick final remark:
the $o^{(t)}$ can be thought of as a ‘focus’ mechanism which tries to say
what the most important part of the cell state is. You might wonder what
really distinguishes $f^{(t)}$, $i^{(t)}$ and $o^{(t)}$, but the idea is that they all participate in
different parts of the computation and, as such, we hope they will take on the mechanisms
we want (‘remember from the last unit’, ‘save the input’ and ‘focus on this part
of the cell state’, respectively). Remember that this is only our wild
hope; we have no way to ‘force’ this interpretation on the LSTM other
than with the sequence of calculations or flow of information we have
chosen to use. This means that these interpretations are metaphorical,
and only if we have made a one-in-a-million lucky guesstimate will
these mechanisms actually coincide with the mechanisms in the human
brain.
The LSTM was first proposed by Hochreiter and
Schmidhuber in 1997 [2], and it has become one of the most
important deep architectures for natural language processing, time
series analysis and many other sequential tasks. Today, one of the best
reference books on recurrent neural networks is [5], and we highly
recommend it to any reader who wishes to specialize in these amazing
architectures.
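Putting the gates together, one step of an LSTM cell looks as follows in code. This is a sketch in the concatenated-weights notation used above; all sizes and weight values are illustrative:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_h, n_x = 3, 2
x_t = np.array([0.5, -0.2])
h_prev, C_prev = np.zeros(n_h), np.zeros(n_h)
hx = np.concatenate([h_prev, x_t])      # the concatenation [h_prev, x_t]

rng = np.random.default_rng(0)
W_f, W_i, W_C, W_o = (rng.normal(0, 0.1, (n_h, n_h + n_x)) for _ in range(4))
b_f = b_i = b_C = b_o = np.zeros(n_h)

f_t = sigmoid(W_f @ hx + b_f)           # forget gate
i_t = sigmoid(W_i @ hx + b_i)           # input gate
C_tilde = np.tanh(W_C @ hx + b_C)       # candidate cell state
C_t = f_t * C_prev + i_t * C_tilde      # new cell state
o_t = sigmoid(W_o @ hx + b_o)           # output gate
h_t = o_t * np.tanh(C_t)                # new hidden state
print(h_t)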
Let us see how this works in a whole example. Say we want to calculate
the gradient for the output weights $W_y$: since $W_y$ is applied after the
recurrence, the time component plays no part in its gradient, and it is
computed just as in a feedforward network. As expected, for the recurrent
weights $W_h$ ($W_x$ is similar) it is a bit different: their gradient
contributions have to be collected over all the time steps through which
we backpropagate.
References
1. J.J. Hopfield, Neural networks and physical systems with emergent collective computational
abilities. Proc. Nat. Acad. Sci. U.S.A 79(8), 2554–2558 (1982)
[MathSciNet][Crossref]
2.
S. Hochreiter, J. Schmidhuber, Long short-term memory. Neural Comput. 9(8), 1735–1780
(1997)
[Crossref]
3.
J.L. Elman, Finding structure in time. Cogn. Sci. 14, 179–211 (1990)
[Crossref]
4.
M.I. Jordan, Attractor dynamics and parallelism in a connectionist sequential machine, in
Proceedings of the 26th Annual International Conference on Machine Learning, Erlbaum, NJ,
USA (Cognitive Science Society, 1986), pp. 531–546
5.
A. Graves, Supervised Sequence Labelling with Recurrent Neural Networks (Springer, New
York, 2012)
[Crossref]
6.
A. Gulli, S. Pal, Deep Learning with Keras (Packt publishing, Birmingham, 2017)
Footnotes
1 In machine learning literature, it is common to find the notation $\hat{y}$, which denotes the result
from the predictor, while y is kept for denoting target values. We have used a different notation,
more common in deep learning, where y denotes the output from the predictor, and t is used to
denote actual values or targets.
3 We used the shades of grey just to visually denote the gradual transition to the proper
notation.
5 Notice that we are not quite precise here and that the $h^{(t-1)}$ in the LSTMs is actually the same
as in the SRN and not a component of the old cell state $C^{(t-1)}$.
6 Which you can get either from the book’s GitHub repository, or by typing all the code in
this section into one simple file (.txt) and renaming it to change its extension to .py.
8 Where we have more than two classes. Note that in binary classification, where we have two
classes, say A and B, we actually do a classification (with, e.g., the logistic function in the
output layer) into only one of them and get a probability score $p_A$. The probability score of B is
then calculated as $p_B = 1 - p_A$.
9 This is perhaps the single most challenging task in this book, but do not skip it since it will be
extremely useful for a good understanding, and it is just four lines of code.
8. Autoencoders
The only condition is that all eigenvectors are linearly independent.
Since a covariance matrix $\Sigma$ is a symmetric matrix with linearly independent
eigenvectors, we can use the eigendecomposition, which
holds for any covariance matrix $\Sigma$:
$$\Sigma = Q\Lambda Q^{\top}$$
where $Q$ is the orthogonal matrix whose columns are the eigenvectors of $\Sigma$
and $\Lambda$ is the diagonal matrix of its eigenvalues.
We now have to choose a matrix Q so that we get what we want
(correlation zero and features ordered according to variance). We
simply choose the matrix of eigenvectors just described. Then, for the
transformed data $Z = XQ$, we have:
$$\Sigma_Z = Q^{\top}\Sigma Q = \Lambda \qquad (8.10)$$
Let us see what we have achieved. All elements except the diagonal
elements of $\Sigma_Z$ are zero, which means that the only correlation left in Z
is along the diagonal. This is the covariance of each variable with itself,
which is actually the variance we have encountered earlier, and the
matrix is ordered in descending variance ($\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n$).
This is everything we wanted. Note that
we have done PCA for the 2D case with matrices, but the same ideas
hold for tensors. More on the principal component analysis can be
found in [2].
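A sketch of this procedure with Numpy, on illustrative random data: we compute the covariance matrix, take its eigendecomposition, order the eigenvectors by descending eigenvalue and transform the data, after which the covariance of the result is (approximately) diagonal:

import numpy as np

X = np.random.randn(200, 2) @ np.array([[2.0, 0.7], [0.0, 0.5]])
X = X - X.mean(axis=0)                 # center the data

Sigma = np.cov(X, rowvar=False)        # covariance matrix
eigvals, Q = np.linalg.eigh(Sigma)     # eigh: for symmetric matrices
order = np.argsort(eigvals)[::-1]      # sort by descending variance
Q = Q[:, order]

Z = X @ Q                              # transformed, decorrelated data
print(np.cov(Z, rowvar=False))         # approximately diagonal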
So we have seen how we can create a different representation of the
same data such that the features it is described with have a covariance
of zero, and are sorted by variance. In doing so we have created a
distributed representation of the data, since a column named ‘height’
does not exist anymore, and we have synthetic columns. The point here
is that we can build various distributed representations, but we have to
know what constraint we want the final data to obey. If we want this
constraint to be left unspecified and we want to specify it not directly
but by feeding examples, then we will have to employ a more general
approach. This is the approach that leads us to autoencoders, which
offer a surprising generality across many tasks.
Fig. 8.1 Plain vanilla autoencoder, simple autoencoder, sparse autoencoder, denoising
autoencoder, contractive autoencoder
All of the autoencoders are used to preprocess data for a simple
feed-forward neural network. This means that we have to get the
preprocessed data from the autoencoder. This data is not the output of
the whole autoencoder, but the output of the middle (hidden) layer,
which is the layer that does the donkey work.
Let us address a technical issue. We have seen but not formally
introduced the concept of a latent variable. A latent variable is a
variable which lies in the background and is correlated with one or
many ‘visible’ variables. We have seen an example in Chap. 3 when we
addressed PCA in an informal manner, and we had synthetic properties
behind ‘height’ and ‘weight’. These are a prime example of a latent
variable. When we hypothesize a latent variable (or create it), we
postulate we have a probability distribution to define it. Note that it is a
philosophical question whether we discover or define latent variables,
but it is clear that we want our latent variables (the defined ones) to
follow as closely as possible the latent variables in nature (the ones that
we measure or discover). A distributed representation is a probability
distribution of latent variables which will hopefully match the objective latent
variables; learning will conclude when they are very similar. This
means that we have to have a way of measuring similarity between
probability distributions. This is usually done via the Kullback-Leibler
divergence, which is defined as:
$$D_{KL}(P\,\|\,Q) = \sum_{x} P(x)\log\frac{P(x)}{Q(x)} \qquad (8.11)$$
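For two discrete distributions this is a one-liner; the distributions P and Q below are illustrative:

import numpy as np

P = np.array([0.1, 0.4, 0.5])
Q = np.array([0.3, 0.4, 0.3])

kl = np.sum(P * np.log(P / Q))
print(kl)        # 0 if and only if P and Q are identical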
Fig. 8.2 Stacking a (4, 3, 4) and a (4, 2, 4) autoencoder resulting in a (4, 3, 2, 3, 4) stacked
autoencoder
References
1. S. Axler, Linear Algebra Done Right (Springer, New York, 2015)
[zbMATH]
2.
R. Vidal, Y. Ma, S. Sastry, Generalized Principal Component Analysis (Springer, London, 2016)
[Crossref]
3.
I. Goodfellow, Y. Bengio, A. Courville, Deep Learning (MIT Press, Cambridge, 2016)
[zbMATH]
4.
D.H. Ballard, Modular learning in neural networks, in AAAI-87 Proceedings (AAAI, 1987), pp.
279–284
5.
Y. LeCun, Modeles connexionnistes de l’apprentissage (Connectionist Learning Models)
(Université P. et M. Curie (Paris 6), 1987)
6.
P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, P.-A. Manzagol, Stacked denoising autoencoders:
learning useful representations in a deep network with a local denoising criterion. J. Mach.
Learn. Res. 11, 3371–3408 (2010)
[MathSciNet][zbMATH]
7.
Q.V. Le, M.A. Ranzato, R. Monga, M. Devin, K. Chen, G.S. Corrado, J. Dean, A.Y. Ng, Building
high-level features using large scale unsupervised learning, in Proceedings of the 29th
International Conference on Machine Learning. ICML (2012)
Footnotes
1 The expected value is actually the weighted sum, which can be calculated from a frequency
table. If 3 out of 5 students got the grade ‘5’, and the other two got the grade ‘3’, the expected
value is $\frac{3}{5}\cdot 5 + \frac{2}{5}\cdot 3 = 4.2$.
2 We omit the proof but it can be found in any linear algebra textbook, such as e.g. [1].
3 Numpy is the Python library for handling arrays and fast numerical computations.
4 Try ‘adam’.
5 Try ‘binary_crossentropy’.
Context Word
‘are’ ‘who’
‘who’, ‘you’ ‘are’
‘are’, ‘that’ ‘you’
‘you’, ‘you’ ‘that’
‘that’, ‘do’ ‘you’
‘you’, ‘not’ ‘do’
‘do’, ‘know’ ‘not’
‘not’, ‘your’ ‘know’
‘know’, ‘history’ ‘your’
‘your’ ‘history’
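The table above lists the (context, word) pairs for the sentence ‘who are you that you do not know your history’ with a context of one word on each side. A sketch that generates the same pairs:

sentence = "who are you that you do not know your history".split()

pairs = []
for i, word in enumerate(sentence):
    # one word before and one word after, clipped at sentence edges
    context = sentence[max(0, i - 1):i] + sentence[i + 1:i + 2]
    pairs.append((context, word))

for context, word in pairs:
    print(context, "->", word)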
Footnotes
1 If the context were 2, it would take 4 words, two before the main word and two after.
2 If we were to save and load from a H5 file, we would be saving and loading all the weights in a
new network of the same configuration, possibly fine-tuning them, and then taking out just the
weight matrix with the same code we used here.
3 More precisely: to transform the matrix into a decorrelated matrix whose columns are
arranged in descending variance and then keep the first two columns.
The energy of the network is given by
$$E = -\frac{1}{2}\sum_{i\neq j} w_{ij}\,s_i\,s_j - \sum_i b_i\,s_i \qquad (10.3)$$
As learning progresses, this energy either stays the same or diminishes,
and this is how Hopfield networks reach local minima. Each local
minimum is a memory of some training samples. Remember logical
functions and logistic regression? We needed two input neurons and
one output neuron for conjunction and disjunction, and an additional
hidden one for XOR. We need three neurons in Hopfield networks for
conjunction and disjunction and four for XOR.
We turn to a subclass of Boltzmann machines, called restricted
Boltzmann machines (RBM) [3]. Structurally speaking, restricted
Boltzmann machines are just Boltzmann machines where there are no
connections between neurons of the same layer (hidden to hidden and
visible to visible). This seems like a minor point, but this actually makes
it possible to use a modification of the backpropagation used in feed-
forward networks. The restricted Boltzmann machine therefore has
two layers, a visible one and a hidden one. The visible layer (this is true for
Boltzmann machines in general) is the place where we put in inputs
and read out outputs. Denote the inputs with $v$ and the biases of the
hidden layer with $b_h$. Then, during the forward pass (see Fig. 10.2b),
the hidden layer activations are computed as $h = \sigma(Wv + b_h)$.
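A sketch of the forward pass (and, for completeness, the reconstruction that uses the transposed weights); the sizes, weights and visible vector are illustrative, and $b_v$ denotes the visible-layer biases, a name we introduce here:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_v, n_h = 6, 3
W = rng.normal(0, 0.1, (n_h, n_v))
b_h, b_v = np.zeros(n_h), np.zeros(n_v)

v = np.array([1, 0, 1, 1, 0, 0], dtype=float)   # visible units (the input)
h_prob = sigmoid(W @ v + b_h)                   # forward pass: hidden probabilities
h = (rng.random(n_h) < h_prob).astype(float)    # sample binary hidden states
v_recon = sigmoid(W.T @ h + b_v)                # backward pass: reconstruction
print(h_prob, v_recon)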
7. Counting: 68%
8. Lists: 77%
References
1. J.J. Hopfield, Neural networks and physical systems with emergent collective computational
abilities. Proc. Nat. Acad. Sci. U.S.A 79(8), 2554–2558 (1982)
[MathSciNet][Crossref]
2.
D.H. Ackley, G.E. Hinton, T. Sejnowski, A learning algorithm for boltzmann machines. Cogn.
Sci. 9(1), 147–169 (1985)
[Crossref]
3.
P. Smolensky, Information processing in dynamical systems: foundations of harmony
theory, in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, ed.
by D.E. Rumelhart, J.L. McClelland, the PDP Research Group, (MIT Press, Cambridge)
4.
G.E. Hinton, S. Osindero, Y.-W. Teh, A fast learning algorithm for deep belief nets. Neural
Comput. 18(7), 1527–1554 (2006)
[MathSciNet][Crossref]
5.
Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle, Greedy layer-wise training of deep
networks, in Proceedings of the 19th International Conference on Neural Information
Processing Systems (MIT Press, Cambridge, 2006), pp. 153–160
6.
Y. Bengio, Learning deep architectures for AI. Found. Trends Mach. Learn. 2(1), 1–127
(2009)
[MathSciNet][Crossref]
7.
I. Goodfellow, Y. Bengio, A. Courville, Deep Learning (MIT Press, Cambridge, 2016)
[zbMATH]
8.
W. Bechtel, A. Abrahamsen, Connectionism and the Mind: Parallel Processing, Dynamics and
Evolution in Networks (Blackwell, Oxford, 2002)
9.
A. Graves, G. Wayne, I. Danihelka, Neural turing machines (2014), arXiv:1410.5401
10.
J. Weston, S. Chopra, A. Bordes, Memory networks, in ICLR (2015), arXiv:1410.3916
11.
S. Sukhbaatar, A. Szlam, J. Weston, End-to-end memory networks (2015), arXiv:1503.08895
12.
J. Weston, A. Bordes, S. Chopra, A.M. Rush, B. van Merriënboer, A. Joulin, T. Mikolov, Towards
ai-complete question answering: A set of prerequisite toy tasks, in ICLR (2016), arXiv:1502.
05698
13.
T. Winograd, Understanding Natural Language (Academic Press, New York, 1972)
[Crossref]
Footnotes
1 For a fully detailed view, see the blog entry of one of the creators of the NTM, https://
medium.com/aidangomez/the-neural-turing-machine-79f6e806c0a1.
2 By default, memory networks make one hop, but it has been shown that multiple hops are
beneficial, especially in natural language processing.
3 Winograd sentences are sentences of a particular form, where the computer should resolve
the coreference of a pronoun. They were proposed as an alternative to the Turing test, since the
Turing test has some deep flaws (deceptive behaviour is encouraged), and it is hard to quantify
its results and evaluate it on a large scale. Winograd sentences are sentences of the form ‘I tried
to put the book in the drawer but it was too [big/small]’, and they are named after Terry
Winograd who first considered them in the 1970s [13].
11. Conclusion
9. Can we prove theoretical results for deep learning which use more
than just formalized simple networks with linear activations
(threshold gates)?
12. Are local minima a fact of life or only an inherent limitation of the
presently used architectures? It is known that adding hand-
crafted features helps, and that deep neural networks are capable
of extracting features themselves, but why do they get stuck?
Curriculum learning helps a lot in some cases, and we can ask
whether a curriculum is necessary for some tasks.
14. Can deep networks be adapted to learn from trees and graphs, not
just vectors?
Footnotes
1 Books, journal articles, Arxiv, Coursera, Udacity, Udemy, etc—there is a vast universe of
resources out there.
2 I do not know whose proverb it is, but I do know it was someone’s, and I would be very
grateful if a reader who knows the author contacts me.
Index
A
Accidental properties
Accuracy
Activation function
Adaptable learning rate
Analogical reasoning
Artificial intelligence
Autoencoder
B
BAbI
Backpropagation
Backpropagation through time
Bag
Bag of words
Bayes theorem
Benchmark
Bernoulli distribution
Bias
Bias absorption
Binary threshold neuron
Binning
Boltzmann machine
C
Categorical features
CBOW
Centroid
Chain rule
Classification
Clustering
Cognitive science
Committee
Confusion matrix
Connectionism
Continuous function
Contrastive divergence
Convergence
Convolutional layer
Convolutional neural network
Corpus
Correlation
Cosine similarity
Covariance
Covariance matrix
Cross-entropy error function
Curriculum learning
D
Datapoint
Dataset
1D convolutional layer
2D convolutional layer
Deep belief networks
Delta rule
Distributed representations
DMB
Dot product
Dropout
Dunn coefficient
E
E
Early stopping
Eigendecomposition
Eigenvalue
Eigenvectors
Elman networks
Epoch
Error function
Estimator
Euclidean distance
Euclidean norm
Expected value
F
False negative
False positive
Feature engineering
Feature maps
Features
Feed-forward neural network
Finite difference approximation
Forward pass
Fragment
Fully-connected layer
Function minimization
G
Gaussian cloud
Gaussian distribution
General weight update rule
GOFAI
Gradient
Gradient descent
H
Hadamard product
Hopfield networks
Hyperbolic tangent
Hyperparameter
Hyperplane
I
Iterable
Iteration
J
Jordan networks
K
K-means
Kullback-Leibler divergence
L
L1 regularization
L2 norm
L2 pooling
L2 regularization
Label
Latent variable
Learning rate
Limit
Linear combination
Linear constraint
Linearly separable
Linear neuron
List
Local minima
Local receptive field
Local representations
Logistic function
Logistic neuron
Logistic regression
Logit
Low faculty reasoning
LSTM
M
Markov assumption
Matrix transposition
Max-pooling
Mean squared error
Median
Memory networks
MNIST
Mode
Momentum
Momentum rate
Monotone function
MSE
Multiclass classification
Multilayered perceptron
Mutability
N
Naive Bayes classifier
Necessary property
Neural language models
Neural Turing machines
Neuron
Noise
Nonlinearity
Normalized vector
Numpy
O
One-hot encoding
Online learning
Ordinal feature
Orthogonal matrix
Orthogonal vectors
Orthonormal
Overfitting
P
Padding
Parameters
Parity
Partial derivative
PCA
Perceptron
Positive-definite matrix
Precision
Prior
Probability distribution
Python indentation
Q
Qualia
R
Recall
Receptive field
Recurrent neural networks
Regularization
Regularization rate
Reinforcement learning
ReLU
Restricted Boltzmann machine
Ridge regression
Row vector
S
Scalar multiplication
Sentiment analysis
Shallow neural networks
Sigmoid function
Simple recurrent networks
Skip-gram
Softmax
Sparse encoding
Square matrix
Standard basis
Standard deviation
Step function
Stochastic gradient descent
Stride
Supervised learning
Support vector machine
Symmetric matrix
T
Target
Tensor
Test set
Tikhonov regularization
Train-test split
True negative
True positive
Tuple
U
Underfitting
Uniform distribution
Unit matrix
Unsupervised learning
Urelements
V
Validation set
Vanishing gradient
Variance
Vector
Vector component
Vector dimensionality
Vector space
Voting function
W
Weight
Weight decay
Winograd sentences
Word embedding
Word2vec
Workhorse neuron
X
XOR
Z
Zero matrix