Neural Networks and Deep Learning - Deep Learning Explained To Your Granny - A Visual Introduction For Beginners Who Want To Make Their Own Deep Learning Neural Network (Machine Learning)
By
Pat Nakamoto
Contents
Introduction
Neural Networks and Deep Learning
Chapter 1. A brief introduction to Machine Learning
Notes to this Chapter
What is Machine Learning?
Two main Types of Machine Learning Algorithms
A practical example of Supervised Learning
Key points of this Chapter
Chapter 2. Neural Networks
Notes to this Chapter
What are Neural Networks?
McCulloch-Pitts's Neuron
Types of activation function
Types of network architectures
Learning processes
Advantages and disadvantages
Key points of this Chapter
Chapter 3. Deep Learning
Notes to this Chapter
Let us give a memory to our Neural Network
The example of book writing Software
Deep learning: the ability of learning to learn
How does Deep Learning work
Main architectures and algorithms
Main types of DNN
Available Frameworks and libraries
Key points of this Chapter
Chapter 4. Convolutional Neural Networks
Notes to this Chapter
Deep Learning and Convolutional Neural Networks (CNNs)
Tunnel Vision
Convolution
The right Architecture for a Neural Network
Test your Neural Network
Key points of this Chapter
CONCLUSIONS
Introduction
“A sinister threat is brewing deep inside the technology laboratories of
Silicon Valley. Artificial Intelligence, disguised as helpful digital assistants
and self-driving vehicles, is gaining a foothold -- and it could one day spell
the end for mankind”.
ELLIE ZOLFAGHARIFARD, Mail Online
You have certainly noticed that Machine Learning and Artificial Intelligence
are seemingly the biggest technological hype waves of the moment. Global
giants such as Apple, Amazon, IBM, Google and Facebook invest heavily in
machine learning research and applications, and for good reasons.
But what is this hype all about?
Machine Learning applications are already very numerous today, and some
have entered our daily lives without us actually realizing it. Think of search
engines, for example: when you enter one or more keywords, a search engine
returns a list of results, called a SERP (Search Engine Results Page), produced
by Machine Learning algorithms using unsupervised learning (providing as
output the information deemed relevant to the search carried out, based on the
analysis of schemas, models and structures in the data).
Another common example is email spam filters based on Machine Learning
systems that learn continuously to intercept suspicious or fraudulent e-mail
messages and act accordingly (for example, deleting them before they are
delivered to the user's personal mailbox). Systems of this type, with even
greater sophistication, are used in the financial sector for the prevention of
fraud (such as credit card cloning) and the theft of identity data.
Algorithms learn to act by correlating events, user habits, spending
preferences, etc. Through this information, algorithms can identify in real
time any abnormal behavior that could actually indicate a theft or a fraud. We
can also find interesting examples in the medical field, where algorithms
learn to make accurate predictions to prevent epidemic outbreaks or to
diagnose cancer or rare diseases in an accurate and timely manner.
Neural Networks and Deep Learning
Since their birth in the 1950s, Artificial Neural Networks have been labeled
by experts as one of the most promising areas of science within the Machine
Learning field. (Artificial) Neural Networks are computational models
inspired by the functioning of the human brain. Their main feature is the
ability to learn during a training phase and then generalize the knowledge
acquired to predict new situations. Just like a human brain, these networks
have an internal memory that increases through experience.
Deep learning is a term that indicates a particular approach to the design,
development, testing and especially training of neural networks. The birth
and development of Deep Learning has been made possible thanks to the
technological improvements that appeared at the end of the 2000s. The first is
the GPU: large arrays of small processors, originally designed to render
images in Nvidia video cards. This technology, initially created for video
games, turned out to be very fast at processing multilayer neural networks,
with much better performance than normal CPUs. The second innovation that
has determined the definitive advent of deep learning is the Big Data
phenomenon, that is, the availability of massive and varied amounts of data
thanks to the diffusion and growth of the Internet and of the services
connected to it. There are now billions of images, both labeled and unlabeled,
ready to be used as training and test data; years of recorded telephone calls
providing audio samples of thousands of different voices; millions of
digitized texts... in a few words, a 'planet database' containing zettabytes of
information to use!
Before diving into this exciting world, I want to explain that the goal of this
book is to offer information that is highly informative, yet accessible to
anyone - which entails a good deal of generalization and simplification, so
bear with me.
Nevertheless, if this guide can make someone more passionate about Deep
Learning, my mission can be considered as fulfilled.
Chapter 1. A brief introduction to
Machine Learning
Notes to this Chapter
The aim of this Chapter is to introduce the general concepts of machine
learning, the two main types of learning and some basic terminology, as
a general basis for introducing Neural Networks in Chapter 2.
What is Machine Learning?
Imagine being a professor and entering a class in which the students are all
computers, sitting neatly with their aprons and lunch boxes, ready for you to
teach. Your role, in that classroom, is to teach them to make decisions
autonomously. Being a regular teacher will not help you succeed here. In order
to make your computer-students autonomous, you will need to know
algorithms in the field of automatic learning. This field has often been
associated with artificial intelligence, and more specifically, computational
intelligence. Computational intelligence is a method of data analysis that
aims at the automatic creation of analytic models - that is, at allowing a
computer to work out concepts, evaluate options, make decisions, and
predict future outcomes. Obviously, these choices must be able to adapt to new
situations rather than be fixed in advance. Systems based on this type of
learning are at the basis of the development of self-driving cars, for example.
Machine Learning therefore has to do with the ability of a computer to learn,
given an input array of data (for example data collected through sensors,
GPS, etc.) and suitable algorithms, how to recognize the surrounding
environment and adapt its "behavior" to the specific situations it faces
(becoming able even to drive a car).
Yet, for a novice, the theoretical concepts behind machine learning can be
quite overwhelming. Of course, we cannot discuss all the nitty-gritty details
about all the different algorithms and applications that have emerged in the
last 65 years, however, in this book, we will embark on an exciting journey
that covers some relevant basics.
So, Machine Learning refers to the fact that there are generic algorithms that
can give interesting information about a certain data set without you having
to write any specific code for the problem. Instead of writing code, the data is
inserted into a generic algorithm and the algorithm generates its own logic
based on the input. For example, one of these algorithms is the Classification
Algorithm. I can feed it mixed, heterogeneous data, and the algorithm is able
to autonomously identify common features and classify the data into several
groups. This same classification algorithm can be used for purposes that
range from recognizing handwritten numbers to classifying e-mails,
indicating whether they are spam or not, without having to change any line of
code. It is the same algorithm, but it is powered by different Training Data so
that different classification logic emerges. Machine Learning is an umbrella
term that includes many of these generic algorithms.
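To make this concrete, here is a minimal sketch (using the scikit-learn library mentioned later in this chapter) of the "same algorithm, different training data" idea: the very same classifier class learns to recognize handwritten digits or to separate spam-like emails, depending only on the data it is fed. The tiny spam data set is invented purely for illustration.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_digits

# 1) The generic algorithm: a decision-tree classifier.
#    Trained on handwritten digits, it learns to recognize digits.
digits = load_digits()
digit_model = DecisionTreeClassifier()
digit_model.fit(digits.data, digits.target)
print(digit_model.predict(digits.data[:5]))   # guesses for the first 5 digit images

# 2) Exactly the same algorithm, fed different (made-up) training data:
#    each row is [number of links, number of ALL-CAPS words] in an email,
#    and the label says whether that email was spam (1) or not (0).
emails = [[12, 9], [10, 7], [0, 1], [1, 0], [11, 8], [0, 2]]
labels = [1, 1, 0, 0, 1, 0]
spam_model = DecisionTreeClassifier()
spam_model.fit(emails, labels)
print(spam_model.predict([[9, 6]]))           # most likely classified as spam
```
Not a single line of the algorithm changed between the two uses; only the training data did.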
Two main Types of Machine Learning Algorithms
Machine Learning algorithms can be divided into two main categories -
Supervised Learning and Unsupervised Learning algorithms. The difference
is simple but very important:
Supervised Learning algorithms: Let us suppose you are the owner of a
car dealership centre. Your business is growing, and you have to hire new
staff and train them to help you. But there is a problem – it takes you just a
glance at a used car to have a clear idea of how much it is worth, but your
new employees do not have a clue, since they do not have
your experience in estimating car prices. To help your employees (and
perhaps get a few free days to go on vacation), you decide to write a small
application that can estimate the value of a car based on its registration
year, engine size, travelled kilometers and the average selling price of the
same brand of car. For 3 months, you register the price at which certain
models of cars sell, writing down the details of the transaction – year of
the car, model, kilometers, power, etc., and last but not least the final sale
price. Using this training data, you can create a program that can estimate
what could be the right sales price of any other car of the same models
(let’s say Mercedes and Audi). This is called supervised learning. You
know the price at which each car has been sold, so knowing the previous
answers to the problem, you are able to work backwards and understand
the logic to apply in order to solve new similar problems. To build your
application, you have to feed the Machine Learning algorithm with the
data you collected regarding each car. The algorithm will try to understand
what kind of mathematical functions it must use to produce the solution
for new problems. Once you know what math function applies to a
specific set of problems, you will be able to produce a solution for any
other problem of the same type!
Unsupervised Learning: Let us go back to our example with the car-
dealership center owner. How would you create the same application
without knowing the sales price of each new car? Even if you know only
the kilometers and model of each car, you can still obtain interesting
results… this is unsupervised learning. It's like someone giving you a list
of numbers on a piece of paper and telling you, "I do not really know what
these numbers mean, but maybe you can figure out if there's a scheme or a
code or something behind them - have fun!"
So, what could you do with this data? To begin with, you might have an
algorithm that automatically identifies the different market segments in the
data. You might find that Audis are bought at higher prices below a certain
mileage, but Mercedes models can be priced like gold even over a certain
number of kilometers. Knowing these different types of customer choices
can help you better manage your marketing work.
Another interesting thing you could do is to automatically identify any
abnormal values that are very different from all the others. Maybe Audis
that are sold at abnormally high prices all are of a certain model and year,
and you can try to concentrate your salespersons on promoting those so
you get higher commissions.
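As a rough illustration of this unsupervised idea, here is a minimal sketch using scikit-learn's k-means clustering on a small invented table of (kilometers, price) pairs; the numbers are made up purely to show how market segments can emerge from unlabeled data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented data: one row per sold car -> [kilometers driven, sale price in $]
cars = np.array([
    [ 20_000, 28_000], [ 25_000, 27_500], [ 30_000, 26_000],   # low mileage, pricey
    [120_000, 12_000], [150_000, 10_500], [140_000, 11_000],   # high mileage, cheap
    [ 80_000, 19_000], [ 90_000, 18_000], [ 85_000, 18_500],   # something in between
])

# Ask for 3 clusters; we never tell the algorithm what the groups "mean".
segments = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(cars)
print(segments)   # e.g. [0 0 0 2 2 2 1 1 1] - three market segments found on their own
```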
Other than these, there are also other kinds of Machine Learning algorithms:
Reinforcement Learning algorithms: In this case, the system has to
interact with a dynamic environment (which allows it to have input data)
and reach a goal (receiving a reward accordingly), also learning from
errors (identified through "punishments"). The behavior (and performance)
of the system is determined by a routine of learning based on reward and
punishment. With such a model, the computer learns, for example, to beat
a rival in a game (or to drive a vehicle) by concentrating efforts on
carrying out a given task. While doing this, it is aiming at reaching the
maximum value of the reward; in other words, the system learns to play
(or to drive) by improving performance as a result of previously achieved
results.
Semi-supervised learning: this is a "hybrid" model in which the computer
is provided with an incomplete set of training / learning data; some of
these inputs are "endowed with" their respective output examples (as in
supervised learning), while others are missing (as in unsupervised
learning). The underlying objective is always the same: identify problem-
solving rules and functions as well as data structures that are useful for
achieving certain goals.
Other Practical Approaches in Machine Learning: There are also other
Machine Learning subcategories, if we are thinking in "practical" terms.
These approaches range from probabilistic models to Deep Learning,
which is the main topic of this book. For example, we can think of the so-
called "decision tree", a graph-based technique through which you can
develop predictive models that link certain input decisions to their
outcomes. Another concrete example is "clustering", or
mathematical models that allow you to group data, information, objects,
and so on according to their “similarity". There is then the sub-category of
"probabilistic models", which base the learning process on the calculation
of probabilities. The best known is the "Bayesian network", a probabilistic
model that represents in a graph a set of random variables and their
conditional dependencies (relationships between two or more events that
become dependent when a third event occurs). Finally, there are artificial
neural networks that use certain algorithms to learn inspired by the
structure, functioning and connections of biological neural networks (i.e.
those in the human being). In the case of so-called multi-layer neural
networks, you enter the field of Deep Learning. The latter two are the
topics of this book.
A practical example of Supervised Learning
“Great, but can being able to estimate the price of a car really be considered
‘learning’?” you might be thinking. In humans, the brain can approach any
situation and learn how to deal with it without explicit instructions. If you sell
cars for a long time, you will surely develop a "feeling" for the right
price of a certain car, the best way to market it, the kind of customer that
might be interested in it, etc. The goal of Strong AI (Strong Artificial
Intelligence) research is to be able to replicate this ability on computers.
To date, Machine Learning algorithms are not so advanced - they only work
when they face a very specific, limited problem. In this case, perhaps a more
appropriate definition of "learning" would be “understanding an equation to
solve a specific problem, based on a sample of data." Unfortunately,
"Automation that allows you to understand an equation that solves a specific
problem, based on a sample of data" is not a name your granny would
understand. Therefore, we have to call it ‘Machine Learning’.
So, how do we write the program to estimate the value of a car as in our
example? Think about it for a moment before you read ahead.
If you do not know anything about Machine Learning, you will probably try
to write some basic rules to estimate the price of a car; you can spend hours
and hours over it, and eventually get something done. However, the program
will never be perfect and it will be difficult to keep it updated as prices
change constantly. Would it not be better if your computer could figure out
how to solve this issue in your place? We do not care what the function
exactly does, but that it produces the correct values. You can think of this as
if the price was a delicious stew and the ingredients were the mileage, the
year, the size of the engine and the model of the car. If you understand how
much each ingredient impacts on the final price, there is probably an exact
ratio among the ingredients that produces the perfect price.
This would reduce the original function (with all its 'if' and 'otherwise') to
something much simpler: by identifying the elements that weigh on our final
result, we can adjust the weight of each element until we find the perfect
combination, and our function will be able to automatically predict car prices!
A simple way to understand the best ratio of your elements is:
Step 1: Start with the weight of every element set to 1.0.
Step 2: Use the function for each car you have pre-existing data for and see
how your function's results differ from the correct real price. For example, if
the first car was actually sold at $25,000, and the function produced $15,000
as a result, there is a difference of $10,000 for that particular car.
Step 3: Now sum up the squared differences between your function's result
and the real selling price for every car in your data set. Let us say you have
500 car sales in your data set, and the total your function has accumulated is
$8,000. This value indicates how "wrong" your function is at present.
Step 4: Now, take that total sum and divide it by 500, to get an average
difference for each car. Call this average amount your ‘wrong function cost’.
If you could take this cost down to zero, playing with the weight ratio of the
different elements in your function, the function would be perfect. It would
mean that in all cases, the function perfectly guesses the car price based on
the input data.
Step 5: Repeat steps 2 to 4 over and over again with every possible
combination of weights. Whichever combination of weights brings the cost
closest to zero is the one to use. When you find the weights that work, you
have solved the problem!
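Put together, Steps 1 to 4 boil down to a single "how wrong are we?" function. Here is a minimal sketch in plain Python, with a tiny invented data set; the feature names and numbers are made up only to mirror the car example.

```python
# Each car: [registration year, engine size (litres), kilometres driven], plus its real sale price.
cars = [
    ([2015, 2.0,  60_000], 18_000),
    ([2012, 1.6, 120_000], 11_000),
    ([2018, 3.0,  30_000], 31_000),
]

def estimate_price(features, weights):
    # Step 1's idea: a weighted sum of the car's features.
    return sum(f * w for f, w in zip(features, weights))

def cost(weights, cars):
    # Steps 2-4: average of the squared differences between guess and real price.
    total = 0.0
    for features, real_price in cars:
        difference = estimate_price(features, weights) - real_price
        total += difference ** 2
    return total / len(cars)

print(cost([1.0, 1.0, 1.0], cars))   # Step 1: every weight starts at 1.0 -> a huge cost
```
Step 5 then amounts to trying different weights and keeping whichever set makes this cost smallest.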
Simple, right? Well, think about what you just did. You took some data, you
ran it through a handful of generic, very simple steps, and you ended up with
a function that can guess the right price of a car.
Wow! However, here are some facts that will leave you even more
astonished:
Research in many fields (such as linguistics / translation) over the last
40 years has shown that these generic learning algorithms that "mix
data" have better results than those generated by real people who use
explicit rules. The "silly" approach of Machine Learning ultimately
beats human experts (the algorithm does not know ‘why’ it is
attributing more weight to one element rather than another, as an
experienced senior car dealer would). Have you ever used Google
Translator? I bet you have, and it will allow you to understand exactly
what I mean. The program takes the information it needs for processing
the translations from unfiltered web pages, full of spelling and syntactic
errors, and which sometimes are even incomplete. Yet the
overwhelming amount of data available allows the program to be more
reliable than all its predecessors, which were based on correct
dictionaries drafted by experts, but containing a limited amount of
information. The function we spoke about in this paragraph is actually
‘silly'. The algorithm does not know what mileage or engine size actually
mean; all it knows is that it needs to mix these values to output
the correct answer.
It is very likely that you have no idea why a particular combination of
weights works. Therefore, you wrote a function that you do not fully
understand, but nevertheless works perfectly.
Imagine that instead of taking the registration year and the mileage as
parameters, the function takes an array of numbers. Let us
say that each number represents the brightness of a pixel of an image
captured by the camera mounted on the top of your car. Now let us say
that instead of looking for a value called "price", the function outputs a
forecast called "steeringwheel_rotation_degrees". You just have created
a function that can drive a car on its own! Incredible, right? Your
granny would be astonished.
But what about the "try every possible ratio" thing we spoke about in Step 5?
Okay, of course you cannot try all the existing ratio combinations of all the
possible weight features to find the best combination. If you were to take this
literally, you would continue ad infinitum, because you would never run out
of numbers to try. To avoid this, mathematicians have figured out many clever
ways to quickly find good values for those weight features without having to
do many attempts. For example, try this. First, write a simple equation that
represents Steps 2 to 4 - the average of the squared differences between our
guesses and the real prices:

cost(weights) = (1/N) × Σ (estimated_price − real_price)²

This equation represents how wrong our estimate is, given the weights we
assigned to the various elements that impact the final price. If we were to
trace this cost equation for all the possible values of our weight features, we
would get a bowl-shaped graph with a single lowest point in the middle.
We just need to adjust our weights so that we move towards that lowest part of
the chart. If we keep making small adjustments to our weights, moving them
towards the lowest point, we will reach our goal without having to try a great
number of other combinations. If you remember something about calculus,
you might remember that the derivative of a function indicates the slope of
the tangent to the function at any point. In other words, it tells us which
direction is downhill from any point on our chart. We can use this concept to
keep walking downhill. Therefore, if we calculate the partial derivative of our
cost function with respect to each of our weights, we can subtract that value
from each weight. This will allow us to walk closer and closer to the bottom
of the chart. Keep doing so and eventually you will reach the bottom, where
you have the best possible values for your weights. This is a summary of one
way to find the best weights for your function, called Batch Gradient Descent.
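As a rough, self-contained illustration of batch gradient descent, here is the cost idea from the earlier sketch with the invented features rescaled to similar ranges (so that a simple fixed learning rate behaves well); each pass nudges every weight a little in the downhill direction of the cost.

```python
# Invented data, features scaled to comparable ranges:
# [age in decades, engine size in litres, kilometres in units of 100,000], real price in $1,000s.
cars = [
    ([0.9, 2.0, 0.6], 18.0),
    ([1.2, 1.6, 1.2], 11.0),
    ([0.6, 3.0, 0.3], 31.0),
]

weights = [1.0, 1.0, 1.0]          # Step 1: start with every weight at 1.0
learning_rate = 0.05

def estimate(features, weights):
    return sum(f * w for f, w in zip(features, weights))

for step in range(20_000):
    # Partial derivative of the average squared error with respect to each weight.
    gradients = [0.0] * len(weights)
    for features, real_price in cars:
        error = estimate(features, weights) - real_price
        for i, f in enumerate(features):
            gradients[i] += 2 * error * f / len(cars)
    # Walk a small step downhill on every weight at once (one "batch" update).
    weights = [w - learning_rate * g for w, g in zip(weights, gradients)]

print(weights)                             # weights that make the cost (almost) as small as possible
print(estimate([0.9, 2.0, 0.6], weights))  # prediction for the first car - very close to its real 18.0
```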
The great thing is that when you use Machine Learning to solve a real
problem, all these calculations will be done for you by your computer.
Nevertheless, it is always useful to have a good idea of what is going on!
I did skip some details in this explanation, for simplicity and convenience.
The algorithm I described above is called Multivariate Linear Regression.
You are estimating the equation of a line that best fits all of the points in
your car data. You can then use this equation to estimate the sale price of
cars you have never seen, based on where they would fall on that line. This is
really a powerful idea, with which you can solve "real" problems.
The approach I have shown can work in the simplest cases, but not in all
cases. One reason is that car prices are not always simple and well-behaved
enough to follow a straight line. Luckily, however, there are many ways to
handle this problem. Many other Machine Learning
algorithms, as we will see, can handle nonlinear data (such as neural
networks or Support Vector Machines with kernels). There are also ways to
use linear regression in a smarter way that allows you to handle the most
complicated lines. In all cases, the basic idea of having to find the best
combinations always applies.
In addition, I ignored the idea of overfitting. Overfitting refers to the risk of
“excessive adaptation” of your network, which usually occurs in machine
learning or statistics when a very complex statistical model fits the observed
data (the sample) too closely because it has too many parameters with respect
to the number of observations.
It is easy to come up with a set of weights that works perfectly to predict car
prices in the original data set, but then fails on new cars that were not present
in that data set. However, there are ways to deal with this problem too (such
as regularization and cross-validation on a held-out data set). Learning how
to deal with this problem is a key
element to learn how to apply Machine Learning successfully. In other
words, while the basic concept is quite simple, it takes some skill and
experience to apply Machine Learning and to get truly useful results. But it is
a skill that any developer can learn!
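A common, minimal way to catch overfitting is simply to hold back part of the data and check the function on examples it never saw during training. Here is a hedged sketch with scikit-learn, on made-up data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))                # invented car features
y = X @ [2.0, -1.0, 0.5] + rng.normal(0, 1.0, 200)   # invented "true" prices plus noise

# Keep 25% of the cars aside; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("error on cars used for training:", mean_squared_error(y_train, model.predict(X_train)))
print("error on cars never seen before:", mean_squared_error(y_test, model.predict(X_test)))
# If the second number is much worse than the first, the model has overfitted.
```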
Once you begin to understand the simplicity with which Machine Learning
techniques can be applied to problems that seem really complex (such as
handwriting recognition), you begin feeling as if you could use Machine
Learning to solve any existing problem and get a solution as long as you have
enough data. Just feed in the data and watch the computer magically work out
the equation that separates it! However, it is important to stress again that
Machine Learning works only if the problem is solvable with the data you
have available. For example, a model that predicts car prices based on the
color of the car, will never work. There is no relationship between the color
of a car and its selling price. Therefore, no matter how you feel, your
computer cannot infer a relationship between the two!
So remember, if an expert human cannot use the data to solve the problem
manually, even a computer will probably not be able to solve it. Instead,
focus on problems a human being can solve, but it would be better if a
computer could solve them, for instance because it would be much faster.
In my opinion, at this time the biggest problem with Machine Learning is that
it lives mostly in the academic world and research groups. For non-experts
who wish to have a wider view on the subject, it is not easy to understand the
material about it. Nevertheless, this is getting better every day. Andrew Ng's
free Machine Learning class on Coursera is amazing, and I would highly
recommend it. It should be understandable for anyone who has a degree in
computer science and remembers a minimum of math. In addition, you can
play with tons of Machine Learning algorithms by downloading and installing
SciKit-Learn, a Python framework that offers "black box" versions of all the
standard algorithms.
Key points of this Chapter
Machine Learning refers to the fact that you can feed data into a generic
algorithm and the algorithm generates its own logic based on the input
without you having to write any specific code for the problem;
There are two main categories in Machine Learning algorithms -
Supervised Learning and Unsupervised Learning;
We have Unsupervised Learning when you only have input data (X)
and no corresponding output variable. We have Supervised Learning
when you have input variables (X) and an output variable (Y) and you
use an algorithm to learn the mapping function from the input to the
output.
Chapter 2. Neural Networks
Notes to this Chapter
In the previous chapter, we said that Machine Learning uses generic
algorithms to extrapolate interesting information about data without having to
write specific code for the problem that you want to solve. Now, we will
work on building a very specific implementation: a Neural Network!
What are Neural Networks?
So, what are Artificial Neural Networks? They are a new computer
technology: think of many simple parallel processors, strongly integrated by a
network of connections that create a computational distributed model.
Single processing unit
Their architecture was quite innovative back in the 1950s, when they were
born in clear analogy with the structure of the brain: many neurons (about
10^10) strongly connected by synapses, through which computations spread
in parallel across the cerebral cortex. This architecture makes new kinds of
performance possible: finding solutions to complex problems in real time,
self-learning, resistance to faults and errors, etc.
Neural Network
But how do Neural Networks work exactly?
A Neural Network consists of numerous homogeneous processing units,
strongly interconnected through links of varying intensity. The activity of the
single unit is simple, and the power of the model lies in the configuration of
connections (topology and intensity). Starting from the input units, to which
the data is fed in order to solve a problem, the computation propagates in
parallel within the network up to the output units, which provide the result. A
Neural Network is not programmed to execute a certain single activity, but
trained (using an automatic learning algorithm) by means of a series of
examples of the reality to model.
Back in the 1950s, Neural Networks had a simple structure with few internal
units, while current ones involve millions of units and are able to learn
incredibly complex patterns, even if they require much more powerful
computers and more sophisticated training techniques.
If we were to describe the functioning of an Artificial Neural Network in a
sentence, we could say that it takes in input data, and ‘makes sense of it’,
finding regularities, or patterns.
As you are reading this book, your brain is organizing letters into words, and
words into sentences, until the meaning emerges. Likewise, an Artificial
Neural Network can analyze your comments on a website in search of
meaningful words and relevant topics. This drive to make sense of
information is so deeply inherent in our brain, that it is also applied to
meaningless patterns. When we look at the clouds, we spontaneously
associate them with real images and categories. This same mechanism is
present in Artificial Neural Networks, which make them extremely
fascinating and at the same time frightening.
In the previous chapter, we created a simple algorithm that calculates the
value of a car based on some of its features. In other words, we have weighed
the value of the car by multiplying each of its features by a certain weight
value. Then we summed up these numbers until we obtained the value of the
car. Instead of using code, we tried to represent this function using a simple
diagram; however, this algorithm only works for very simple problems, for
which the result has a linear relationship with the input data. What if there
was not such a simple relationship behind the determination of car prices? For
example, maybe mileage weighs heavily on the price of recent cars, but
matters far less for vintage models. How could we handle this kind of
peculiarity in our model? To get a better result,
we could run this algorithm several times with different weights to catalog
different limit-cases.
So now, we have four different estimates. Let us combine these four results to
get a single final price. Let us put them all in the same algorithm (but using
another combination of weights)! Our new super-solution combines the
estimates of our four different attempts to solve the problem. Thanks to this,
we can model more cases than we could capture with a simple pattern.
Let us unite our four attempts in one big scheme: this is a neural network!
Each node knows how to take a set of values as input, apply weights to them,
and calculate an output value. By linking together many nodes like these, we
can model very complex functions.
I am making many generalizations to keep the explanation simple (for
example, I am not talking about feature scaling or about the activation
function), but the most important part is based on these ideas:
We have created a simple estimation function that takes a number of
inputs and multiplies them for various weights to get an output. We call
this simple function a neuron.
By linking many simple neurons together, we can model functions that
would be too complicated to be managed by a single neuron.
It is just like LEGO! We cannot model much with a single LEGO brick,
but we can model almost anything if we have enough bricks to put together.
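Here is a minimal sketch of that idea in plain Python with NumPy: four "neurons" each compute their own weighted sum of the same car features, and a final neuron combines their four estimates into one. All the numbers are invented for illustration.

```python
import numpy as np

def neuron(inputs, weights):
    # One node: a weighted sum of its inputs.
    return np.dot(inputs, weights)

# Invented car: [age in decades, engine size in litres, kilometres in 100,000s]
car = np.array([0.9, 2.0, 0.6])

# Four attempts at estimating the price, each with its own (made-up) weights.
hidden_weights = np.array([
    [ 3.0,  8.0, -5.0],
    [-2.0, 10.0, -1.0],
    [ 5.0,  4.0, -6.0],
    [ 1.0,  9.0, -3.0],
])
estimates = np.array([neuron(car, w) for w in hidden_weights])

# A final neuron combines the four estimates (again with made-up weights).
output_weights = np.array([0.3, 0.2, 0.3, 0.2])
price_in_thousands = neuron(estimates, output_weights)
print(price_in_thousands)   # the network's single, combined estimate
```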
McCulloch-Pitts's Neuron
So, up to now we have said that Artificial Neural Networks are computational
systems inspired by the biological processes that occur in the human brain.
Many of their features in fact are inspired by biological processes:
• they are formed by millions of computational units (called neurons),
each able to compute a weighted sum;
• they have a high number of weighted connections (synapses) among
the units;
• they are highly parallel and non-linear;
• they are adaptive and trainable, and learning is done by changing
the connections’ weights;
• they are error-tolerant because storage is widespread;
• there is no distinction between memory and calculation area;
• they have generalization skills: they can produce reasonable outputs
with inputs that they never have encountered before during the learning
process.
Wow! Exciting. Let us dive into the details of how a single unit works: the
McCulloch-Pitts neuron.
A neuron is the fundamental unit of calculation of a Neural Network, and is
made up of 3 basic elements in this model:
1. a set of synapses or connections each of which is characterized by a
weight (synaptic efficacy); unlike the human model, the artificial model
can have both negative and positive weights;
2. a summing junction that sums the input signals, weighted by the
respective synapses, producing a linear combination of the inputs;
3. an activation function to limit the amplitude of the neuron's output.
Typically, for convenience, the output is restricted to the
interval [0,1] or [-1,1].
The neuronal model also includes a threshold value that has the effect,
depending on its positivity or negativity, of increasing or decreasing the net
input to the activation function.
In formulas, neuron k computes uk = Σj wkj · xj and outputs yk = φ(uk + bk), where:
• xj are the input signals and wkj the corresponding synaptic weights of neuron k;
• uk is the linear combination of the inputs of neuron k;
• bk is the threshold (bias) value of neuron k;
• φ(·) is the activation function;
• yk is the output generated by neuron k.
Types of activation function
The sigmoid is the most commonly used activation function in the creation of
artificial neural networks. It is a strictly increasing function that exhibits a
balance between linear and non-linear behavior. A typical example is the
logistic function φ(v) = 1 / (1 + e^(−a·v)), where a is a parameter that
indicates the slope of the function.
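A minimal sketch of such a unit in Python, mapping the symbols above directly onto code (the inputs, weights, bias and slope are invented for illustration):

```python
import math

def sigmoid(v, a=1.0):
    # Logistic activation: strictly increasing, squashes any value into (0, 1).
    return 1.0 / (1.0 + math.exp(-a * v))

def artificial_neuron(x, w, b):
    # u_k: weighted sum of the inputs; y_k: activation applied to u_k plus the bias.
    u = sum(xj * wj for xj, wj in zip(x, w))
    return sigmoid(u + b)

print(artificial_neuron(x=[0.5, -1.0, 2.0], w=[0.8, 0.2, -0.4], b=0.1))
```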
Types of network architectures
The way a network is structured is closely linked to the learning algorithm
you intend to use. In general we can identify 3 classes of networks:
single-layer feedforward networks, multilayer feedforward networks and
recurrent networks.
1. Single-layer Feedforward networks. In this simple form of layered
network, we have input nodes and a single layer of neurons (the output
layer). The signal propagates through the network in a forward direction
only, starting from the input layer and ending in the output one. There are
no connections that come back and no transversal connections within the
output layer.
Learning processes
How a network learns is just as important as how it is structured. We can
distinguish several learning paradigms.
1. Error-correction learning. The error signal ek(n), the difference
between the desired response and the one actually produced by neuron k,
implements a control mechanism whose aim is to apply a sequence of
adjustments to the synaptic weights of neuron k, in order to bring the
obtained response progressively closer to the desired one.
2. Memory-based learning. In memory-based learning, all (or many) of
past experiences are stored in a large memory of correctly classified
input-output pairs. When the classification of a never-before-seen example
xtest is requested, the system answers by finding and analyzing the stored
examples in the neighborhood of xtest. All memory-based learning
methods involve 2 basic ingredients:
The criterion used to define the neighborhood of a test vector xtest;
The learning rule applied to the examples in that neighborhood.
In the simplest variant, the nearest-neighbor rule, the stored example x'N
that minimizes the Euclidean distance d(xi, xtest) is selected. The class that
is assigned to xtest is the same as that of x'N. A variant of this method is
the k-nearest neighbor, in
which:
the neighborhood of the test example is no longer a single stored
example, but the set of its k closest stored examples;
the assigned class is the one that occurs most frequently among
those k neighbors.
3. Hebbian learning. This is based on Hebb's postulate on learning,
according to which when the axon of a neuron A (its output
transmission line) is close enough to excite a neuron B and repeatedly
and persistently takes part in firing it, a growth process starts in one or
both of the neurons which increases the efficiency of A in firing B.
From this we can derive two rules:
if 2 neurons connected by a synapse are activated simultaneously,
then the weight of the synapse is progressively increased;
If 2 connected neurons are activated asynchronously, then the
weight of the synapse is progressively decreased or eliminated.
A synapse of this type is called a Hebbian synapse. If the correlation
of the signals leads to an increase in synaptic efficacy we call the change
a Hebbian modification; if it leads to a reduction, we call it anti-Hebbian.
4. Competitive learning. With competitive learning, the neurons of a
neural network compete with each other to become active. Only one
neuron can be active at a given time n. There are 3 basic elements in a
competitive learning method:
a set of neurons that are identical except for their randomly
generated synaptic weights, so that they respond differently to a
given set of inputs;
a limit to the "strength" of each neuron;
a mechanism that allows the neurons to compete for the right to
respond to a given subset of inputs, so that only one neuron (or one
group) is active at a certain time. The winning neuron is called the
winner-takes-all neuron.
In this way neurons tend to specialize on sets of similar inputs, becoming
specialized in recognizing the features of different classes of inputs. In
the simplest form, such a network has only one layer of output neurons,
completely connected to the input nodes (forward excitatory
connections). The network may also include connections between
neurons that produce lateral inhibition (inhibitory feedback connections).
Advantages and disadvantages
So, we have seen an overview of the main basic features of a Neural Network
and understood how it basically works. It is also important that you are able to
mentally locate Neural Networks in a broader framework, so we shall
highlight their major advantages and disadvantages in the field of
innovation. The advantages of using a NN are:
• that they are suitable for problems that do not require accurate
answers, but approximate answers with a degree of error or variation
• their ability to generalize: they can produce good answers even
with input that has not been considered during their creation and
training
• they are easy to implement, you can just define a neuron and then
copy it and create connections between the neurons
• they deliver fast operation because they work in parallel; every neuron
uses only its own input
• the stability of the output with respect to input values: input values
can be incomplete, noisy, not well known, or accept a degree of error or
change
• they can determine the result taking into account all inputs at the
same time.
On the other hand, using Neural Networks entails a series of disadvantages:
• a Neural Network works like a black box: generally, you will not be
able to understand why it produced that specific result
• the memorized knowledge cannot be easily described or localized
within the network
• they are usually simulated on serial computers, since dedicated
parallel hardware to implement them is often lacking
• they often entail sophisticated training techniques that take a long
time for calculations
• there is not always a network that can solve a specific problem,
because there is not always a learning algorithm that converges giving a
low error network output
• the output values are not accurate, but have a margin which may
vary
• we need a very large series of examples to have a good learning
process and a low output error.
Key points of this Chapter
Artificial Neural Networks are able to make decisions independently,
based on system inputs, which also include error variables. It is thanks to
these that the systems based on neural networks are able to improve with
recurring iterations, without human intervention;
At the input of a neural network we find a numeric vector, which can
represent different types of data (pixels, audio signals, video signals or
words, to give just a few possible examples). The input vector is generally
transformed by a series of functions that operate on the vector itself, and the
result is the output generated by the network;
The great peculiarity of neural networks is that the product of the system is
the prediction of some properties that the network itself tries to guess
starting from the input received. A trivial example can be the one in which
we have an image when entering the neural network and, on the basis of
some functions, it tries to guess if there are cars in the picture;
Clearly, the functions of the system are managed by the ‘brain’, which in
this case is a memory containing in turn other vectors of numbers known
as "weights". The latter define how the network inputs must be combined
and recombined to produce the most reliable result possible;
The more complex the problem is, the larger (in informational terms) the
set of weights has to become. For example, recognizing a car inside an
image can be a relatively difficult task, depending on the quality of the
picture and on the shapes and spatial arrangement of the objects in it;
The most difficult task of a designer of a neural network is the definition
of the weights, and of what "values" these must assume to ensure that the
system does a good job when there is the need to generate a prediction.
Chapter 3. Deep Learning
Notes to this Chapter
The aim of this Chapter is to understand the basics of Deep Learning, one of
the most interesting and ambitious fields of research currently under
development, which finds various applications in real life.
Let us give a memory to our Neural Network
In a neural network, we will always get the same response when we feed in
the same input values. A neural network per se has no memory. If we use
programming terms, we can define it as a Stateless Algorithm. In many
cases (like for the car price estimate), this is exactly what we need. However,
one thing this type of model cannot do is give answers that rely on data it
has seen at a different moment in time.
Imagine I handed you a keyboard and asked you to write a story. But before
you start, I have to guess the first word you will type. How can I guess what
that word will be?
I can use my knowledge about language to increase my chances of guessing
the right word. For example, you will probably use a word with which many
stories start, like in “Once upon a time…”. If I could look at other stories that
you have written in the past, I could narrow down my guesses to the words
you usually use at the beginning of your stories. Once I have all this data, I
could use it to build a neural network to know the probability with which you
could use any given word. However, let us make the problem more difficult.
Let us say I need to guess the next word you will be typing at any random
place in your story. This is a much more interesting issue.
Let us use the first words in the book Pride and Prejudice by Jane Austen:
“It is a truth universally acknowledged that a single man in possession of a
good fortune, must be in want of a...”
What word would you write after these? You would probably say that the
next word is destined to be 'wife'. We know answers like this from the
words we have already seen in the sentence and from our common
knowledge. In other words, it is easy to guess the next word if we consider
the sequence of words already written and our knowledge of the rules of
language.
To solve this problem with a neural network, we need to add memory to our
model. Whenever we ask our neural network for a response, we must also
save a series of intermediate calculations that we can re-use the next time as
part of our input. In this way, our model will also adjust its predictions on
previously processed data.
Keeping track of status in our model makes it possible not only to predict the
first most likely word in a story, but also to predict the most likely next word
given all the previous words.
This is the basic idea of Recurrent Neural Networks (RNNs): the network's
state is updated every time we use it, which allows it to adjust its predictions
based on what it has seen before. It can also model patterns over time, as
long as it has enough memory.
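At its core, this "memory" is just a hidden state vector that is carried from one step to the next. A minimal sketch in NumPy (the sizes and random weights are invented; a real RNN would learn these weights during training):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 84, 16             # e.g. 84 distinct characters, as in the example below

# Randomly initialised weights (in a trained RNN these would be learned).
W_xh = rng.normal(0, 0.1, (hidden_size, vocab_size))    # input  -> hidden
W_hh = rng.normal(0, 0.1, (hidden_size, hidden_size))   # hidden -> hidden (the "memory" loop)
W_hy = rng.normal(0, 0.1, (vocab_size, hidden_size))    # hidden -> output scores

def rnn_step(x_onehot, h_prev):
    # The new hidden state depends on the current input AND on the previous hidden state.
    h = np.tanh(W_xh @ x_onehot + W_hh @ h_prev)
    scores = W_hy @ h                         # one score per possible next character
    return scores, h

h = np.zeros(hidden_size)                     # empty memory before reading anything
for char_index in [11, 3, 7]:                 # indices of the characters read so far (made up)
    x = np.zeros(vocab_size); x[char_index] = 1.0
    scores, h = rnn_step(x, h)                # h carries information forward to the next step

print(scores.argmax())                        # the character this (untrained) network would guess next
```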
While the ability to predict the next word in a story may seem rather useless,
think about the predictive text function on your cell phone keyboard.
And what if we bring this idea to the extreme? What if we ask the model to
predict the next most likely word repeatedly - always? We are asking a
computer to write an entire book for us!
The example of book writing Software
In the previous paragraph, we saw how we could guess the next word. Now
we can try to create a whole story in Jane Austen's style. To do this, we will
use an implementation of a Recurrent Neural Network written by Andrej
Karpathy, a researcher at Stanford who has written an excellent introduction
to text generation with RNNs; you can find all the code for the model on
GitHub. We will train our model on the full text of Pride and Prejudice:
about 120,697 words, built from 84 unique characters (including punctuation,
uppercase letters, etc.). This data set is actually very small compared to
real-world applications; to really create a good model in Austen's style, it
would be much better to have many more samples of text. However, it is
enough for this example. Since we have only just begun to train it, our
Recurrent Neural Network is still not very good at predicting characters, and
after the first hundred training cycles its output is little more than a jumble of
letters.
Just as an example, the human body's visual system operates on a hierarchy
of (deep) levels.
The most widely used DNNs consist of between 7 and 50 layers. Deeper
networks (100 layers and above) have been shown to deliver slightly better
performance, but at the expense of efficiency. The
depth (number of levels) is only one of the factors of complexity: the number
of neurons, connections and weights also characterize the complexity of a
DNN. The greater the number of weights is (i.e., the parameters to be
learned) the greater will be the complexity of the training. At the same time, a
high number of neurons (and connections) makes forward and back
propagation more expensive.
These networks are used in particular in language modeling, object
recognition and more generally for modeling complex non-linear
relationships. Some limits of this type of network are the high execution time
of the learning phase and the risk of overfitting, caused by the possibility for
networks as deep as these to specialize on the specific features of the training
set, losing their ability to generalize.
Main types of DNN
The main types of Deep Neural Network are:
1. "Discriminatory" feedforward models for classification (or
regression) with predominantly supervised training:
CNN - Convolutional Neural Network (or ConvNet)
FC DNN - Fully Connected DNN (MLP with at least two hidden
levels)
HTM - Hierarchical Temporal Memory
2. Unsupervised training ("generative" models trained to reconstruct the
input, useful for pre-training of other models and for producing salient
features):
Stacked Auto-Encoders
RBM - Restricted Boltzmann Machine
DBN - Deep Belief Networks
3. Recurrent models (used for sequences, speech recognition, sentiment
analysis, natural language processing, ...):
RNN - Recurrent Neural Network
LSTM - Long Short-Term Memory
4. Reinforcement learning (to learn behaviors):
Deep Q-Learning
Convolutional Neural Networks
CNNs are a development of deep neural networks that, being designed
specifically for image recognition, use a particular architecture. Each image
used in learning is divided into topologically compact portions (portions
whose spatial properties are preserved even under deformation), each of
which is processed in search of particular patterns. Formally, each image is
represented as a three-dimensional array of pixels (width, height, color) and
each of its sub-sections is put in convolution (a mathematical way of
combining two signals to form a third signal) with a chosen filter. In other
words, by sliding each filter along the image, the inner product between the
filter itself and the underlying input is calculated. This procedure produces a
set of feature maps (activation maps), one for each filter.
Overlaying the various feature maps of the same portion of image we get an
output volume. This type of layer is called convolutional layer. After the
convolutional layer we have a sub-sampling layer, that is a layer that deals
with further subdivision of the processed image, in such a way as to analyze
in detail its sub-sections, reducing its dimensionality. One of the most widely
used types of sub-sampling is MaxPooling. This type of sampling partitions
the image into a series of non-overlapping rectangles and, from each
rectangle, returns the pixel with the maximum value. These two types of layers will
alternate throughout the body of the network. It is also possible to find other
types of layers combined with convolution and pooling layers. The last layer
is the output layer: it has as many neurons as there are possible labels, and for
each label it gives the probability that the tested sample belongs to it.
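To make the two operations concrete, here is a minimal NumPy sketch of a single convolution (sliding a filter and taking inner products) followed by 2x2 max pooling; the tiny "image" and filter are invented, and a real CNN would apply many learned filters across three color channels.

```python
import numpy as np

def convolve2d(image, kernel):
    # Slide the filter over the image; each output value is the inner product
    # between the filter and the image patch underneath it (no padding, stride 1).
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    # Partition into non-overlapping size x size squares and keep the largest value of each.
    h, w = feature_map.shape[0] // size, feature_map.shape[1] // size
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = feature_map[i*size:(i+1)*size, j*size:(j+1)*size].max()
    return out

image = np.arange(36, dtype=float).reshape(6, 6)       # a made-up 6x6 grayscale "image"
edge_filter = np.array([[1.0, -1.0], [1.0, -1.0]])     # a made-up 2x2 edge-detecting filter

feature_map = convolve2d(image, edge_filter)           # 5x5 activation map
print(max_pool(feature_map))                           # reduced to a 2x2 summary
```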
The training phase proceeds by processing a group (batch) of training
examples at a time and, for each of them, after obtaining the output, it
calculates the cost function - typically the cross-entropy between the
predicted probabilities and the true labels, L = −Σ yi·log(ŷi).
The Neural Network we built in the previous Chapters was able to take only a
few values as input (engine power, mileage, etc.). But now we want to
use our Neural Network to process images. How do we feed input to our
neural network using images and not numbers? The answer is very simple. A
neural network only takes numbers as inputs. For a computer, an image is
nothing more than a grid of numbers that represent the degree of darkness of
each pixel:
To feed an image into our neural network, we must transform the 18x18 pixel
image into an array of 324 numbers.
Step 1: Break the image into overlapping tiles
As in the sliding-window approach described above, we pass a sliding window
across the original image and save each result as a small separate tile. In this
way, we transform our original image into X smaller images of equal size.
Step 2: Insert each tile into a small Neural Network
Previously, we fed a single image into our neural network to see whether the
image contained a "2" or not. We will do the same thing here, but we will do
it for each of the X tiles we created. However, there is a big difference in
this case: we will use the same neural network weights for every tile of the
original image. In other words, we are treating every tile in the same way. If
something interesting appears in a specific tile, we will simply mark that
tile as particularly interesting.
Step 3: Save the results of each tile to a new array
To keep track of the original position of the tiles we will save the result of
our processing in a grid, maintaining the same arrangement as the tiles in the
original image. In other words, we started with a large image and we ended
up with a slightly smaller grid that records which tiles of the original image
the neural network considers interesting.
Step 4: Downsampling
The result of step 3 is a matrix that outlines which parts of the original image
are the most interesting. However, this matrix still is quite large. To reduce
the size of the array, we will use the Max Pooling algorithm we spoke about
in the previous Chapters.
For each 2x2 square of the matrix, we keep only the highest value and get rid
of everything else.
The idea is that if we find something interesting in any of the four input cells
that make up each 2x2 square, we just keep the most interesting piece. This
reduces the size of the array while retaining its most important parts.
Final step: Make a prediction
So far, we have reduced a large image to a much smaller size matrix. Guess
what? This matrix is nothing more than a set of numbers that can be used as
inputs in another Neural Network. This second Neural Network will decide
whether the image is what we are looking for. To differentiate it from the
previous Convolutional Neural Network, we will call it a Fully Connected
Neural Network. To wrap up, our Neural Network for Image Recognition
will consist of 5 sub-networks, arranged as follows.
In the scheme we described in our example above, we start with a 224 x 224
pixel image, with two Convolution and Max-Pooling processes, followed by
three Convolution processes, another Max-Pooling process, and two layers of
Fully Connected Neural Networks.
The end result is that the Neural Network can place the image in the correct
category among the 1000 possible!
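As a rough sketch of what such an architecture might look like in code (here using Keras rather than the TFLearn library used in the next section; the layer sizes are illustrative guesses, not the author's exact model):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    # Two Convolution + Max-Pooling blocks on a 224 x 224 colour image
    layers.Conv2D(64, 3, activation="relu", padding="same", input_shape=(224, 224, 3)),
    layers.MaxPooling2D(2),
    layers.Conv2D(128, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(2),
    # Three Convolution layers followed by another Max-Pooling step
    layers.Conv2D(256, 3, activation="relu", padding="same"),
    layers.Conv2D(256, 3, activation="relu", padding="same"),
    layers.Conv2D(256, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(2),
    # Two Fully Connected layers; the last one picks one of 1,000 categories
    layers.Flatten(),
    layers.Dense(512, activation="relu"),
    layers.Dense(1000, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```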
The right Architecture for a Neural Network
How can you know what steps need to be combined in order to make sure the
Neural Network is able to recognize the images? Honestly, the only way to
give a unique answer to this question is by doing a lot of testing. You might
need to test 100 networks before finding the optimal structure and parameters
for the problem you are trying to solve. Machine Learning requires a lot of
trial and error. What if we were now to build a bird classifier?
At this point we have learned enough about Deep Convolutional Neural
Networks to be able to write a program that recognizes whether a photo
contains a bird or not. As always, we need some data to get started. The
CIFAR10 dataset contains 6,000 bird pictures and 54,000 images of things
that are not birds. Nevertheless, to get even more data, we can add the
Caltech-UCSD Birds-200-2011 dataset, which contains more than 12,000 bird
photos. So our training data is a mix of bird photos taken from the datasets
described above and of the 54,000 photos that do not contain birds.
This data set works well for our purposes, but 72,000 low resolution images
would not be enough for real-world applications. To achieve performance at
the same levels as Google, we would need millions of high resolution images.
In Machine Learning, having more data is usually more important than
having better algorithms. That is why Google is so happy to offer unlimited
space for your photos and documents. They want your data, sweet data! To
build our classifier, we will use TFLearn. TFLearn is a wrapper around
TensorFlow, the deep learning library offered free of charge by Google, and
it exposes a simplified API for precisely this reason: to make a
Convolutional Neural Network as easy to build as a couple of lines of code.
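A rough sketch of what such a TFLearn classifier could look like (not the author's exact code; the layer sizes are illustrative, and the placeholder arrays X and Y stand in for the real 32x32 images and their bird / not-bird labels):

```python
import numpy as np
import tflearn
from tflearn.layers.core import input_data, fully_connected, dropout
from tflearn.layers.conv import conv_2d, max_pool_2d
from tflearn.layers.estimator import regression

# Placeholder data so the sketch runs; in reality X would hold the 32x32 bird /
# not-bird images and Y their one-hot labels ([1,0] = not bird, [0,1] = bird).
X = np.random.rand(100, 32, 32, 3)
Y = np.eye(2)[np.random.randint(0, 2, 100)]

net = input_data(shape=[None, 32, 32, 3])
net = conv_2d(net, 32, 3, activation='relu')
net = max_pool_2d(net, 2)
net = conv_2d(net, 64, 3, activation='relu')
net = max_pool_2d(net, 2)
net = fully_connected(net, 512, activation='relu')
net = dropout(net, 0.5)
net = fully_connected(net, 2, activation='softmax')   # two outputs: not-bird / bird
net = regression(net)

model = tflearn.DNN(net)
model.fit(X, Y, n_epoch=5, validation_set=0.1, show_metric=True)

# Once trained, a single image can be checked like this:
print(model.predict(X[:1]))   # e.g. [[0.03, 0.97]] -> the network thinks "bird"
```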
If we train our Neural Network with a good video card and enough RAM
(such as an Nvidia GeForce GTX 980 Ti or better), the whole process will
last less than an hour. If you train it with a normal CPU, it may take a lot
longer. More training corresponds to better identification skills. After the
first pass, I got 75.4% accuracy. After only 10 passes, accuracy is already at
91.7%. After 50 passes, the marginal gain shrinks with every step, and
accuracy peaks at 95.5%.
Congratulations! Our program is now able to recognize bird pictures!
Test your Neural Network
Now that we have a well-trained neural network, it is time to test it! A simple
script can take a single image and tell us whether it is a bird or not. To see
how effective our network really is, though, we need to try it on many
images. For this purpose, I kept 15,000 images from the initial dataset to use
as validation. When I analyzed those 15,000 images, the Neural Network
found the correct answer in 95% of cases. That seems a good result, right?
Well, it depends: how good is 95% accuracy, really?
Our network claims to be correct in 95% of cases. However, the devil is in
the details, and our 95% could mean many different things. For example,
what if only 5% of our validation images contained birds while the remaining
95% did not? A program that predicted "It Is Not a Bird" every time would
be accurate in 95% of cases! However, it would also be 100% useless. We
need to look more closely at the numbers going beyond the overall level of
precision. To judge the reliability of a classification system, we need to look
beyond the error rate as well as the type of mistakes made. Instead of
thinking about the predictions given by the Neural Network in terms of
"right" and "wrong", we will divide them into four distinct categories:
1. True Positives: birds that our Neural Network properly identifies as birds.
2. True Negatives: photos that do not contain birds and that our Neural
Network properly identifies as non-birds.
3. False Positives: images that the Neural Network classifies as birds but
that actually do not contain any birds. Many planes were confused for
birds, understandable!
4. False Negatives: images that the Neural Network classifies as non-birds
but which are actually birds.
Why should we break down the results this way? Because not all errors are
the same. Imagine we were writing a program to detect cancer through
magnetic resonance imaging. In this case, we would much rather have false
positives than false negatives. A false negative would be the worst possible
case - the program would fail to diagnose cancer in a patient who is actually
suffering from it, delaying the start of treatment.
Instead of just looking at overall accuracy, we can calculate the Precision and
Recall metrics, because they give us a clearer picture of the Neural
Network's performance. Precision is the share of images classified as birds
that really are birds (true positives / (true positives + false positives)); Recall
is the share of actual birds that the network manages to find (true positives /
(true positives + false negatives)). In our case, Precision comes out at about
97% and Recall at about 90%.
This tells us that 97% of the time we have been able to identify the birds
correctly!
But it also tells us that we found only 90% of the birds in the dataset. In other
words, the Neural Network has not been able to recognize all of the birds, but
it is very sure of its judgment once it has identified one!
Now that you know the basics of Deep Convolutional Neural Networks,
you can try playing with tflearn and testing various Neural Network
architectures. Tflearn provides data sets to skip the data collection step and
immediately get to the algorithm writing part.
Key points of this Chapter
In this Chapter, we have learnt the process of writing a
Convolutional Neural Network that recognizes objects within images;
We have learnt what Convolution is, a tool that resembles human
intuitive understanding of the conceptual hierarchy within an image;
We have gone through an image processing pipeline which involved a
series of different steps: Convolution, Max-Pooling and finally a Fully
Connected Neural Network;
To solve real-world problems, these steps can be combined and stacked
as many times as necessary. The more Convolution steps you have, the
more complex the features your neural network will be able to learn and
process.
CONCLUSIONS
“Torture the data, and it will confess to anything.”
– Ronald Coase
Machine Learning is the science that allows computers to perform certain
actions without having been explicitly programmed to execute them and
represents one of the fundamental areas of development in the future of
artificial intelligence. Deep Learning is a branch of Machine Learning based
on a set of algorithms that attempt to model high-level abstractions in data;
generally, these algorithms involve multiple stages of processing, often with
a complex structure, and these stages normally consist of a series of
non-linear transformations. As you have learnt in this book,
convolutions are widely used in Deep Learning, especially for computer
vision applications. The architectures used are often anything but simple.
Very recently, Facebook has decided to invest heavily in this sector with the
aim of being able to understand (and exploit economically) the hidden
meaning of every single post or image published by every one of the millions
of users of the most famous social network of the world. All of the great hi-
tech players, starting from the already mentioned Facebook up to Google,
passing through Yahoo! and Microsoft are watching with great attention the
developments taking place in this sector, investing more than a few dollars in
the most advanced research institutes. The reason is simple: deep learning
could (and should) improve the way computer systems analyze natural
language, and as a result it should improve understanding. If the path taken
leads to the desired results, deep learning should allow the neural networks
that make up these information systems to process natural languages just as
the human brain does. Computers, in
short, will be able to understand what human users write on their bulletin
board or what they really want to look for; whether they are sad or happy; if
the image they have just viewed was disliked or not.
Facebook, Google, Microsoft and Yahoo!, therefore, could create
increasingly in-depth and accurate commercial insights, reselling this data to
communication and marketing agencies all over the world for increasingly
accurate targeted advertising campaigns tailored to the needs of users.
As often happens in the world of high technology, also small companies have
a pivotal role in launching and carrying forward great revolutions. And this
has happened also in the field of deep learning, where small startups are
doing a lot of the work. AlchemyAPI, for example, is a small software house
that has been active in this sector for some time now. So far, its efforts have
focused on implementations for natural language recognition, but more
recently it has launched its platform of neural networks for image
recognition.
Microsoft, meanwhile, is among the companies that have reported the greatest
results in this field. Last November, at an event in China, Microsoft's research
labs showed the world what their deep learning systems are capable of,
demonstrating an automated system for instant translation from English to
Mandarin Chinese. More quietly, Yahoo! is also trying to make progress in the
sector. Following the acquisition and incorporation policy driven by CEO
Marissa Mayer, Yahoo! has recently acquired two of the best start-ups in the
field of deep learning: IQ Engines and LookFlow. Probably much of the
know-how of these two companies will be used to make Flickr more and more
"intelligent" and dynamic, but the volcanic CEO may well have further
surprises in store for her users.
Finally, Google is probably the company that has invested most in this sector
to make its search algorithm perfect. Thanks to the steps forward recorded by
deep learning, Big G is able to offer exclusive services and tools to its users.
The voice recognition system on which Google Now is based, for example, is
the result of in-depth research work in this area, as is the face recognition
system on which one of the many photographic features of Google+ is based.
Not to mention, then, the linguistic analysis tools used by the search engine,
for results that are increasingly refined and close to the needs of users.
In Deep Learning there are still many unknown areas, and a lot of work
remains to be done. The theory that explains why it works so well is still
incomplete, and probably no book or guide is better than direct experience.