Machine Learning and Pattern Recognition Week 8 Neural Net Intro
We’ve seen that we can get a long way with linear models, and generalized linear models
(linear models combined with a non-Gaussian observation model).
Linear models are still widely used, and should still be implemented as baselines, even if
you’re convinced you need something more complicated. However, making a linear model
work well might require some insight into how to transform the inputs and outputs (“feature
engineering”). You can think of neural networks1 as linear models with additional parts,
where at least some of the feature transformations can also be learned.
Parameters are fitted for a series of stages of computation, rather than just the weights for a
single linear combination. The benefit of neural networks over linear models is that we can
learn more interesting functions. But fitting the parameters of a neural network is harder:
we might need more data, and the cost function is not convex.
Suppose we transform the inputs with K adaptive basis functions, or “hidden units”,

h_k(x) = g(v^(k)⊤ x + b^(k)),   k = 1, …, K,

where g is a fixed non-linearity such as a logistic sigmoid or a ReLU, and then take a linear combination of those to form our final function:

f(x) = w⊤ h(x) + b.

Here I’ve chosen to put a bias parameter in the final step, rather than adding a constant basis function. This function is a special case of a “neural network”. In particular, it is a “feedforward (artificial) neural network”, or “multilayer perceptron” (MLP).
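To make the shapes concrete, here is a minimal NumPy sketch of this one-hidden-layer function, assuming a logistic-sigmoid non-linearity g. The function name feedforward and the variable shapes are illustrative choices, not part of the notes.

import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def feedforward(X, V, bk, w, b):
    """One-hidden-layer network: f(x) = w^T h(x) + b, h_k(x) = sigmoid(v^(k)^T x + b^(k)).

    X  : (N, D) matrix of inputs, one row per example
    V  : (K, D) hidden-unit weights, row k is v^(k)
    bk : (K,)   hidden-unit biases b^(k)
    w  : (K,)   output weights
    b  : scalar output bias
    """
    H = sigmoid(X @ V.T + bk)   # (N, K) hidden-unit values h(x)
    return H @ w + b            # (N,)  function values f(x)

# Tiny usage example with random parameters:
rng = np.random.default_rng(0)
N, D, K = 5, 3, 4
X = rng.standard_normal((N, D))
V, bk = rng.standard_normal((K, D)), rng.standard_normal(K)
w, b = rng.standard_normal(K), 0.1
print(feedforward(X, V, bk, w, b))   # five function values, one per input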
The function has many parameters θ = {{v^(k), b^(k)}_{k=1}^K, w, b}. What would make it a neural network is fitting all of these parameters θ to data. Rather than placing basis functions by hand, we pick the family of basis functions, and “learn” the locations and any other parameters from data. A neural network “learning algorithm” is simply an optimization procedure that fits the parameters to data, usually (but not always) a gradient-based optimizer that iteratively updates the parameters to reduce the cost. In practice, optimizers can usually only find a local optimum, and optimization is often terminated before it converges to even a local optimum.
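As an illustration of what such a learning algorithm can look like, here is a hedged sketch of batch gradient descent on a (halved) mean squared error cost for the one-hidden-layer model above, with the gradients written out by hand. The function name, initialization scale, step size, and iteration count are arbitrary illustrative choices, not recommendations from the notes.

import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def fit_one_hidden_layer(X, y, K=10, lr=0.1, n_steps=1000, seed=0):
    """Fit f(x) = w^T sigmoid(V x + bk) + b by gradient descent on 0.5 * mean((f - y)^2)."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    V = 0.1 * rng.standard_normal((K, D)); bk = np.zeros(K)
    w = 0.1 * rng.standard_normal(K); b = 0.0
    for _ in range(n_steps):
        # Forward pass:
        A = X @ V.T + bk            # (N, K) hidden-unit pre-activations
        H = sigmoid(A)              # (N, K) hidden-unit values
        f = H @ w + b               # (N,)  predictions
        # Backward pass, gradients of C = 0.5 * mean((f - y)**2):
        f_bar = (f - y) / N                       # dC/df
        w_grad, b_grad = H.T @ f_bar, f_bar.sum()
        A_bar = np.outer(f_bar, w) * H * (1 - H)  # dC/dA, using sigmoid's derivative H(1 - H)
        V_grad, bk_grad = A_bar.T @ X, A_bar.sum(axis=0)
        # Gradient-descent updates; at best this finds a local optimum:
        V -= lr * V_grad; bk -= lr * bk_grad
        w -= lr * w_grad; b -= lr * b_grad
    return V, bk, w, b

# Toy usage: fit a noisy sine curve with K=20 hidden units
rng = np.random.default_rng(1)
X = np.linspace(-3, 3, 100)[:, None]
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(100)
V, bk, w, b = fit_one_hidden_layer(X, y, K=20, lr=0.5, n_steps=5000)

Running the sketch with different seeds or step sizes will generally end in different local optima, which is exactly the behaviour described in the paragraph above.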
2. A natural question from keen students at this point is: “what non-linearity should I use?”. As with many questions in machine learning, the answer is “it depends” and “we don’t know yet”. ReLUs (named after Relu Patrascu, a friendly sysadmin at the University of Toronto) replaced logistic sigmoids in the generic hidden layers of many neural networks, partly because they are easier to fit. However, now I would always use a PReLU instead, which has worked better in the cases I’ve tried. There are several other variants, including GELUs and SELUs. The small differences between these non-linearities don’t tend to be where big advances come from. Fully differentiable non-linearities like the soft-plus, log(1 + e^a), which looks like a smoothed ReLU, will make some optimizers happier. Logistic sigmoids are still useful as switches, for example in mixtures of experts, LSTMs, and adapting models. Although some of this work is theoretically motivated, what cross-validates best is what ultimately wins in practice.
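For reference, here is a small NumPy sketch of the non-linearities mentioned in this footnote (GELUs and SELUs omitted). Fixing the PReLU slope alpha is an illustrative simplification: in a real PReLU it is a learned parameter.

import numpy as np

def logistic_sigmoid(a):
    return 1 / (1 + np.exp(-a))            # useful as a soft switch in [0, 1]

def relu(a):
    return np.maximum(0, a)                # rectified linear unit

def prelu(a, alpha=0.1):
    return np.where(a > 0, a, alpha * a)   # alpha would normally be learned per unit

def softplus(a):
    return np.logaddexp(0, a)              # log(1 + e^a), smooth and shaped like a ReLU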
5 Further reading
Bishop’s introduction to neural networks is Section 5.1. Bishop also wrote another book,
published in 1995: Neural Networks for Pattern Recognition. Despite being 25 years old, and so
missing out on more recent insights, it’s still a great introduction!
MacKay’s textbook Chapter 39 is on the “single neuron classifier”. The classifier in that chapter is precisely logistic regression, but described in neural network language. Maybe this alternative view will help.
Murphy’s quick description of Neural Nets is in Section 16.5, which is followed by a literature
survey of other variants.
Theoretical Neuroscience (Dayan and Abbott) has more detail about biological neural networks
and theoretical models of how they learn.