DLAI4 Revision
Deep Learning and Artificial Intelligence, Epiphany 2024
Contents
1 Introduction
3 Statistical learning
5 Representation
6 Information theory
6.1 Sigma algebras
6.2 Entropy, mutual information
7 Training
9 Energy-based networks
1 Introduction
Well done on making it through the course!
In this lecture we’ll revise some key elements of the course. We will go through the sequence of lecture
topics from the year, summarising what I expect you to be able to do for the exam. Rather than going into
detail, we will highlight the most important ideas from each topic.
I emphasise that the best way to revise is probably to work through the questions at the end of the lecture slides
and the problems/solutions from the formative assignments. There are also further questions in Calin [2020]
and Zhang et al. [2021], which are worth looking at.
• We usually consider data to be generated according to some process which leads to an underlying data
distribution.
• The general idea of machine learning can be thought of as trying to find latent (low-dimensional) representations of (generally) high-dimensional data distributions. That is, for data X, we want to find some function
z with X′ = z(X) for which:
1. dim(X′) ≪ dim(X)
2. z is well-behaved, in some sense
3. Either z is nearly one-to-one, so we can recover X from X′, or for some outcome Y of interest we have
that Y | X′ has approximately the same distribution as Y | X.
where item 3 essentially states that we ‘simplify’ X while retaining its ‘usefulness’ (see the sketch after this list).
• Usually, if X comes from some high-dimensional space, we are concerned with some function f(X), which
we don’t know, but for which we have some observations, and we want to be able to evaluate f(X) at other
values of X. We won’t be able to evaluate f exactly, but we can approximate it with simpler functions.
• Neural networks can approximate any realistic function over X arbitrarily well, by making them wide enough. If there is latent
structure in X, we can represent such functions more efficiently by making the neural network deeper.
• A major advantage of neural networks over other machine learning methods is that they can be trained
efficiently, using backpropagation, which exploits the layered structure of the network.
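To make the representation idea above concrete, here is a minimal sketch in Python (illustrative only, and not part of the course material): it uses PCA as a stand-in for the map z on simulated data; the dimensions, noise level and choice of PCA are all assumptions made here for the example.

import numpy as np

rng = np.random.default_rng(0)

# Simulated data with latent structure: 1000 points in R^100 that lie close to
# a 5-dimensional subspace, plus a little noise.
latent = rng.normal(size=(1000, 5))
mixing = rng.normal(size=(5, 100))
X = latent @ mixing + 0.01 * rng.normal(size=(1000, 100))

# z(X): project onto the top k principal directions (a simple linear representation).
k = 5
mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)

def z(X):
    return (X - mean) @ Vt[:k].T          # X' = z(X), with dim(X') = 5 << 100

X_prime = z(X)
X_recovered = X_prime @ Vt[:k] + mean     # approximate inverse, since z is nearly one-to-one here

relative_error = np.linalg.norm(X - X_recovered) / np.linalg.norm(X)
print(X_prime.shape, "relative reconstruction error:", round(float(relative_error), 4))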
3 Statistical learning
The aim of this lecture is to revise standard ideas from statistical learning, most of which you will have encountered
in previous courses.
You should be able to:
1. Understand the ideas of a dataset, the data distribution, expected value, and loss/cost function.
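As a small reminder of how these objects fit together, here is a minimal sketch (again illustrative only: the simulated data, the fixed predictor and the squared-error loss are assumptions made for the example) of estimating an expected loss by its average over a dataset.

import numpy as np

rng = np.random.default_rng(1)

# A dataset of n samples (x_i, y_i) drawn from a simple data distribution y = 2x + noise.
n = 500
x = rng.normal(size=n)
y = 2.0 * x + 0.1 * rng.normal(size=n)

def predictor(x):
    return 1.8 * x                  # some fixed (imperfect) predictor

def loss(y_hat, y):
    return (y_hat - y) ** 2         # squared-error loss/cost function

# The empirical risk is the sample mean of the loss over the dataset; by the law of
# large numbers it estimates the expected loss E[loss(predictor(X), Y)].
empirical_risk = np.mean(loss(predictor(x), y))
print("empirical risk:", round(float(empirical_risk), 4))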
5 Representation
These lectures are intended as a run-through of the major representation results in the theory of neural networks. The
most important take-away is the general heuristic idea of why neural networks can approximate arbitrary functions;
you should be able to take a simple network architecture and a simple (but general) class of functions, and design
a neural network which can approximate any function in that class.
Generally, you should be able to:
1. Define an n-discriminatory activation function and indicate whether common activation functions are 1-discriminatory.
2. State and apply the standard universal approximation theorems of Cybenko and Hornik (the shape of Cybenko’s result is recalled after this list).
3. Understand the idea of approximating a class of functions with another, and the use of the supremum norm
for this purpose.
4. Describe the set of functions which can be exactly implemented by a simple neural network.
5. For certain simple neural networks and simple classes of functions, show universal approximation results from
scratch (see lecture exercises and assignments for examples).
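For reference, the shape of Cybenko’s result (in one common notation, which may differ slightly from the lecture slides): it concerns finite sums of the form

G(x) = Σ_{i=1}^{N} α_i σ(w_i · x + θ_i),    with α_i, θ_i ∈ ℝ and w_i ∈ ℝ^n,

and states that if σ is a continuous discriminatory (for example, sigmoidal) activation function, then such sums are dense in C([0, 1]^n) with respect to the supremum norm: for every continuous f on [0, 1]^n and every ε > 0 there is a G of this form with sup_x |f(x) − G(x)| < ε. Hornik’s theorem gives a conclusion of the same flavour for multilayer feedforward networks with general squashing activation functions.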
6 Information theory
These lectures are as close as we get to a link between the fundamental ideas of machine learning and the practical
maths of how they work. We look at information in two ways.
2. Recall the basic properties of entropy, differential entropy, mutual information, and conditional entropy.
3. Be able to prove basic inequalities regarding entropy and conditional entropy. Remember the use of Jensen’s
inequality (or just the inequality ln(x) ≤ x − 1) in such proofs; an example is sketched after this list.
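For instance, here is a sketch (the notation is mine, and may differ from the lecture slides) of the non-negativity of relative entropy, from which several of the entropy inequalities follow. For discrete distributions p and q, summing over the x with p(x) > 0,

−D(p‖q) = Σ_x p(x) ln( q(x)/p(x) ) ≤ Σ_x p(x) ( q(x)/p(x) − 1 ) = Σ_x q(x) − Σ_x p(x) ≤ 1 − 1 = 0,

using ln(t) ≤ t − 1 with t = q(x)/p(x), so D(p‖q) ≥ 0. Choosing q to be the uniform distribution gives H(X) ≤ log |X|, and taking p and q to be the joint and product-of-marginals distributions of (X, Y) gives I(X; Y) ≥ 0, equivalently H(X | Y) ≤ H(X).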
7 Training
In this series of lectures we looked at training neural networks, and general training algorithms for machine learning
problems. You should be able to:
1. Derive the backpropagation formulas for a standard neural network (the usual form is recalled after this list).
2. Describe the problems which arise if gradient descent proceeds too slowly or too fast.
3. Roughly describe the dropout algorithm and why it is useful.
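For reference, the usual form of these formulas (in one common notation, which may differ from the lecture slides): for a feed-forward network with pre-activations z^l = W^l a^(l−1) + b^l, activations a^l = σ(z^l), output layer L and cost C, write δ^l = ∂C/∂z^l. Then

δ^L = ∇_{a^L} C ⊙ σ′(z^L),
δ^l = ((W^(l+1))^T δ^(l+1)) ⊙ σ′(z^l)    for l < L,
∂C/∂W^l = δ^l (a^(l−1))^T    and    ∂C/∂b^l = δ^l,

where ⊙ denotes the element-wise product; the second line is the backward recursion which gives the algorithm its name.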
9 Energy-based networks
In this topic, we looked at a different conception of neurons, which fire randomly with a given probability. As well
as being of considerable theoretical interest, neural networks of this type can be used effectively to learn distributions.
You should be able to:
1. Sketch a stochastic neuron and describe its output as a probability distribution depending on its inputs.
2. Sketch a Boltzmann machine and a restricted Boltzmann machine.
3. Describe how a Boltzmann machine evolves over time.
4. Give and use the formula for the energy of a configuration, and the Boltzmann distribution over states (both are recalled after this list).
5. Describe a Boltzmann machine as a Markov chain and show formally that the long-run probability of being
in a given state is given by the Boltzmann distribution.
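For reference (in a standard notation, cf. Hinton [2012], which may differ slightly from the lectures, and taking the temperature to be 1): for a restricted Boltzmann machine with visible units v, hidden units h, biases a, b and weights W, the energy of a configuration is

E(v, h) = − Σ_i a_i v_i − Σ_j b_j h_j − Σ_{i,j} v_i W_{ij} h_j,

and the Boltzmann distribution over joint states is

P(v, h) = e^{−E(v,h)} / Z,    where Z = Σ_{v′,h′} e^{−E(v′,h′)},

so that low-energy configurations are exponentially more probable. A general Boltzmann machine has the same form of energy, but with weights allowed between every pair of units.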
Exercises
References
Jay Alammar. The illustrated transformer [blog post], 2018. URL https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/.
Robert B Ash. Information theory. Courier Corporation, 2012. URL https://doc.lagout.org/Others/Information%20Theory/Information%20Theory/Information%20Theory%20-%20Robert%20Ash.pdf.
Ovidiu Calin. Deep Learning Architectures: A Mathematical Approach. Springer, 2020.
George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.
Ayan Das. Building diffusion model’s theory from ground up. In The Third Blogpost Track at ICLR 2024, 2024. URL https://d2jud02ci9yv69.cloudfront.net/2024-05-07-diffusion-theory-from-scratch-58/blog/diffusion-theory-from-scratch/.
Giancarlo Giacaglia. How transformers work, 2019. URL https://towardsdatascience.com/transformers-141e32e69591.
Trevor Hastie, Robert Tibshirani, and Jerome H Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, volume 2. Springer, 2009.
Geoffrey E Hinton. A practical guide to training restricted Boltzmann machines. In Neural Networks: Tricks of the Trade: Second Edition, pages 599–619. Springer, 2012.
Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.
Moshe Leshno, Vladimir Ya Lin, Allan Pinkus, and Shimon Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861–867, 1993.