
Universal Approximation

Anano Tamarashvili, Ani Okropiridze, Mariami Mamageishvili

November 28, 2024

Contents

1 Types of Convergence

2 Lebesgue Dominated Convergence Theorem

3 Hahn-Banach Theorem

4 Riesz Representation Theorem

5 Universal Approximation

6 Common Activation Functions

7 References

Types of Convergence

Pointwise Convergence

Pointwise convergence defines the convergence of functions in terms of the convergence of their values at each point of their domain.
Definition
Suppose that (fn ) is a sequence of functions fn : A → R and f : A → R. Then
fn → f pointwise on A if fn (x) → f (x) as n → ∞ for every x ∈ A.

We say that the sequence (fn) converges pointwise if it converges pointwise to some function f, in which case

f(x) = \lim_{n \to \infty} f_n(x).


Uniform Convergence

Definition
Suppose that (fn ) is a sequence of functions fn : A → R and f : A → R. Then
fn → f uniformly on A if, for every ε > 0, there exists N ∈ N such that

n > N =⇒ |fn (x) − f (x)| < ε for all x ∈ A.
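
A minimal numerical sketch of the difference between the two notions: on [0, 1) the sequence f_n(x) = x^n converges pointwise to 0, but its supremum stays near 1, so the convergence is not uniform (the grid size and values of n below are illustrative choices).

```python
import numpy as np

# f_n(x) = x^n on [0, 1): the pointwise limit is 0, but sup|f_n - 0| stays near 1.
x = np.linspace(0.0, 0.999, 10_000)   # fine grid on [0, 1) (illustrative choice)

for n in (1, 10, 100, 1000):
    fn = x ** n
    print(f"n={n:5d}  f_n(0.5)={0.5 ** n:.2e}  sup|f_n| ~ {fn.max():.3f}")
# f_n(0.5) -> 0 as n grows, while the sup over [0, 1) remains close to 1,
# so f_n -> 0 pointwise but not uniformly.
```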


Lebesgue Dominated Convergence Theorem

Theorem
Let X be a measure space, µ be a Borel measure on X, g : X → R be L1 and
{fn } be a sequence of measurable functions from X → R such that
|fn (x)| ≤ g(x) for all x ∈ X and {fn } converges pointwise to a function f . Then
f is integrable and
\lim_{n \to \infty} \int f_n(x)\, d\mu(x) = \int f(x)\, d\mu(x).


Why is domination necessary?


Let’s see where things can go wrong if a sequence {fn } is not dominated by any
function. Take, for instance, the sequence of functions {fn } where for each n ∈ N
we define
f_n(x) = n\,\chi_{(0,1/n]}(x) = \begin{cases} n, & \text{if } 0 < x \le 1/n \\ 0, & \text{otherwise.} \end{cases}
Then fn → 0 pointwise. But notice there is no integrable function g such that
|fn (x)| ≤ g(x) for all x ∈ (0, 1] and for all n. This is because for large values of
n, the height of fn is tending towards infinity, or in other words, the fn are
unbounded. Since the area of each rectangle is 1, we see that the integral and the
limit do not commute in this example. Explicitly:
1 = \lim_{n \to \infty} \int_0^1 f_n(x)\, dx \ne \int_0^1 \lim_{n \to \infty} f_n(x)\, dx = 0,

where we have used the fact that \lim_{n \to \infty} f_n(x) = 0
and that the Riemann and Lebesgue integral coincide in this case.
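
A short numerical check of this computation (a sketch; the midpoint-rule grid is an illustrative choice):

```python
import numpy as np

def integral_fn(n, num_points=1_000_000):
    # Midpoint-rule approximation of the integral of f_n = n * chi_(0, 1/n]
    # over (0, 1].
    x = (np.arange(num_points) + 0.5) / num_points
    fn = np.where(x <= 1.0 / n, float(n), 0.0)
    return fn.mean()   # mean value times the interval length (= 1)

for n in (1, 10, 100, 1000):
    print(f"n={n:5d}  integral of f_n over (0,1] ~ {integral_fn(n):.4f}")
# Every integral is ~1, while the pointwise limit is the zero function,
# whose integral is 0 -- the limit and the integral do not commute here.
```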

An example using the DCT


Compute the following integral:

\lim_{n \to \infty} \int_{\mathbb{R}} \frac{n \sin(x/n)}{x(x^2 + 1)}\, dx.

Solution
Let x ∈ R and begin by defining

f_n(x) = \frac{n \sin(x/n)}{x(x^2 + 1)}

Each f_n is measurable and the sequence {f_n} converges pointwise to 1/(1 + x²):

\lim_{n \to \infty} f_n(x) = \lim_{n \to \infty} \frac{\sin(x/n)}{x/n} \cdot \frac{1}{1 + x^2} = \frac{1}{1 + x^2},

since

\lim_{n \to \infty} \frac{\sin(x/n)}{x/n} = 1.

Continuation

From this, we also see that g(x) = 1/(1 + x²) works as a dominating function. Indeed, g is integrable on R and

|f_n(x)| = \left| \frac{\sin(x/n)}{x/n} \right| \cdot \frac{1}{1 + x^2} \le \frac{1}{1 + x^2} = g(x),

since |sin(x)| ≤ |x| for all x.


Thus, we apply the Dominated Convergence Theorem (DCT) to conclude

\lim_{n \to \infty} \int_{\mathbb{R}} \frac{n \sin(x/n)}{x(x^2 + 1)}\, dx = \int_{-\infty}^{\infty} \lim_{n \to \infty} \frac{n \sin(x/n)}{x(x^2 + 1)}\, dx = \int_{-\infty}^{\infty} \frac{1}{1 + x^2}\, dx = \left[ \tan^{-1}(x) \right]_{-\infty}^{\infty} = \pi.
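
A numerical sanity check of this limit (a sketch; scipy's quad routine and the values of n below are illustrative choices):

```python
import numpy as np
from scipy.integrate import quad

def integrand(x, n):
    # n*sin(x/n)/x = sin(x/n)/(x/n); np.sinc(t) = sin(pi t)/(pi t), which
    # handles the removable singularity at x = 0.
    return np.sinc(x / (n * np.pi)) / (1.0 + x**2)

for n in (1, 5, 50, 500):
    value, _ = quad(integrand, -np.inf, np.inf, args=(n,))
    print(f"n={n:4d}  integral ~ {value:.6f}")
print(f"pi        = {np.pi:.6f}")   # the integrals approach pi, as the DCT predicts
```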


Example

Consider training a neural network that tries to approximate a continuous target function f. We create a sequence of network functions fn where each fn
represents the network’s output after n training steps. As n → ∞, fn should
ideally converge to the target function f . LDCT ensures that the integral of fn
(related to the network’s prediction error over the domain) converges to the
integral of f . This is important for proving that, in the limit, the network can
approximate f over the entire domain, allowing for universal approximation.


Hahn-Banach Theorem

Theorem (Hahn-Banach Theorem - Geometric Form)


Let V be a normed vector space and A, B ⊂ V be two non-empty, closed,
disjoint, and convex subsets such that one of them is compact. Then there exists
a continuous linear functional f ̸≡ 0, some α ∈ R, and an ϵ > 0 such that

f (x) ≤ α − ϵ for any x ∈ A

and
f (y) ≥ α + ϵ for any y ∈ B.

Corollary
Let V be a normed vector space over R and U ⊂ V be a closed linear subspace such that
U ̸= V . Then there exists a continuous linear map f : V → R with

f (x) = 0 for any x ∈ U, and f ̸≡ 0.


Riesz Representation Theorem

Theorem
Let Ω be a compact subset of Rn and F : C(Ω) → R be a bounded linear functional on the space of continuous real functions on Ω. Then there exists a signed Borel measure µ on Ω such that for any f ∈ C(Ω), we have that

F(f) = \int_{\Omega} f(x)\, d\mu(x).


The Riesz Representation Theorem connects linear functionals to integrals involving a measure, which is essential for understanding how neural networks can approximate functions in functional spaces. In functional approximation, we often want to evaluate how "close" an approximation is to a target function, and this theorem provides a structured way to express functionals (such as a loss function) as integrals.


Riesz Representation Theorem in Hilbert Space

Theorem
If T is a bounded linear functional on a Hilbert space H, then there exists some
g ∈ H such that for every f ∈ H we have

T (f ) = ⟨f, g⟩.

Moreover, ∥T ∥ = ∥g∥, where ∥T ∥ denotes the operator norm of T , and ∥g∥ is the
Hilbert space norm of g.
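
A finite-dimensional sketch of the theorem: in H = R^n with the dot product, the functional T(f) = a · f is represented by g = a, and its operator norm equals ‖g‖ (the random vectors and sampling-based norm estimate below are illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
a = rng.normal(size=n)      # defines the functional T(f) = a . f
g = a                       # its Riesz representer

f = rng.normal(size=n)
print("T(f)   =", a @ f)
print("<f, g> =", f @ g)    # the same value

# Estimate the operator norm sup_{||f|| = 1} |T(f)| by sampling unit vectors.
samples = rng.normal(size=(100_000, n))
samples /= np.linalg.norm(samples, axis=1, keepdims=True)
print("sup |T(f)| over sampled unit f ~", np.abs(samples @ a).max())
print("||g||                          =", np.linalg.norm(g))
```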


Example: Optimization in Machine Learning and Signal Processing

In many applications, we need to minimize functionals of the form J(f) = ∥f − g∥², where g ∈ H and H is a Hilbert space.
Using the Riesz Representation Theorem, the minimizer f is characterized by the inner product:

⟨f, g⟩ = ∥g∥².
Application: In support vector machines, we optimize using inner products
(kernels) where projections are onto a Hilbert space of functions.
The Riesz theorem underlies techniques for optimal filters, by allowing inner
product representations of functions.
This optimization concept is applied in designing classifiers and filters in
machine learning.


Example: Functional Representation in Sobolev Spaces (PDEs)
Sobolev spaces H¹(Ω) are Hilbert spaces of functions that, together with their weak derivatives, are square-integrable.
Given a functional L(v) = \int_{\Omega} f v\, dx arising from the data of a PDE, the theorem provides a unique u ∈ H¹(Ω) such that

L(v) = ⟨v, u⟩_{H¹(Ω)}.

Application: This representation is key in finite element methods for solving PDEs numerically.
For instance, the weak form of Poisson's equation −∆u = f on Ω with Dirichlet boundary conditions is solved by finding u ∈ H₀¹(Ω) such that

\int_{\Omega} \nabla u \cdot \nabla v\, dx = \int_{\Omega} f v\, dx \quad \text{for all } v \in H_0^1(\Omega).

This is widely used in engineering applications, like structural analysis and fluid dynamics.
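
A minimal 1D sketch of this idea: piecewise-linear finite elements for −u'' = f on (0, 1) with u(0) = u(1) = 0, where the weak form above becomes a tridiagonal linear system. The choice f(x) = sin(πx), whose exact solution is sin(πx)/π², and the lumped load vector are illustrative simplifications.

```python
import numpy as np

N = 100                          # number of interior nodes (illustrative choice)
h = 1.0 / (N + 1)                # mesh width
x = np.linspace(0.0, 1.0, N + 2)
f = lambda t: np.sin(np.pi * t)  # right-hand side; exact u = sin(pi x) / pi^2

# Stiffness matrix from the weak form  integral of u' v' dx  (tridiagonal).
A = (2.0 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)) / h

# Load vector  integral of f v dx, lumped as f(x_i) * h per hat function.
b = f(x[1:-1]) * h

u = np.concatenate(([0.0], np.linalg.solve(A, b), [0.0]))
exact = np.sin(np.pi * x) / np.pi**2
print("max error:", np.abs(u - exact).max())   # small; shrinks as N grows
```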

Example: Measure Theory and Probability

The Riesz Representation Theorem for continuous linear functionals ϕ on C([a, b]) states that there exists a unique measure µ such that

\phi(f) = \int f\, d\mu.

Application: This is foundational in probability, as it represents expectations in terms of integrals.
For a probability space with distribution µ, the expectation E[X] for a random variable X is represented as

E[X] = \int X\, d\mu.

This representation is crucial in defining probability distributions and expectations, used in fields like finance and statistical mechanics.
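
A small Monte Carlo sketch of this representation: the expectation is approximated by averaging samples drawn from µ (here an exponential distribution, chosen purely for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)

# mu = Exp(1) distribution, for which E[X] = 1 and E[X^2] = 2.
samples = rng.exponential(scale=1.0, size=1_000_000)

print("E[X]   ~", samples.mean())         # close to 1
print("E[X^2] ~", (samples ** 2).mean())  # close to 2
```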


Connecting to the Universal Approximation Theorem

The Universal Approximation Theorem states that a neural network with at least
one hidden layer and a non-linear activation function can approximate any
continuous function on a compact domain to any desired accuracy. The LDCT
and Riesz Representation Theorem support this concept by providing the
mathematical foundation for convergence and approximation:
LDCT helps ensure that as the neural network’s weights are adjusted to
approximate the target function, the sequence of network approximations
converges in an integrable sense.
Riesz Representation allows us to interpret approximation errors as integrable
functionals, linking them to the network’s error and providing a structured
way to evaluate convergence in function space.


Universal Approximation

Definition
Given a topological space Ω, we define

C(Ω) := {f : Ω → R | f is continuous}.

For Ω ⊆ Rn and an activation function f : R → R, Σn(f) denotes the set of functions of the form

x ↦ \sum_{i=1}^{N} c_i\, f(w_i^T x + θ_i), with N ∈ N, c_i, θ_i ∈ R, and w_i ∈ Rn,

i.e. the functions computed by single-hidden-layer networks with activation f.

Definition
Let Ω be a topological space and f : R → R. We say that a neural network with
activation function f is a universal approximator on Ω if Σn (f ) is dense in C(Ω),
the set of continuous functions from Ω to R.


What is the Universal Approximation Theorem?

The Universal Approximation Theorem states that a feedforward neural network with a single hidden layer and a finite number of neurons can approximate any continuous function on a compact subset of Rn, given an appropriate activation function.
Formally, let C(K) be the space of continuous functions on a compact set
K ⊆ Rn . For any continuous function f ∈ C(K) and for any ϵ > 0, there exists a
feedforward neural network fˆ with a single hidden layer such that:

|f (x) − fˆ(x)| < ϵ for all x ∈ K

This means that the neural network fˆ(x) can approximate the function f (x) to
within any arbitrary degree of accuracy ϵ, given a sufficient number of neurons in
the hidden layer.


How Neural Networks Approximate Functions

Neural networks approximate functions by adjusting the weights and biases of their neurons. During training, the network iteratively adjusts these parameters to minimize the error between its predictions and the actual outputs.
Input Layer: Accepts input data.
Hidden Layers: Process the input through weighted connections and activation functions.
Output Layer: Produces the final result or prediction.


Neural Network Structure

A neural network's function fˆ(x) can be described mathematically as a composition of linear transformations and activation functions.
For a network with a single hidden layer, the output is given by:

\hat{f}(x) = \sum_{i=1}^{M} c_i\, \sigma(w_i^T x + b_i)

where:
M is the number of neurons in the hidden layer,
ci are the weights associated with the output layer,
wi and bi are the weights and biases of the hidden neurons, and
σ is the activation function (commonly non-linear).
By adjusting ci , wi , and bi , the neural network can approximate any continuous
function f (x) over a given domain.
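
A small sketch of this form in Python: with the hidden parameters w_i, b_i drawn at random and σ = tanh (illustrative choices), fitting only the outer coefficients c_i by least squares already approximates a smooth one-dimensional target closely, in the spirit of the theorem. Adjusting w_i and b_i as well, by gradient descent as discussed later, generally reduces the error further.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = np.tanh                          # a bounded, continuous, increasing activation

def hidden_features(x, w, b):
    # Hidden-layer activations sigma(w_i^T x + b_i) for 1-D inputs x: shape (N, M).
    return sigma(x @ w.T + b)

target = lambda x: np.sin(x) + 0.3 * x**2        # illustrative target on K = [-3, 3]
x_grid = np.linspace(-3, 3, 400).reshape(-1, 1)

M = 50                                   # number of hidden neurons
w = rng.normal(size=(M, 1))
b = rng.uniform(-3, 3, size=M)

H = hidden_features(x_grid, w, b)        # (400, M)
c, *_ = np.linalg.lstsq(H, target(x_grid).ravel(), rcond=None)

f_hat = H @ c                            # f_hat(x) = sum_i c_i sigma(w_i^T x + b_i)
print("max |f - f_hat| on K:", np.abs(f_hat - target(x_grid).ravel()).max())
```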


Compactness and Continuity

The theorem applies to functions defined on a compact set K ⊆ Rn.
A subset of Rn is compact if and only if it is closed and bounded (Heine–Borel).
Compactness ensures that a continuous function f (x) is bounded and uniformly continuous on K, which simplifies the approximation process.


Definition of n-Discriminatory Activation Function

Let n be a natural number. We say an activation function f : R → R is n-discriminatory if the only signed Borel measure µ such that

\int f(y \cdot x + \theta)\, d\mu(x) = 0 \quad \text{for all } y \in \mathbb{R}^n \text{ and } \theta \in \mathbb{R}

is the zero measure.
This implies that the activation function is discriminatory if the only way the
integral of the function can be zero for all inputs is if the measure is zero.


Intuition Behind n-Discriminatory Property

An n-discriminatory activation function is one that can distinguish between different input patterns for any dimensionality n.
It ensures that the network can separate different features or classes in the data by mapping them to non-zero outputs.
Example: The function effectively "filters" data by ensuring no non-trivial measure satisfies the integral condition.


Definition of Discriminatory Activation Function

Definition
An activation function f : R → R is discriminatory if it is n-discriminatory for all
natural numbers n.
This means that the activation function can distinguish between all possible linear
combinations of inputs in any dimensionality of the input space.


Importance of Discriminatory Activation Functions

Discriminatory activation functions are key to enabling neural networks to model complex, non-linear relationships.
They allow networks to separate data points or features that may not be
linearly separable.


Properties of Activation Functions

The theorem requires that σ(x) be:


Non-constant
Bounded
Continuous
Monotonically increasing
These properties allow the neural network to capture complex, non-linear
relationships in the data.

Common Activation Functions

ReLU Activation Function

Definition
The Rectified Linear Unit (also denoted ReLU) is a function R → R defined by

ReLU(x) = max(0, x)

Discriminatory Property: ReLU allows the network to "activate" only positive input signals, helping it discriminate between different feature types.
The ReLU activation function is commonly used in hidden layers of neural
networks.
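
A one-line sketch of ReLU and the sparse activations it produces (the sample values below are illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)   # ReLU(x) = max(0, x), applied elementwise

z = np.array([-2.0, -0.5, 0.0, 0.7, 3.1])
print(relu(z))                                   # [0.  0.  0.  0.7 3.1]
print("fraction of active units:", (relu(z) > 0).mean())
```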


Sigmoid Activation Function

Definition
A function f : R → R is called a sigmoid if it satisfies the following two properties:

\lim_{x \to \infty} f(x) = 1 \quad \text{and} \quad \lim_{x \to -\infty} f(x) = 0.

The standard logistic sigmoid maps input values to the range (0, 1):

\sigma(x) = \frac{1}{1 + e^{-x}}
Discriminatory Property: Sigmoid produces outputs that can be interpreted
as probabilities, distinguishing between different classes.
The Sigmoid activation function is widely used in output layers for binary
classification tasks.
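
A short sketch of the logistic sigmoid, its limits, and its use as a binary-classification score (the decision threshold 0.5 and the sample logit are illustrative choices):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-10.0, 0.0, 10.0])))    # approximately [0.00005, 0.5, 0.99995]
score = sigmoid(2.3)                             # sigmoid of a model's raw output (logit)
print("spam" if score > 0.5 else "not spam", f"(p = {score:.3f})")
```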


Softmax Activation Function

Softmax converts raw output values (logits) into probabilities:

\mathrm{Softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}

Discriminatory Property: Softmax ensures that the sum of the output probabilities equals 1, allowing clear distinction between multiple classes.
Softmax is used in the output layer for multi-class classification problems.
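
A short sketch of softmax over ten digit-class logits; subtracting the maximum before exponentiating is a standard numerical-stability step (the logit values are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())     # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([1.2, 0.3, -0.5, 2.8, 0.0, -1.1, 0.7, 0.1, -0.3, 1.9])
probs = softmax(logits)
print(probs.sum())                            # 1.0 -- a valid probability distribution
print("predicted digit:", probs.argmax())     # index 3 holds the largest logit
```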


ReLU in Neural Networks

ReLU is effective at learning complex features in the hidden layers of a neural network.
By activating only positive inputs, ReLU provides sparse activations, helping
the network focus on the most important features.
Example: In image recognition, ReLU helps the network detect edges,
shapes, and textures effectively.


Sigmoid for Binary Classification

Sigmoid is often used in the output layer for binary classification problems.
It provides a probability score, indicating the likelihood of a data point
belonging to a particular class.
Example: For spam email detection, the Sigmoid output helps determine
whether an email is spam (1) or not spam (0).


Softmax for Multi-Class Classification

Softmax is commonly used in the output layer for tasks like digit
classification (0-9).
It assigns probabilities to each class, enabling the network to select the most
probable category.
Example: For digit recognition, Softmax ensures that the network outputs a
probability distribution over the 10 digits.


Training Neural Networks with Discriminatory Activation Functions

Discriminatory activation functions ensure that the gradients during training are non-zero, enabling the network to learn effectively.
By promoting differentiation between inputs, these functions allow the
network to adjust weights efficiently during backpropagation.


Gradient Descent and Backpropagation

Gradient Descent is an optimization method that minimizes the loss function by updating weights based on the gradient.
Backpropagation calculates the gradient of the loss with respect to each
weight and propagates it backward through the network.
Discriminatory activation functions enable faster convergence and better
learning.
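
A compact sketch of gradient descent with hand-written backpropagation for a one-hidden-layer sigmoid network fit to a one-dimensional target; the hidden width, learning rate, step count, and target function are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 200).reshape(-1, 1)      # (N, 1) inputs
y = np.sin(2 * x)                                # (N, 1) targets (illustrative)

M, lr = 30, 0.05                                 # hidden width, learning rate
w = rng.normal(size=(1, M)); b = np.zeros(M)     # hidden-layer parameters
c = 0.1 * rng.normal(size=(M, 1)); d = 0.0       # output-layer parameters

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(5001):
    # Forward pass: y_hat = sigmoid(x w + b) c + d.
    a = sigmoid(x @ w + b)               # (N, M) hidden activations
    y_hat = a @ c + d                    # (N, 1) predictions
    loss = np.mean((y_hat - y) ** 2)

    # Backward pass (chain rule for the mean squared error).
    g_y = 2.0 * (y_hat - y) / len(x)     # dLoss/dy_hat
    g_c = a.T @ g_y                      # gradient w.r.t. c
    g_d = g_y.sum()                      # gradient w.r.t. d
    g_a = g_y @ c.T                      # back through the output layer
    g_z = g_a * a * (1.0 - a)            # sigmoid'(z) = a (1 - a)
    g_w = x.T @ g_z                      # gradient w.r.t. w
    g_b = g_z.sum(axis=0)                # gradient w.r.t. b

    # Gradient-descent update.
    w -= lr * g_w; b -= lr * g_b; c -= lr * g_c; d -= lr * g_d

    if step % 1000 == 0:
        print(f"step {step:4d}  loss {loss:.4f}")   # the loss decreases over training
```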


ReLU Example with Gradient Flow

ReLU passes gradients through unchanged for positive inputs (its derivative is 1 there), which helps prevent vanishing gradients.
This helps neural networks learn hierarchical features, making ReLU an
excellent choice for deep networks.
Example: In object detection, ReLU aids in distinguishing between key
features, improving feature extraction.


Practical Limitations of the Universal Approximation Theorem

While the Universal Approximation Theorem is mathematically elegant, it has practical limitations:
Network Size and Efficiency
The theorem guarantees approximation of any continuous function, but it
does not specify the required network size. In some cases, achieving high
accuracy may require an impractically large number of neurons.
Overfitting
While a network may fit the training data perfectly, it risks overfitting,
leading to poor performance on unseen data. Regularization techniques, such
as dropout or early stopping, help mitigate this risk.


Generalization
The theorem concerns approximating a known function on its whole domain; it does not guarantee that a network fit to finite training data will generalize to new data. Cross-validation is often used to assess and improve generalization.
Training Difficulties
Although the theorem asserts the existence of an approximation, it offers no
guidance on efficient training. Gradient-based methods may get stuck in local
minima or saddle points, complicating the search for an optimal solution.


References

https://www.geeksforgeeks.org/universal-approximation-theorem-for-neural-networks/
http://mathonline.wikidot.com/applying-lebesgue-s-dominated-convergence-theorem-1
Leonardo Ferreira Guilhoto, An Overview of Artificial Neural Networks for Mathematicians. https://math.uchicago.edu/~may/REU2018/REUPapers/Guilhoto.pdf
https://www.math3ma.com/blog/dominated-convergence-theorem
