Vahid Shahverdi
March 2023
that maps an input x to an output y, where θ is a set of parameters belonging to a set
P on which the function fθ depends. The set of all such functions fθ considered
in this setting is called the function space M. The function fθ is typically chosen
to minimize a loss function l(y, fθ(x)), which measures the difference between
the predicted output fθ(x) and the true output y.
Thus, the machine learning task can be expressed as the following optimization problem:
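For example, given training data (x1, y1), . . . , (xN, yN), a standard way to state this for a finite sample is the empirical risk minimization problem
$$
\min_{\theta \in P} \; \frac{1}{N} \sum_{i=1}^{N} l\bigl(y_i, f_\theta(x_i)\bigr).
$$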
Figure 1: Linearly separable data presented in a plane.
Figure 2: Binary classification of data points with a non-linear boundary obtained
via a kernel; image from [Gra23].
2 Neural Network
As a mathematician, you can think of a neural network as a function that takes
in some input data, processes it through a series of interconnected layers, and
produces an output. Each layer in the network is composed of nodes (also called
neurons), which perform mathematical operations on the input data.
These mathematical operations typically involve multiplying the input data
by weights (which the network learns through a process called training), adding
a bias term, and passing the result through a non-linear activation function;
see Figure 3. The activation function helps to introduce non-linearity into the
network, which allows it to model complex relationships between inputs and
outputs.
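As a concrete illustration (a minimal NumPy sketch; the layer sizes, the tanh activation, and the random weights are arbitrary choices, not taken from this note), one layer multiplies its input by a weight matrix, adds a bias, and applies a non-linearity, and a network is the composition of such layers:

```python
import numpy as np

def layer_forward(x, W, b, activation=np.tanh):
    """One dense layer: weights times input, plus bias, through a non-linearity."""
    return activation(W @ x + b)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)      # hidden layer with 4 neurons
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)      # output layer with 2 neurons

x = np.array([0.5, -1.0, 2.0])                            # input in R^3
h = layer_forward(x, W1, b1)                              # hidden representation
y_hat = layer_forward(h, W2, b2, activation=lambda z: z)  # linear output layer
print(y_hat)
```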
Figure 3: A single hidden layer neural network comprising a single neuron;
image from [Aji23].
Figure 4: The model depicted by the green line is overfitted, whereas the model
represented by the black line is regularized. Although the green line closely
adheres to the training data, it is excessively reliant on that data, making it
prone to higher error rates on fresh, unseen data than the black line; image
from [Ch23].
Several variations of this theorem have been developed to deal with different
types of networks. For instance, in [LLPS93], it is demonstrated that this struc-
ture also applies to networks with multiple hidden layers and non-polynomial
activation functions.
A further argument is that the ultimate objective of a neural network closely
resembles interpolation, as formalized in Theorem 2.1. In light of this, a natural
question is what the outcome of using polynomial interpolation would be. To
answer this question satisfactorily, a thorough understanding of overfitting is
required.
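The phenomenon can be seen numerically in a minimal sketch (the target function sin(πx), the noise level, and the degrees below are arbitrary choices): a degree 9 polynomial interpolates all 10 noisy training points, yet typically generalizes worse to fresh test points than a low-degree fit.

```python
import numpy as np

rng = np.random.default_rng(6)
x_train = np.linspace(-1, 1, 10)
y_train = np.sin(np.pi * x_train) + 0.1 * rng.normal(size=x_train.size)  # noisy samples
x_test = np.linspace(-1, 1, 200)
y_test = np.sin(np.pi * x_test)

for degree in (3, 9):                    # degree 9 interpolates all 10 training points
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, round(train_err, 4), round(test_err, 4))
```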
2. The study of the loss landscape [LXT+18] of a given neural network and
a given loss function refers to analyzing the critical points in function
space and in parameter space, e.g., distinguishing local/global minima
from saddle points, or investigating the sharpness/flatness around critical
points. The loss landscape encodes the static properties of the optimization
problem that do not depend on the choice of an optimization algorithm
such as gradient descent.
3. The dynamic optimization properties of a given network, loss, and optimization
algorithm entail, for instance, the convergence behavior to critical
points or understanding the curves traced by the optimization algorithm
[NRT21]; a minimal sketch of such a trajectory is given after this list.
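To make the distinction between the static landscape and the optimization dynamics concrete, the following NumPy sketch (random data, a two-layer linear network, and a hand-picked step size, all arbitrary choices) runs gradient descent on the square loss and records the curve that the end-to-end matrix W = W2 W1 traces in function space:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 50))           # d0 = 3, N = 50 data samples
Y = rng.normal(size=(2, 50))           # dL = 2
W1 = 0.1 * rng.normal(size=(4, 3))     # small initialization, hidden width 4
W2 = 0.1 * rng.normal(size=(2, 4))

lr = 1e-3
trajectory = []                        # curve traced in function space
for step in range(2000):
    R = W2 @ W1 @ X - Y                # residual of the loss ||W2 W1 X - Y||^2
    grad_W2 = 2 * R @ (W1 @ X).T       # chain rule through the parametrization
    grad_W1 = 2 * W2.T @ R @ X.T
    W2 -= lr * grad_W2
    W1 -= lr * grad_W1
    trajectory.append(W2 @ W1)         # the end-to-end matrix W = W2 W1

print("final loss:", np.sum((W2 @ W1 @ X - Y) ** 2))
```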
In the upcoming section, we will exhibit how these goals can be approached
with techniques from algebraic or tropical geometry.
In the linear setting, the function space MΦ consists of matrices W of size
(dL , d0 ) that can be factorized as WL · · · W1 according to Definition 3.1. It can
be easily verified that MΦ = Mr for r = min(d0 , . . . , dL ).
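As a quick numerical sanity check of this description (a NumPy sketch with arbitrary layer widths, not part of the original text), the end-to-end matrix of a linear network has rank at most min(d0, . . . , dL):

```python
import numpy as np

rng = np.random.default_rng(2)
d = [5, 2, 6, 4]                                   # widths d0, d1, d2, dL; bottleneck d1 = 2
Ws = [rng.normal(size=(d[i + 1], d[i])) for i in range(len(d) - 1)]

W = Ws[-1]
for Wi in reversed(Ws[:-1]):
    W = W @ Wi                                     # W = W3 W2 W1, a dL x d0 matrix

print(np.linalg.matrix_rank(W), "<=", min(d))      # rank is bounded by the narrowest layer
```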
Remark 3.3. The Eckart-Young Theorem tells us that the ED degree of the
determinantal variety Mr is $\binom{m}{r}$, where m = min(d0, dL). Furthermore, it can
be shown that the singular locus of Mr is precisely Mr−1 ⊂ Mr.
The square loss: Given data $X \in \mathbb{R}^{d_0 \times N}$ and $Y \in \mathbb{R}^{d_L \times N}$ (where N is the
number of data samples), and assuming that $XX^T$ has full rank, we write the
quadratic loss as
$$
\begin{aligned}
l(W) = l_{X,Y}(W) &= \|WX - Y\|^2 = \langle Y, Y\rangle - 2\langle WX, Y\rangle + \langle WX, WX\rangle\\
&= \mathrm{const.} - 2\,\bigl\langle W(XX^T),\, YX^T(XX^T)^{-1}\bigr\rangle + \langle WX, WX\rangle\\
&= \mathrm{const.} - 2\,\langle W, U\rangle_{XX^T} + \langle W, W\rangle_{XX^T}\\
&= \mathrm{const.} + \|W - U\|^2_{XX^T},
\end{aligned}
$$
where U is the unconstrained optimizer of this quadratic problem over the set
of all matrices:
$$
U = \operatorname*{argmin}_{V \in \mathbb{R}^{d_L \times d_0}} \|VX - Y\|^2 = YX^T(XX^T)^{-1}.
$$
Therefore, the optimization over the function space can be obtained by
$$
\min_{W \in \mathcal{M}_r} \|W - U\|^2_{XX^T}.
$$
Remark 3.4. Although working with the weighted Frobenius norm ||.||XX T may
seem inconvenient, keep in mind that if our samples are independent and
identically distributed (i.i.d.) Gaussian random variables, then E[XX T ] is a
scalar multiple of the identity matrix.
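The computation above can be checked numerically; the sketch below (arbitrary random data, not part of the original text) builds U = Y X^T (XX^T)^{-1} and verifies that the loss of any matrix W equals a W-independent constant plus the weighted distance to U:

```python
import numpy as np

rng = np.random.default_rng(3)
d0, dL, N = 4, 3, 100
X = rng.normal(size=(d0, N))
Y = rng.normal(size=(dL, N))

XXt = X @ X.T
U = Y @ X.T @ np.linalg.inv(XXt)         # unconstrained minimizer U = Y X^T (X X^T)^{-1}

def loss(W):
    return np.sum((W @ X - Y) ** 2)      # square loss ||W X - Y||^2

def weighted_dist_sq(W):
    D = W - U
    return np.trace(D @ XXt @ D.T)       # ||W - U||^2_{XX^T}

W = rng.normal(size=(dL, d0))            # an arbitrary test matrix
const = loss(U)                          # the constant term, independent of W
print(np.isclose(loss(W), const + weighted_dist_sq(W)))   # True up to rounding
```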
$$
L : \mathbb{R}^{d_\theta} \xrightarrow{\ \mu\ } \mathcal{M}_\Phi \xrightarrow{\ l|_{\mathcal{M}_\Phi}\ } \mathbb{R},
\qquad
(W_L, \dots, W_1) \longmapsto W := W_L \cdots W_1 \longmapsto \|WX - Y\|^2.
$$
A machine learning algorithm like gradient descent aims to find critical
points of L in parameter space. However, the meaningful critical points (for
instance, the best function explaining the data) live in the function space M
and are critical points of l|M . Hence, [TKB19] distinguishes between pure criti-
cal points of L (those that actually come from critical points of l|M ) and spurious
critical points (that are only caused by the parametrization map µ); see Figure
5.
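The following NumPy sketch illustrates how the parametrization alone can create a critical point (random data and a width-1 bottleneck chosen arbitrarily; this is an illustrative computation, not an example reproduced from [TKB19]): at θ = (W1, W2) = (0, 0) both parameter gradients of L vanish, even though the unconstrained gradient of l at the corresponding function W = 0 does not.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(3, 20))
Y = rng.normal(size=(2, 20))

# Two-layer linear network with a width-1 bottleneck: W = W2 W1, W1 is 1x3, W2 is 2x1.
W1 = np.zeros((1, 3))
W2 = np.zeros((2, 1))

R = W2 @ W1 @ X - Y                      # residual at theta = (0, 0)
grad_W1 = 2 * W2.T @ R @ X.T             # vanishes because W2 = 0
grad_W2 = 2 * R @ (W1 @ X).T             # vanishes because W1 = 0
grad_l = 2 * ((W2 @ W1) @ X - Y) @ X.T   # gradient of l at W = 0, generically nonzero

print(np.linalg.norm(grad_W1), np.linalg.norm(grad_W2))   # both 0: critical point of L
print(np.linalg.norm(grad_l))                             # nonzero: created by the map mu
```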
Figure 5: Pure and spurious critical points: θ1 is a pure critical point, while θ2
is a spurious critical point (the level curves on the manifold MΦ describe the
landscape in function space). Note that θ3 is mapped to the same function as
θ2, but it is not a critical point; image from [TKB19].
see [CLC22, Lemma 2.3] or [ACGH18, BRTW22].
The discussion so far has been restricted to fully-connected / dense linear networks,
where every neuron in a layer is connected to all neurons in the next layer and all
weights are independent of each other. There are other types of architectures
that play an important role in practice, e.g., convolutional networks. A first
geometric study of the expressivity, loss landscape, and algebraic invariants of
gradient flow in linear convolutional networks is [KMMT22].
a machine learning-based method is proposed for approximating the real discriminant
locus of parameterized polynomial systems, with applications to equilibria
of dynamical systems and scene reconstruction. Similarly, [BLOS22] and
[BHH+22] demonstrate the effectiveness of machine learning techniques in
predicting the number of real circles tangent to three conics and the geometric
properties of Hilbert series, respectively. In [DLQ22], new machine learning methods
are presented for efficiently computing numerical Calabi-Yau metrics. Lastly,
[CHLM21] shows that machine learning can significantly speed up the computation
of tensor products and branching rules of irreducible representations of
Lie algebras, which are important for analyzing symmetry in physical systems.
Figure 6: The zero sets of two polynomials from different chambers may exhibit
distinct topological properties; this is not the case for polynomials selected from
within the same chamber.
Figure 8: Labeling of random samples of degree 2 polynomials based on the
number of real roots.
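One way such labels can be generated (a sketch assuming quadratics ax² + bx + c with coefficients drawn uniformly from [−1, 1]; the sampling distribution used for the figure is not specified here) is to read the number of real roots off the sign of the discriminant:

```python
import numpy as np

rng = np.random.default_rng(5)
coeffs = rng.uniform(-1, 1, size=(10000, 3))           # random (a, b, c) for a x^2 + b x + c

def n_real_roots(a, b, c):
    """Number of real roots of a quadratic, read off from the discriminant b^2 - 4ac."""
    disc = b * b - 4 * a * c
    if disc > 0:
        return 2
    if disc == 0:
        return 1
    return 0

labels = np.array([n_real_roots(*p) for p in coeffs])
print(np.bincount(labels, minlength=3))                # class counts for 0, 1, 2 real roots
```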
Figure 9: Confusion matrix for predicting the number of real roots for degree 2
polynomials.
Figure 10: Confusion matrix for predicting the number of real roots for degree
3 polynomials.
Figure 11: Confusion matrix for predicting the number of real roots for degree
2 polynomials.
Acknowledgement
I would like to express my gratitude to Professor Kathlén Kohn for her valu-
able insights and ideas, as well as for her help in writing some parts of this
lecture note. Her guidance and support have been instrumental in shaping my
understanding of algebraic geometry and machine learning.
I would also like to thank my friend Björn Wehlin for his contributions in
producing the results for my research question, using his expertise in Python
programming. His assistance has been crucial in making this lecture note pos-
sible.
References
[Ch23] Chabacano. Overfitting. https://en.wikipedia.org/wiki/
Overfitting, 2023. [Online; accessed March 18, 2023].
[ACGH18] Sanjeev Arora, Nadav Cohen, Noah Golowich, and Wei Hu. A
convergence analysis of gradient descent for deep linear neural net-
works. arXiv preprint arXiv:1810.02281, 2018.
[Aji23] Ajitesh Kumar. Computation via activation. https://vitalflux.
com/perceptron-explained-using-python-example/, 2023.
[Online; accessed March 18, 2023].
[Ant23] Antonio Lerario. Cobordism. https://drive.google.com/file/
d/1cwOm6S3M4FxtvUzVA9_-kFhBj2dx0WkX/view, 2023. [Online; ac-
cessed March 18, 2023].
[BH89] Pierre Baldi and Kurt Hornik. Neural networks and principal com-
ponent analysis: Learning from examples without local minima.
Neural networks, 2(1):53–58, 1989.
[BHH+ 22] Jiakang Bao, Yang-Hui He, Edward Hirst, Johannes Hofscheier,
Alexander Kasprzyk, and Suvajit Majumder. Hilbert series, ma-
chine learning, and applications to physics. Physics Letters B,
827:136966, 2022.
[BHM+ 20] Edgar A Bernal, Jonathan D Hauenstein, Dhagash Mehta, Mar-
garet H Regan, and Tingting Tang. Machine learning the real dis-
criminant locus. arXiv preprint arXiv:2006.14078, 2020.
[BLOS22] Paul Breiding, Julia Lindberg, Wern Juin Gabriel Ong, and Li-
nus Sommer. Real circles tangent to 3 conics. arXiv preprint
arXiv:2211.06876, 2022.
[BRTW22] Bubacarr Bah, Holger Rauhut, Ulrich Terstiege, and Michael West-
dickenberg. Learning deep linear neural networks: Riemannian gra-
dient flows and convergence to global minimizers. Information and
Inference: A Journal of the IMA, 11(1):307–353, 2022.
[CHLM21] Heng-Yu Chen, Yang-Hui He, Shailesh Lal, and Suvajit Majumder.
Machine learning Lie structures & applications to physics. Physics
Letters B, 817:136297, 2021.
[CLC22] Yacine Chitour, Zhenyu Liao, and Romain Couillet. A geometric
approach of gradient descent algorithms in linear neural networks.
Mathematical Control and Related Fields, pages 0–0, 2022.
[DLQ22] Michael Douglas, Subramanian Lakshminarasimhan, and Yidi Qi.
Numerical Calabi-Yau metrics from holomorphic networks. In Math-
ematical and Scientific Machine Learning, pages 223–252. PMLR,
2022.
[GL22] J Elisenda Grigsby and Kathryn Lindsey. On transversality of bent
hyperplane arrangements and the topological expressiveness of ReLU
neural networks. SIAM Journal on Applied Algebra and Geometry,
6(2):216–242, 2022.
[Gra23] Grace Zhang. Kernel. https://medium.com/@zxr.nju/
what-is-the-kernel-trick-why-is-it-important-98a98db0961d,
2023. [Online; accessed March 18, 2023].
[GRK20] Ingo Gühring, Mones Raslan, and Gitta Kutyniok. Expressivity of
deep neural networks. arXiv preprint arXiv:2007.04759, 2020.
[HKS22] Kathryn Heal, Avinash Kulkarni, and Emre Can Sertöz. Deep
learning Gauss-Manin connections. Advances in Applied Clifford
Algebras, 32(2):24, 2022.
[Kaw16] Kenji Kawaguchi. Deep learning without poor local minima. Ad-
vances in neural information processing systems, 29, 2016.
[KMMT22] Kathlén Kohn, Thomas Merkh, Guido Montúfar, and Matthew
Trager. Geometry of linear convolutional networks. SIAM Jour-
nal on Applied Algebra and Geometry, 6(3):368–406, 2022.
[KTB19] Joe Kileel, Matthew Trager, and Joan Bruna. On the expressive
power of deep polynomial neural networks. Advances in neural
information processing systems, 32, 2019.
[LB18] Thomas Laurent and James von Brecht. Deep linear networks with arbi-
trary loss: All local minima are global. In International conference
on machine learning, pages 2902–2907. PMLR, 2018.
[LLPS93] Moshe Leshno, Vladimir Ya Lin, Allan Pinkus, and Shimon
Schocken. Multilayer feedforward networks with a nonpolynomial
activation function can approximate any function. Neural networks,
6(6):861–867, 1993.
[LXT+ 18] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Gold-
stein. Visualizing the loss landscape of neural nets. Advances in
neural information processing systems, 31, 2018.
[MCTH21] Dhagash Mehta, Tianran Chen, Tingting Tang, and Jonathan D
Hauenstein. The loss surface of deep linear networks viewed through
the algebraic geometry lens. IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence, 44(9):5664–5680, 2021.
[MPTV06] Bernard Mourrain, Nicos G Pavlidis, Dimitris K Tasoulis, and
Michael N Vrahatis. Determining the number of real roots of poly-
nomials through neural networks. Computers & Mathematics with
Applications, 51(3-4):527–536, 2006.
[MRZ22] Guido Montúfar, Yue Ren, and Leon Zhang. Sharp bounds for the
number of regions of maxout networks and vertices of Minkowski
sums. SIAM Journal on Applied Algebra and Geometry, 6(4):618–
649, 2022.
[NRT21] Gabin Maxime Nguegnang, Holger Rauhut, and Ulrich Terstiege.
Convergence of gradient descent for learning linear neural networks.
arXiv preprint arXiv:2108.02040, 2021.
[Saz06] Murat H Sazli. A brief review of feed-forward neural networks.
Communications Faculty of Sciences University of Ankara Series
A2-A3 Physical Sciences and Engineering, 50(01), 2006.
[SV08] Alex Smola and SVN Vishwanathan. Introduction to machine learn-
ing. Cambridge University, UK, 32(34):2008, 2008.
[TFHV21] Salma Tarmoun, Guilherme Franca, Benjamin D Haeffele, and Rene
Vidal. Understanding the dynamics of gradient flow in overparam-
eterized linear models. In International Conference on Machine
Learning, pages 10153–10161. PMLR, 2021.
[TKB19] Matthew Trager, Kathlén Kohn, and Joan Bruna. Pure and spu-
rious critical points: a geometric study of linear networks. arXiv
preprint arXiv:1910.01671, 2019.