The Role of Matrices and Derivatives in the Mathematical Structure of Neural Networks
Research Question: “To what extent do Matrix Operations and Partial Derivatives underpin the mathematical structure and training of Neural Networks?”
Table of Contents
- Title Page
- Introduction
- SVD and PCA’s Part in Neural Networks, Convolutional Neural Networks (CNN)
- Conclusion
- Appendices
Introduction:
AI has been on a massive, almost exponential growth curve over the past 5 years. As a person who was always into technology, I was deeply impressed by and interested in this type of technology. While I do enjoy using it, I was also interested in what goes on behind it: why does it work, how does it work, and how could I learn to make something like it myself?
I chose this topic to explore the mathematics and complexities behind the AI models which are very popular today. To start off, the core idea of "AI", Artificial Intelligence, is "Machine Learning": we humans put tremendous amounts of data into a model and try to make it "learn" what each word is, what each object is, what everything looks like, and then eventually make it output a result it thinks suits the input.
It's not as simple as that, however. We want to make it actually intelligent, as the name suggests, so we use many, many logical operations from both mathematics and computer science to develop this, and potentially implement it in real-life robotics, which additionally uses Physics.
While the mathematics and computer science that go behind these AI or ML models are vast, they can briefly be summarized in a couple of topics in both fields, which include, but are not limited to: Linear Algebra, Calculus (single and multivariable), Probability and Statistics, and Neural Networks.
In this essay, I will aim to cover the applications of how and why “Linear Algebra” and “Calculus” work here, in several different neural networks and even combinations of them. In particular, the following topics will come up throughout:
- Eigenvalues, Eigenvectors
- Linear Transformations/Operators
- Partial Derivatives
- Gradients
If you are unaware of these topics, please go to the appendix at the end of the essay, where every important topic is explained clearly. It is not part of the main essay and is purely supplementary.
Special Topics
There are a handful of specialized topics under Linear Algebra and Calculus, specifically and mainly used for machine learning, that the reader may not know; therefore, the essay will cover them first.
Linear Algebra
- Singular Value Decomposition (SVD)
- Principal Component Analysis (PCA)
Calculus
- Gradient Descent
Singular Value Decomposition is a method to decompose a matrix into three smaller matrices, which allow us to analyze and manipulate the original matrix in more manageable forms:
$A = U \Sigma V^T$
Where:
- $U$ is a matrix whose columns are the left singular vectors of $A$,
- $\Sigma$ is a diagonal matrix containing the singular values,
- $V$ is a matrix whose columns are the right singular vectors of $A$.
To calculate the right singular vectors, we need to calculate $A^T A$. To calculate the left singular vectors, we need to calculate $A A^T$. For a matrix that is symmetric, that is, if $A = A^T$, both $A^T A$ and $A A^T$ are the same; therefore, extra or separate calculations aren’t needed.
Let’s calculate $A^T A$ for an example matrix.
Now, to get the “singular values”, we must find the eigenvalues of this matrix; the singular values are the square roots of these eigenvalues.
If the reader wishes, they can multiply each matrix here to verify the answer, which will bring them back to the original matrix $A$.
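As a quick sanity check, here is a minimal NumPy sketch of the same decomposition (the matrix is an arbitrary stand-in, not the worked example above):

import numpy as np

# Arbitrary example matrix (a stand-in for the worked example above).
A = np.array([[3.0, 1.0],
              [1.0, 3.0]])

# Full SVD: A = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(A)

# The singular values are the square roots of the eigenvalues of A^T A.
print(S)
print(np.sqrt(np.linalg.eigvalsh(A.T @ A))[::-1])  # matches S

# Multiplying the three matrices back together recovers A.
print(np.allclose(U @ np.diag(S) @ Vt, A))  # True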
Disclaimer
Speaking from experience, calculating PCA fully manually, even with the help of calculators, is an extremely long and tedious process which can take hours, even for a small dataset. While I’ll go over the steps of solving an example carefully, I will need to skip demonstrating some of the repetitive arithmetic.
We have to first “standardize” the data, where we find the mean of each column and the standard deviation of each column. After we do so, we have to put each individual X value into a formula.
$\mu_1 = 1.81, \quad \mu_2 = 1.91$
$\sigma_1 = 0.785, \quad \sigma_2 = 0.846$
To calculate each “Z-score” and build a “Z matrix”, we must follow the formula:
$z = \frac{x - \mu}{\sigma}$
This formula tells us that, for just one column of Z, we have to take every single value of X, subtract $\mu$ from it, and divide the result by $\sigma$. I’m sure you can now see why I added the disclaimer, as this would require 10 calculations for just the first column, and 20 including the second.
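A minimal sketch of this standardization step in NumPy (the dataset is illustrative, not the original values from the example):

import numpy as np

# Illustrative 10x2 dataset (not the original data from the example).
X = np.random.default_rng(0).normal(loc=2.0, size=(10, 2))

mu = X.mean(axis=0)       # mean of each column
sigma = X.std(axis=0)     # standard deviation of each column
Z = (X - mu) / sigma      # all 20 Z-scores in one vectorized step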
To reduce dimensions, we want to pick the components with the highest variance (information, value). Since there are only two here, and $\lambda_1$ is the highest one with $\lambda_2$ being negligible, we will only keep the first principal component.
PCA doesn’t have an objective, one-way street to the correct answer; the correct answer depends on your goal.
Full dimensionality reduction? Go ahead and pick just one eigenvalue even if it’s a 20D matrix, who’s stopping you? Other than the fact that you’ll lose a lot of data unless every single other eigenvalue is negligible.
Optimal reduction? That’s what we just did. Take the few eigenvalues that have the highest variance, maintaining quality data without redundant noise and making it much more manageable.
Simple data compression? The dimensions don’t even need to be reduced for this: you can simply take the dot product of every single eigenvector you get, negligible or not, with the rows of the data, re-expressing it in the new basis.
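Pulling the PCA steps together, here is a compact NumPy sketch (standardize, eigendecompose the covariance matrix, keep the highest-variance component); the data is again illustrative:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))                   # illustrative dataset

Z = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize

C = np.cov(Z, rowvar=False)                    # covariance matrix

# Eigenvalues = variance per component, eigenvectors = principal axes.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]              # sort by descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

X_reduced = Z @ eigvecs[:, :1]                 # keep one component: 2D -> 1D
print(X_reduced.shape)                         # (10, 1)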
Gradient Descent
Gradient Descent is a very essential Calculus technique used in neural networks; it minimizes a function by repeatedly stepping against its gradient:
$x_{\text{new}} = x_{\text{old}} - \alpha \, f'(x_{\text{old}})$
This process continues, and you can see how $x$ gets closer to the minimum value (in this case, the point where its gradient is 0).
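Here is a minimal sketch of that update loop in Python, minimizing the illustrative function f(x) = x² (its true minimum is at x = 0):

# Gradient descent on f(x) = x^2, whose gradient is f'(x) = 2x.
def grad(x):
    return 2 * x

x = 5.0          # arbitrary starting point
alpha = 0.1      # learning rate
for step in range(50):
    x = x - alpha * grad(x)   # x_new = x_old - alpha * f'(x_old)

print(x)  # very close to the minimum at 0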
Now that we are done with the prerequisite topics, we can dive into the math behind several
neural networks:
A neural network is a machine learning model, inspired by the human brain itself, that recognizes patterns. It has three stages: an input layer, a hidden layer, and an output layer. After it receives an input, it will apply an activation function, some of which are called ReLU, Sigmoid, and Tanh, to determine how strongly to send information to the next layer; during training, it keeps backpropagating and adjusting its weights before settling on what it sends to the output layer. The hidden layer is where nearly everything happens, and there can be multiple hidden layers. We will now go through one of them.
Let’s consider a simple problem to give us insight into how the computer does the math in its hidden layers, as we do it ourselves. It is more intuitive to understand what the math achieves when we work through it by hand.
Now for the vital part: using gradient descent, we backpropagate to adjust the weights.
This can repeat for many iterations to continuously lower the loss function and bring the prediction $\hat{y}$ closer to the truth label $y$. This example here is a binary classifier, which classifies things between 1 and 0. Let’s assume this is in the context of emails, where 1 is spam and 0 is not spam.
Given many emails at truth label 1, spam, at the first email it will first assign a random weight to each feature (an input value, which can be a word, phrase, or anything else), and then do this entire mathematical process. At first, $\hat{y}$ won’t get close to $y$, so it will keep updating the weights like we did here in this example, and then, after $\hat{y}$ has finally come close to $y$, it will notice which weights associated with each feature were increased drastically during this process, because those features are spam-likely features.
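To make that loop concrete, here is a minimal sketch of a single-example binary classifier (forward pass with a sigmoid, then gradient descent on the weights); the feature values are illustrative:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.5, 2.0])                 # illustrative feature values
y = 1.0                                        # truth label: spam
w = np.random.default_rng(0).normal(size=3)    # random initial weights
b = 0.0
alpha = 0.5                                    # learning rate

for step in range(100):
    y_hat = sigmoid(w @ x + b)    # forward pass: predicted probability
    # For cross-entropy loss with a sigmoid, dL/dz = y_hat - y.
    w -= alpha * (y_hat - y) * x  # adjust weights by gradient descent
    b -= alpha * (y_hat - y)

print(sigmoid(w @ x + b))  # now very close to the truth label 1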
In the same scenario with a truth label of 0, it will see which weights have decreased, and associate those features with non-spam emails.
At the 2nd, 3rd, and so on emails, from all the processes and emails it has done prior, it will learn which features often get higher weights and which often get lower weights. After a long amount of training, it will come to a consensus on the weight values.
When this is done, things such as truth labels won’t be there anymore, and it will skip any of the training processes we did here and simply do the forward calculation of $\hat{y}$. After that, it classifies the email directly.
You might have noticed we never explicitly used PCA or SVD in the process of the hidden layers itself; however, the data a model has to process can be extremely vast, with maybe even millions of values and weight values. SVD and PCA can help us compress the data before we enter it into the training model, to save storage and speed up training.
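As an illustration of that compression, a truncated SVD keeps only the largest singular values; a minimal NumPy sketch with illustrative sizes:

import numpy as np

X = np.random.default_rng(2).normal(size=(100, 50))   # illustrative data

U, S, Vt = np.linalg.svd(X, full_matrices=False)

k = 10                              # keep only the top-k singular values
X_compressed = U[:, :k] * S[:k]     # 100x10 representation instead of 100x50
X_approx = X_compressed @ Vt[:k]    # best rank-k approximation of X
print(X_compressed.shape)           # (100, 10)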
Convolutional Neural Networks (CNN)
The CNN is a different type of neural network from the binary classifier we just built; it’s primarily used for image processing and classification. It can work either just like the binary classifier, or classify between several classes, as we’ll see.
It works differently with grayscale and RGB (color) images. Let’s go through an example of each.
Grayscale
Input Matrix: This represents a portion of the image, with each value (1 to 9) simulating a pixel’s intensity.
As for RGB, the image instead has three such matrices, one per color channel (red, green, blue).
What is a Filter?
The filter matrix in a CNN is a small matrix of weights, and it detects patterns or features in the input image that it has learnt from previously. These include edges, textures, and shapes, and there can be multiple variants of each type. A filter to detect edges, for example, will hold different weights from one that detects textures.
Let’s go through an example of training a grayscale filter to check for a feature (let’s say, edges).
Without the bias term, the neural network would be restricted to only producing outputs
that pass through the origin (0, 0), limiting the network’s ability to fit data where the
relationship doesn’t start at zero. Bias terms shift the activation function horizontally,
allowing non-zero outputs even when the input values are zero. For example, in image
recognition, a bias term can help a CNN recognize features in low-light or low-intensity
areas of an image that might otherwise be ignored, improving the network’s flexibility in fitting real data.
Convolution
Now to do the convolution, we will have to break down the 3x3 input into four overlapping 2x2 regions and apply this formula to each region:
$\text{output}_{i,j} = \sum_{m=0}^{1} \sum_{n=0}^{1} \text{input}_{i+m,\, j+n} \cdot \text{filter}_{m,n} + b$
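A minimal sketch of this sliding-window computation in NumPy (the 2x2 filter values are illustrative stand-ins for trained weights):

import numpy as np

image = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])        # the 3x3 input with values 1 to 9
filt = np.array([[1, 0],
                 [0, -1]])           # illustrative 2x2 filter weights
bias = 0.0

out = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        region = image[i:i+2, j:j+2]               # one 2x2 window
        out[i, j] = (region * filt).sum() + bias   # elementwise product, summed

print(out)  # the 2x2 feature map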
A truth label in a regular binary classifier simply allows the model to keep adjusting the weights until its output probability is high enough to reach it. It doesn’t say anything about the weights themselves.
That means we already know, or can just… set, the weights of these edges to be the target values?
To answer:
The trainer puts in an image with edges, and they already know this image has edges. Therefore, they put in a target matrix of the values they would want the filter’s output to take.
The filter, instead of memorizing the target matrix, which wouldn’t let it adapt to new images, learns weights that reproduce the target’s behavior.
Throughout the backpropagation, it will learn patterns of not just that one place but also the entire image. It will see how the edges themselves, or things inside the image, differ when calculating the weights for an edge, and after it learns from its first example in the training, for the second example the model receives, it will not assign random weights. While it will still obviously be wrong, it assigns weights based on the patterns it recognized from the last image. After performing more training examples, its weights gradually converge.
Now, to adjust everything and get ready for backpropagation, we need to calculate the gradient of the loss with respect to the filter weights and the bias. For a mean squared error loss $L = \frac{1}{n}\sum_{k}(\text{output}_k - \text{target}_k)^2$:
To calculate each $\frac{\partial L}{\partial \text{output}_k}$ individually, we have to ignore the $\frac{1}{n}$ and the summation initially and use the power rule to differentiate with respect to each output, giving $2(\text{output}_k - \text{target}_k)$. Then, after we calculate all the gradients, we can sum them up and average them. We don’t need to do that here, since the example is small.
Since the bias contributes equally to all elements of the output, the gradient of the loss with respect to the bias is the sum of the gradients of the loss with respect to each output element.
Great, now that we completed training for a lot of filters, how can we apply this to
the input image and how does the rest of the CNN work?
Assuming that we have now trained some filters to detect features of an image, let’s go back to the input of the CNN and take an example grayscale matrix with two filters:
In practical scenarios, images are typically in RGB, with three channels (three matrices, one per channel), and have several filters. But for simplicity, we will go through a small grayscale example.
What both the filters in the example do is arbitrary; one could be for edge detection and the other for texture detection.
Perform the convolution calculations for each filter on each region (like how we broke down the input into 2x2 regions earlier):
To introduce non-linearity (function graphs are fully straight if they are linear, and curved otherwise; real-life images are highly non-linear, so we need to introduce non-linearity), we’ll apply the ReLU (Rectified Linear Unit) activation function, $\text{ReLU}(x) = \max(0, x)$, to each feature map. ReLU replaces all negative values with zero, helping the network model non-linear patterns.
By making negative values, which have extremely low intensity, 0, it allows us to focus on the strongest activations.
Pooling is a process that reduces the spatial size of the feature maps (the output matrices from the convolution). Its purpose is to make sure we only focus on the most prominent features while reducing computation.
Since our feature maps are already only 2x2, we can just take the single highest value. But for larger feature maps, we would need to slide a 2x2 window over them and take the maximum in each window.
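A minimal sketch of ReLU followed by max pooling on one 2x2 feature map (values illustrative):

import numpy as np

feature_map = np.array([[ 4.0, -2.0],
                        [-1.0,  3.0]])   # illustrative convolution output

relu = np.maximum(feature_map, 0.0)      # negative values become 0
pooled = relu.max()                      # 2x2 map -> single highest value

print(relu)    # [[4. 0.] [0. 3.]]
print(pooled)  # 4.0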
Next, define weights and biases for the “Fully Connected Layer”:
Assuming we have three classes: cat, dog, and bird. Each class will have a neuron in
the fully connected layer, so we need to calculate three scores based on the input
vector. To do this, we’ll define weights and biases for each class.
Now, let’s calculate the score (output) for each class neuron by applying the weights and biases to the flattened input vector:
$z_c = \vec{w}_c \cdot \vec{x} + b_c$ for each class $c \in \{\text{cat}, \text{dog}, \text{bird}\}$, and then convert the three scores into probabilities with softmax.
Interpretation
The CNN would predict “bird” for this input, as it has the highest probability (0.47).
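A minimal sketch of the scoring and softmax step (the weights, biases, and input vector are illustrative stand-ins; with these numbers, “bird” happens to win):

import numpy as np

x = np.array([4.0, 3.0])          # illustrative flattened/pooled input
W = np.array([[0.2, -0.1],        # cat weights
              [0.1,  0.2],        # dog weights
              [0.3,  0.3]])       # bird weights
b = np.array([0.1, 0.0, -0.2])    # one bias per class

scores = W @ x + b                              # one score per class neuron
probs = np.exp(scores) / np.exp(scores).sum()   # softmax -> probabilities

classes = ["cat", "dog", "bird"]
print(dict(zip(classes, probs.round(2))))
print(classes[int(np.argmax(probs))])           # "bird"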
Transformers
This is the big one that’s been exploding in the last 5 years. Transformer architectures are what produce text. While CNNs process spatial data, transformers process sequential data like sentences, and they process the whole sequence at once, making them really fast and efficient. All of the modern LLMs such as ChatGPT, Claude, and Gemini depend on this architecture, though there are different types of it.
I’ll mainly focus on the GPT architecture here, though other architectures exist as well.
First, we have to convert raw text into numerical form, which the model then can use in
computations.
1. Tokenization
A token can be a word, part of a word, a character, and so on. For a sentence like “The cat sat.”, the text is split into the tokens [“The”, “cat”, “sat”, “.”].
The next step is to convert each token to a specific ID. These are not random. Imagine you have a massive dictionary where every token has its own entry number:
"The" → 101
"cat" → 345
"sat" → 789
"." → 12
So, our sentence "The cat sat." becomes the sequence of IDs [101, 345, 789, 12].
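A toy sketch of that lookup in Python (the vocabulary and IDs mirror the example above):

# Toy vocabulary mapping each token to its ID.
vocab = {"The": 101, "cat": 345, "sat": 789, ".": 12}

tokens = ["The", "cat", "sat", "."]
ids = [vocab[t] for t in tokens]
print(ids)  # [101, 345, 789, 12]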
2. Embedding
Imagine you have that same massive dictionary again. Now, that dictionary will have many rows, where each row contains many columns, and each row represents one word. This way, each word becomes a vector: one per row.
Let’s assume a very simple embedding matrix where the only vocabulary it has is [“The”, “cat”, “sat”, “.”] and the embedding dimension is 3. Real models have hundreds of thousands of rows, with potentially thousands of columns.
So now, each of our four tokens maps to its own row of three numbers.
This embedding matrix isn’t just random, though. We also train it traditionally, with random weights, a forward pass, a loss, and gradient descent. But the objective this time is to predict which word comes next, such as “The cat sat on the ___” where ___ is “mat”. Training the matrix makes it so similar words end up next to each other, like “cat” and “dog”, whereas words like “cat” and “quantum” will be really far apart in the embedding space.
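A minimal sketch of the lookup, with a made-up 4x3 embedding matrix standing in for trained values:

import numpy as np

# Illustrative embedding matrix: one 3-dimensional row per token.
E = np.array([[0.1, 0.3, 0.2],   # "The"
              [0.9, 0.1, 0.4],   # "cat"
              [0.2, 0.8, 0.7],   # "sat"
              [0.0, 0.1, 0.0]])  # "."

token_ids = [0, 1, 2, 3]     # row indices for "The cat sat ."
embeddings = E[token_ids]    # each ID simply selects its row
print(embeddings.shape)      # (4, 3)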
Positional Encoding
While the transformer can see every word of a sentence at once, it doesn’t know which word comes first. It needs a way to differentiate “The cat sat.” and “Sat the cat.”
So our example has three words, at positions 0, 1, 2, and since each embedding is 3-dimensional, each encoding here will also have three values; later, we add each encoding to its word’s embedding.
Formula:
$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$
$2i$ and $2i+1$ are not formulas you substitute numbers into. They just say “use the sine version when the dimension index is even, and the cosine version when it is odd.”
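A minimal sketch that computes these encodings for our three positions with d = 3, which is the kind of table the worked example would produce:

import numpy as np

d = 3                        # embedding dimension
PE = np.zeros((3, d))        # one row per position 0, 1, 2

for pos in range(3):
    for i in range(d):
        angle = pos / (10000 ** (2 * (i // 2) / d))
        # even dimension index -> sine, odd -> cosine
        PE[pos, i] = np.sin(angle) if i % 2 == 0 else np.cos(angle)

print(PE.round(3))
# Each row is then added to the corresponding word's embedding.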
Self-Attention
In self-attention, we want to perform dot products on the entire 3-dimensional encoding of every word using three weight matrices: $W_Q$ (query), $W_K$ (key), and $W_V$ (value).
These are not random either. They are also trained traditionally using backpropagation.
In training $W_Q$, $W_K$, $W_V$, we follow all the steps we have done so far, and then assign random values to these weight matrices. It will usually have a goal, like predicting the next word in “The cat sat on ___” where “___” is “mat”. This serves as the equivalent of a truth label.
Since it’s obviously random at first, it will put out a nonsensical word, probably something like “sun”, then calculate the loss and backpropagate. This is repeated thousands or millions of times to build the three weight matrices in training. These three, after training, are the same for every sentence, and their size depends on how large the encoding is.
Sophisticated modern models use multiple attention heads, such as one for grammar rules, another for something else, and so on. Here, we’re only using one.
We’ll assume trained 3x3 matrices for $W_Q$, $W_K$, $W_V$ and apply them to our example:
Since the dot products can get quite large, we divide them by the square root of the dimension, $\sqrt{d_k}$, before applying softmax; in full: $\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$.
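A minimal sketch of single-head scaled dot-product attention on our three 3-dimensional encodings (every matrix here is an illustrative stand-in for trained values):

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(3, 3))        # encodings for "The", "cat", "sat"
W_Q, W_K, W_V = (rng.normal(size=(3, 3)) for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V

scores = Q @ K.T / np.sqrt(3)                    # scaled dot products
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)    # softmax over each row

context = weights @ V     # a context-aware vector for every word
print(context.shape)      # (3, 3)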
Now that the model is context-aware for every word, it will try to generate the one next word which it thinks fits the response. It’s inapplicable here, since our embedding matrix and $W_Q$, $W_K$, $W_V$ have only three dimensions, for only three words “The” “cat” “sat”; however, if we assume that we redid this calculation with a more realistic embedding matrix and $W_Q$, $W_K$, $W_V$ matrices with more vocabulary, it could try predicting what the next word would be. Assuming our $W_Q$, $W_K$, $W_V$ are already trained, it’s highly likely that its prediction will be correct: it is likely to predict “The” “cat” “sat” “on”. Now to predict the next word even after “on”, it will reconsider the entire sentence so far and do all the math from the start, like how we did for “The” “cat” “sat”, but now including “on”. After it gets context-aware vectors for every single word once again, it will try to predict the next word, such as “the” once more, then reconsider the entire sentence again, and so on.
*Refer to the “Extra Appendix” for the difference between training an Embedding Matrix and training the Attention Weight Matrices.
Conclusion
In conclusion, Linear Algebra concepts play a bigger role in the input and output layers because of concepts such as SVD and PCA; however, concepts like matrices, dot products, and vectors are still heavily used inside the hidden layers. Still, most of the hidden layers and backpropagation depend on Calculus. While Calculus is what trains the models, Linear Algebra is what runs them: Calculus is the fuel, and Linear Algebra is the engine.
This essay is the product of my own early learning, which is directly relevant to my intended career (ML engineer), and of my intention to teach its topics in a clear, simplified, and concise manner so that anyone interested can learn about them with ease.
Bibliography:
“SVD Algorithm Tutorial in Python.” Accel.AI, 17 Aug. 2022, https://www.accel.ai/anthology/2022/8/17/svd-algorithm-tutorial-in-python.
“Principal Components Analysis for Dimensionality Reduction in Python.” Machine Learning Mastery, https://machinelearningmastery.com/principal-components-analysis-for-dimensionality-reduction-in-python/.
Ng, Andrew, et al. “Unsupervised Feature Learning and Deep Learning Tutorial.” Stanford UFLDL, https://ufldl.stanford.edu/tutorial/unsupervised/PCAWhitening/.
“Architecture and Working of Transformers in Deep Learning.” GeeksforGeeks, 2023, https://www.geeksforgeeks.org/architecture-and-working-of-transformers-in-deep-learning/.
Brownlee, Jason. “The Transformer Model.” Machine Learning Mastery, 15 June 2020, https://machinelearningmastery.com/the-transformer-model/.
Acknowledgments:
Thanks to the “Auto-LaTeX Equations” extension in Google Docs, I was able to typeset every equation.
This was an out-of-syllabus topic using university-level concepts; as such, I obviously couldn’t learn it from formal education and had to teach myself these topics. Thanks to some online resources I found, such as the multivariable calculus worksheets from UC Berkeley and the linear algebra exercises by John M. Erdman, I was able to teach myself the mathematical concepts required; they provided a solid foundation for understanding the advanced material.
Additionally, I would like to thank my Math teachers for giving me additional feedback, and for encouraging me to include appendix descriptions of Linear Algebra and Multivariable Calculus so that any other peer can still understand this essay despite being in high school and not having covered these topics yet.
Appendices
Matrices
To start with the simple definition, matrices are simply a block of numbers in rows and columns, like this: $\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$. This here is a simple 2x2 matrix, which means it has 2 rows and 2 columns.
You can subtract or add matrices in a very simple way, provided they have the same size: you just add or subtract the matching entries.
$\begin{pmatrix} a & b \\ c & d \end{pmatrix} \pm \begin{pmatrix} e & f \\ g & h \end{pmatrix} = \begin{pmatrix} a \pm e & b \pm f \\ c \pm g & d \pm h \end{pmatrix}$
Matrix multiplication is quite complex when first looking at it, but I will explain it as we go through an example.
For matrix multiplication, they don’t necessarily need to be the same size; as long as the number of columns in the first matrix equals the number of rows in the second, they can be multiplied.
When you multiply two matrices, each entry of the product comes from one row of the first matrix and one column of the second: multiply the row’s elements by the column’s corresponding elements, then add those products together.
$\begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} e & f \\ g & h \end{pmatrix} = \begin{pmatrix} ae+bg & af+bh \\ ce+dg & cf+dh \end{pmatrix}$
Vectors
A vector is simply an n×1 or 1×n matrix, which is like an arrow from the origin in n-dimensional space $\mathbb{R}^n$. Vectors have both a direction and a magnitude.
Dot products are performed when multiplying two vectors to find a scalar value, which can give us the angle between them. Cross products can be performed on 3-dimensional vectors to find a special third vector that is perpendicular to the plane of the original two.
Cross products require cofactor expansions and determinants, which are more complex, and I’ll cover them below.
For two vectors to be “orthogonal” means they have a dot product of zero.
Determinants
The determinant is a number associated with a square matrix that tells us whether the matrix is invertible.
For a 2x2 matrix:
$\det\begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc$
A matrix is invertible if there exists another matrix such that multiplying them gives the identity matrix; this happens exactly when its determinant is non-zero.
Cofactor Expansions
When we want to find the determinant of a 3x3 matrix, like $M = \begin{pmatrix} a & b & c \\ d & e & f \\ g & h & i \end{pmatrix}$, we need to break it into smaller 2x2 matrices.
For each entry in the first row, we ignore that entry’s own row and column; the 2x2 matrix that remains is its minor. We then multiply each first-row entry by the determinant of its minor.
However, cofactor expansions follow a sign pattern, where it goes + - + - + -, so the 2nd cofactor is subtracted:
$\det M = a\det\begin{pmatrix} e & f \\ h & i \end{pmatrix} - b\det\begin{pmatrix} d & f \\ g & i \end{pmatrix} + c\det\begin{pmatrix} d & e \\ g & h \end{pmatrix}$
Cross products work in largely the same way as the cofactor expansion above did, but we’ll go over the setup:
$\vec{u} \times \vec{v} = \det\begin{pmatrix} \mathbf{i} & \mathbf{j} & \mathbf{k} \\ u_1 & u_2 & u_3 \\ v_1 & v_2 & v_3 \end{pmatrix}$
Essentially, we are putting each vector’s first, second, and third terms under the unit vectors $\mathbf{i}$, $\mathbf{j}$, $\mathbf{k}$. Take the coefficient of each unit vector as a coordinate of the cross product vector.
Magnitude of a Vector
The magnitude of a vector tells us how long it is, i.e., how far its tip is from the origin. In general, it can be used in any n-dimensional space, as long as it is computed like this:
$\|\vec{v}\| = \sqrt{v_1^2 + v_2^2 + \cdots + v_n^2}$
Dot Product: Measures how much two vectors align with each other. The result is a scalar that relates to the angle between them: $\vec{u} \cdot \vec{v} = \|\vec{u}\|\,\|\vec{v}\|\cos\theta$.
Cross Product: Results in a vector that is perpendicular to the two original vectors, and its magnitude equals the area of the parallelogram they span.
Examples:
$(1, 2, 3) \cdot (4, 5, 6) = 1 \cdot 4 + 2 \cdot 5 + 3 \cdot 6 = 32$
$(1, 0, 0) \times (0, 1, 0) = (0, 0, 1)$
A vector space is a collection of vectors that can be added and scaled (multiplied by scalars) while staying within the space. It forms the foundation for many linear algebra concepts.
1. Closure
If you add the vectors in the space or multiply them by any scalar, the result must also stay within the space.
2. Zero Vector
The set of the vector space must have a zero vector at some point.
3. Inverse Exists
For every vector $\vec{v}$ in the space, its additive inverse $-\vec{v}$ must also be in the space.
Eigenvalues and Eigenvectors
Usually, vectors have both a direction and a magnitude. When you multiply a vector by a matrix, it will generally change both quantities. However, there are some special vectors for each matrix that keep their direction and are only scaled, up or down. These special vectors are called eigenvectors, and their scaling factor is called the eigenvalue.
To start, we have to find the eigenvalues of a matrix. Let’s take an example matrix:
$A = \begin{pmatrix} 3 & 1 \\ 0 & 2 \end{pmatrix}$
We solve $\det(A - \lambda I) = 0$, where $\lambda$ (lambda) is a single number multiplying everything in the identity matrix $I$ (the matrix with 1s on the diagonal and 0s everywhere else). Here that gives $(3 - \lambda)(2 - \lambda) = 0$, so the eigenvalues are $\lambda = 3$ and $\lambda = 2$.
To find eigenvectors, we go back to where we subtracted matrix $A$ by $\lambda I$.
For eigenvalue $\lambda = 3$: $A - 3I = \begin{pmatrix} 0 & 1 \\ 0 & -1 \end{pmatrix}$
Now we multiply this by a vector with one entry per dimension, so for here, $(v_1, v_2)$.
Then we set the product equal to zero, which gives a system of equations. Our goal is not to find unique values of $v_1$ or $v_2$, but rather to express them in terms of each other and take their coefficients as the values of our eigenvector.
For here: the equations are $v_2 = 0$ and $-v_2 = 0$, so $v_2 = 0$.
Eigenvectors can’t be zero all over, and $v_1$ here can be any number since there’s no equation constraining it, so we can take $(1, 0)$ as the eigenvector.
For eigenvalue $\lambda = 2$: $A - 2I = \begin{pmatrix} 1 & 1 \\ 0 & 0 \end{pmatrix}$, which gives $v_1 + v_2 = 0$, so $v_2 = -v_1$ and the eigenvector is $(1, -1)$.
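A quick NumPy check of the example above:

import numpy as np

A = np.array([[3.0, 1.0],
              [0.0, 2.0]])

eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)   # [3. 2.]
print(eigvecs)   # columns are the eigenvectors (normalized), in the
                 # directions of (1, 0) and (1, -1)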
Partial Derivatives
Definition: A partial derivative represents the rate of change of a multivariable function with respect to one variable, holding the others constant. For $f(x, y)$, the partial derivative with respect to $x$ is written $\frac{\partial f}{\partial x}$.
Gradients
Definition: The gradient of a multivariable function $f(x, y, z)$ is a vector containing all partial derivatives with respect to each variable: $\nabla f = \left(\frac{\partial f}{\partial x},\; \frac{\partial f}{\partial y},\; \frac{\partial f}{\partial z}\right)$. It points in the direction of the steepest increase of $f$.
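A short worked example: for $f(x, y) = x^2 y$,
$\frac{\partial f}{\partial x} = 2xy, \qquad \frac{\partial f}{\partial y} = x^2, \qquad \nabla f = (2xy,\; x^2)$
At the point $(1, 2)$, $\nabla f = (4, 1)$, so moving in the direction $(4, 1)$ increases $f$ fastest.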
If you were unaware of any of these topics and have read this appendix to here, you should now be ready with all the prerequisite topics to read and comprehend the main essay.
Extra Appendix
1. What is the difference in training an embedding matrix versus the attention weights, if they are both trained the same way?
An embedding matrix doesn’t predict words with context. It has no idea of the context of the sentences either; all it does is look for common patterns, or things that have already been said in its training data that are similar to the sentence it needs to predict, to gather a sense of understanding of every word. Self-attention matrices can understand the differences between two words used in different meanings, i.e., “River Bank” or “Money Bank”, whereas an embedding matrix is only trying to gather every word first for the attention matrix to use in the first place: it’s trying to make itself a massive dictionary, and it wants all the related words next to each other. Since “cat” and “dog” are both animals and usually compared, they are likely to be very close to each other in the embedding matrix, whereas the word “car” would likely be further away from “cat”. This is so the model can quickly look up each word for the given context and hand the attention layers a meaningful starting point.
2. What is the learning rate?
The learning rate is a value we set by trial and error to train the model at a specific pace. Too high a pace potentially means the model attempts to train as fast as possible, which sounds good, but with such a high learning rate it is very likely to overshoot its updates and perhaps end up with an even bigger loss than the last adjustment. It’s like cramming for a test you didn’t study for and then just picking random MCQ options in the test.
However, a low learning rate allows the model to take more time and comprehend each feature more thoroughly, and lets the model reach a lower loss. The lower the learning rate, the lower the loss is likely to be after training, and the more accurate the model. However, a low learning rate can be extremely slow to train with, and introduces a much heavier computational load, as it needs to do many more update steps.
Therefore, we eventually trial-and-error with different values of $\alpha$ to find the most optimal learning rate: one which minimizes loss but at the same time finishes training as fast as possible.
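A minimal sketch of this trade-off, reusing the earlier gradient-descent loop on f(x) = x² with a few illustrative learning rates:

# Gradient descent on f(x) = x^2 (gradient 2x), starting from x = 5.
def run(alpha, steps=20):
    x = 5.0
    for _ in range(steps):
        x -= alpha * 2 * x
    return x

for alpha in (0.01, 0.1, 1.1):
    print(alpha, run(alpha))
# 0.01 -> still far from 0 (too slow)
# 0.1  -> close to 0 (a reasonable pace)
# 1.1  -> blows up; each step overshoots and the loss grows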