

The Role of Matrices and Derivatives in the Mathematical Structure of Neural Networks

Word Count: 3997

Research Question: “To what extent do Matrix Operations and Partial Derivatives

underpin the mathematical framework and functionality of neural networks?”

Subject: Mathematics, Analysis and Approaches, Higher Level



Table Of Contents

Title Page

Table of Contents

Introduction

What is SVD?

What is PCA?

Gradient Descent and Neural Networks

Neural Network - Forward Pass Binary Classifier

What the Neural Network Achieved

SVD and PCA's Part in Neural Networks, Convolutional Neural Networks (CNN)

Transformer Architecture

Conclusion

Bibliography, Acknowledgements

Appendices

Introduction:

AI has seen massive, almost exponential growth over the past five years. As someone who has always been into technology, I was deeply impressed by and interested in this kind of technology. While I do enjoy using it, I was also curious about what goes on behind it: why does it work, how does it work, and how could I learn to build something like this myself, or teach other people to build something like it?

I chose this topic to explore the mathematics and complexities behind the AI models that are so popular today. To start off, the core idea of "AI", Artificial Intelligence, is "Machine Learning": we humans feed tremendous amounts of data into a model and try to make it "learn" what each word means, what each object is, and what everything looks like, and then eventually make it output a word that it thinks suits the input.

It's not as simple as that, however. We want to make it actually intelligent, as the name suggests, so we use many logical operations from both mathematics and computer science to develop it, and potentially implement it in real-life robotics, which additionally uses Physics.

While the mathematics and computer science behind these AI or ML models is vast, it can briefly be summarized in a handful of topics from both fields, which include, but are not limited to: Linear Algebra, Calculus (single and multivariable), Probability and Statistics, Neural Networks, and coding in multiple different languages.

In this essay, I aim to cover mostly the applications of Linear Algebra and Calculus, how and why they work, in several different neural networks and even combinations of them.

These topics are required and used in this essay:

Topics of Linear Algebra In The Essay:

- Matrices, Matrix Operations

- Vectors, Vector Spaces

- Determinants, Cofactor Expansions

- Eigenvalues, Eigenvectors

- Linear Transformations/Operators

Topics of Calculus In This Essay:

- Partial Derivatives

- Partial Derivatives with Chain Rule

- Gradients

If you are unaware of these topics, please go to the appendix at the end of the essay, where every important topic is explained clearly. It is not part of the main essay and is only provided to help a peer reader follow along.



Special Topics

There are a handful of specialized topics under Linear Algebra and Calculus, used specifically and mainly for machine learning, that the reader may not know. The essay will therefore give a basic explanation of these prerequisite topics before it goes into the real research.

Linear Algebra

- Singular Value Decomposition (SVD)

- Principal Component Analysis (PCA)

Calculus

- Gradient Descent

The fundamental topics before we begin - What is SVD?

Singular Value Decomposition is a method of decomposing a matrix into three component matrices, which allows us to analyze and manipulate the original matrix in more manageable forms:

A = U Σ Vᵀ

Where U contains the left singular vectors, Σ is a diagonal matrix of the singular values, and Vᵀ contains the right singular vectors.

How do you calculate each component?

Let's go through an example as we explain.

To calculate the right singular vectors, we need the eigenvectors of AᵀA. To calculate the left singular vectors, we need the eigenvectors of AAᵀ. For a matrix that's symmetric, that is, if A = Aᵀ, both AᵀA and AAᵀ are the same, so no extra or separate calculation is needed.

Let's calculate AᵀA.

Now, to get the "singular values", we must find the eigenvalues of this matrix: the singular values are the square roots of these eigenvalues, and they fill the diagonal of Σ.

If the reader wishes, they can multiply the three matrices back together to verify the answer, which will bring them back to the original matrix A.
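To see these pieces concretely, here is a small sketch in Python using NumPy; the matrix is an arbitrary symmetric example of my own, not the one from the worked calculation:

```python
import numpy as np

# An arbitrary symmetric example matrix (not the one worked through above).
A = np.array([[3.0, 1.0],
              [1.0, 3.0]])

# NumPy returns U, the singular values, and V transposed.
U, S, Vt = np.linalg.svd(A)
print("Left singular vectors (U):\n", U)
print("Singular values (diagonal of Sigma):", S)
print("Right singular vectors (V^T):\n", Vt)

# The singular values are the square roots of the eigenvalues of A^T A.
eigvals = np.linalg.eigvalsh(A.T @ A)
print("sqrt of eigenvalues of A^T A:", np.sqrt(eigvals)[::-1])

# Multiplying the three factors back together recovers A.
print("U @ diag(S) @ V^T:\n", U @ np.diag(S) @ Vt)
```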

Principal Component Analysis

Principal Component Analysis is a decomposition method used to reduce the number of variables (dimensions) in a dataset while retaining most of the variance (information).

Disclaimer

Speaking from experience, calculating PCA fully by hand, even with the help of calculators, is an extremely long and tedious process that can take hours, even for a small dataset. While I'll go over the steps of solving an example carefully, I will skip a lot of the tedious intermediate steps, which are too long to show here.

How do you calculate it?

I will explore the case of optimal dimensionality reduction first.


We first have to "standardize" the data: we find the mean of each column and the standard deviation of each column. After we do so, we put each individual X value into a formula along with the mean and standard deviation we obtained.

μ₁ = 1.81, μ₂ = 1.91

σ₁ = 0.785, σ₂ = 0.846

To calculate each "Z-score" and build a "Z matrix", we follow the formula:

Z = (X − μ) / σ

This formula tells us that, for just one column of Z, we take every single value of X, subtract μ from it, and divide by σ. I'm sure you can now see why I added the disclaimer, as this would require 10 calculations for just the first column, and 20 for the second.


To reduce dimensions, we want to keep the components with the highest variance (information). Since there are only two here, and λ₁ is the largest with λ₂ being negligible, we will only find the eigenvector (principal component) of λ₁.



The flexibility of PCA

Notice how I said “only one case”?

PCA doesn't have a single objectively correct answer; the right answer depends on what you're looking for.

Full dimensionality reduction? Go ahead and pick just one eigenvalue even if it's a 20-dimensional dataset, who's stopping you? Other than the fact that you'll lose a lot of data unless every single other eigenvalue is actually small.

Optimal reduction? That's what we just did: take the few eigenvalues with the highest variance, keeping the quality data without redundant noise while making the dataset much more manageable.

Simple data compression? The dimensions don't even need to be reduced for this: you can simply take the dot product of every eigenvector you get, negligible or not, with the data rows, and get a compressed representation of the same dimensionality.
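As a compact illustration of the whole pipeline (standardize, build the covariance matrix, take its eigenvalues and eigenvectors, and project onto the top component), here is a sketch in Python with NumPy; the data values are arbitrary placeholders, not the dataset used above:

```python
import numpy as np

# Arbitrary 2-column dataset (placeholder values, not the one used above).
X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])

# Step 1: standardize each column (z-scores).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized data.
C = np.cov(Z, rowvar=False)

# Step 3: eigenvalues (variance per component) and eigenvectors (principal components).
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]          # sort from largest to smallest variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: keep only the top component and project the data onto it.
top_component = eigvecs[:, :1]
reduced = Z @ top_component                # 2D data reduced to 1D

print("explained variance ratio:", eigvals / eigvals.sum())
print("reduced data:\n", reduced)
```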

Gradient Descent

Gradient Descent is an essential Calculus technique used in neural networks; it is responsible for how a machine actually records, or learns, information.

For an ordinary function, it is used to find a minimum point. Starting from some point xₙ, we repeatedly step against the gradient: xₙ₊₁ = xₙ − η f′(xₙ), where η is the learning rate. This process continues, and you can see how xₙ gets closer and closer to the minimum (in this case, the point where the gradient is 0).
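A minimal sketch of this update rule in Python, using f(x) = x² as my own example function, whose minimum is at x = 0:

```python
# Gradient descent on f(x) = x^2, whose derivative is f'(x) = 2x.
def grad(x):
    return 2 * x

x = 5.0            # arbitrary starting point
eta = 0.1          # learning rate

for step in range(50):
    x = x - eta * grad(x)   # x_{n+1} = x_n - eta * f'(x_n)

print(x)  # very close to 0, the minimum of f
```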

Now that we are done with the prerequisite topics, we can dive into the math behind several

neural networks:

First of all, what is a neural network?


A neural network is a machine learning model, inspired by the human brain itself, that recognizes patterns. It has three stages: an input layer, a hidden layer, and an output layer. After a layer receives an input, it applies an activation function, such as ReLU, Sigmoid, or Tanh, to decide how strongly that information is passed on to the next layer; during training, the network backpropagates and adjusts its weights before it settles on what to send to the output layer. The hidden layer is where nearly everything happens, and there can be multiple hidden layers. We will now go through one of the simplest neural networks:

The Simplest Neural Network - A Forward Pass


Let's consider a simple problem to give us insight into how the computer does the math in its hidden layers, by doing it ourselves. It is more intuitive to understand what the math achieves after we have done it than before.
Now the vital part: using gradient descent, backpropagate to adjust the weights:

Finally, what did doing all this achieve?

This process can be repeated for many iterations to continuously lower the loss function and bring the prediction ŷ closer to the truth label y. The example here is a binary classifier, which classifies things between 1 and 0. Let's assume this is in the context of emails, where 1 is spam and 0 is not spam.

Given many emails with truth label 1, spam, the model will, for the first email, assign a random weight to each feature (an input value, which can be a word, a phrase, or anything else) and then run through this entire mathematical process. At first, ŷ won't get close to y, so it will keep updating the weights like we did here in this example, and once ŷ has finally come close to y, it will notice which weights associated with which features got increased drastically during this process, because those are spam-likely features.
In the same scenario with truth label 0, it will see which weights have decreased, and those belong to non-spam-likely words.

For the 2nd, 3rd, and later emails, from all the processing it has done before, it will learn which features often end up with higher weights and which ones often end up with lower weights. After a long period of training, it will come to a consensus on the weight of nearly every feature that can appear in an email.

When this is done, there are no truth labels anymore; the model skips the weight-update steps we did here and simply performs the forward-pass calculation. After checking whether ŷ is closer to 1 or 0, it will rightfully classify the email as spam or not spam.
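To make this concrete, here is a minimal sketch of such a classifier in Python: a single layer with a sigmoid output, trained by gradient descent. The features, labels, and learning rate are made up for illustration and are not the numbers from the worked example above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy dataset: each row is one email's features, y is the truth label (1 = spam).
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])

w = np.zeros(3)     # one weight per feature
b = 0.0             # bias
eta = 0.5           # learning rate

for epoch in range(1000):
    y_hat = sigmoid(X @ w + b)          # forward pass
    error = y_hat - y                   # gradient of the cross-entropy loss w.r.t. the pre-activation
    w -= eta * (X.T @ error) / len(y)   # gradient descent update for the weights
    b -= eta * error.mean()             # and for the bias

print("learned weights:", w)
print("predictions:", sigmoid(X @ w + b).round(2))
```

After training, features that kept appearing in spam emails end up with larger weights, which is exactly the behaviour described above.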

*Refer to “Extra Appendix” for “Learning Rate”

What’s the role of SVD or PCA here?

You might have noticed that we never explicitly used PCA or SVD inside the hidden layers themselves. However, the data a model has to process can be extremely vast, with potentially millions of input values and weights. SVD and PCA can help us compress the data before we feed it into the training model, saving storage and speeding up the learning/training process.

Next - The Convolutional Neural Network(CNN)


The CNN is a different type of neural network from the binary classifier we just built, and it is primarily used for image processing and classification. It can work either just like the binary classifier (e.g., "cat or dog") or as a multi-class classifier (e.g., "cat, dog, bird, etc.").

It works differently with grayscale and RGB (color) images. Let's go through an example as I explain the concepts and the math.

Grayscale

Input Matrix: This represents a portion of the image, with each value (1 to 9) simulating

a pixel's intensity (in grayscale).

As for RGB, the image is separated into three color channels.


Now the same filter can be applied to each channel independently.

What is a Filter?

The filter matrix in a CNN is a small matrix of weights, and it detects patterns or features in the input image that it has learnt previously. These include edges, textures, and shapes, and there can be multiple variants of each type: a filter to detect vertical edges, another for horizontal or diagonal edges, or filters for each different shape or texture.

Let's go through an example of training a filter for grayscale to check for a feature (let's say edges here, for example):

Why do we use this Bias Term?



Without the bias term, the neural network would be restricted to only producing outputs

that pass through the origin (0, 0), limiting the network’s ability to fit data where the

relationship doesn’t start at zero. Bias terms shift the activation function horizontally,

allowing non-zero outputs even when the input values are zero. For example, in image

recognition, a bias term can help a CNN recognize features in low-light or low-intensity

areas of an image that might otherwise be ignored, improving the network’s flexibility in

detecting subtle patterns.

Convolution

Now, to do the convolution, we will have to break down the 3x3 input into four 2x2 regions.

To perform the convolution with the 2x2 filter, each region follows this formula: multiply the region element-wise with the filter and sum the results,

output = Σᵢ Σⱼ (regionᵢⱼ × filterᵢⱼ)
However, this can raise some questions.

A truth label in a regular binary classifier simply lets the model keep adjusting the weights until its output probability is high enough to match it. It doesn't say anything about the weights themselves.

But this formula here is directly related to the weights. If we already know the target output, doesn't that mean we already know, or can just… set, the weights for these edges?

To answer:

The trainer puts in an image with edges, and they already know this image has edges. Therefore, they provide a target matrix of the values they would want to assign to these edges.

The filter, instead of memorizing the target matrix, which wouldn't let it adapt to new images, will try to understand why that "place" is an edge.

Throughout the backpropagation, it will learn patterns not just of that one place but of the entire image. It will see how the edges themselves, and the things around them in the image, differ when it calculates the weights for an edge. After it learns from its first training example, it will not assign random weights to the second example it receives; while it will still obviously be wrong, it will have assigned weights based on the patterns it recognized from the last image. After more training and backpropagation, it will learn further patterns for differentiating these images.

To continue with the math:


Next, add the bias term b to every output:

Now, to get everything ready for backpropagation, we need to calculate the following gradients:

For simplicity's sake, we will only calculate ∂L/∂W (the gradient with respect to the filter weights) and ∂L/∂b (the gradient with respect to the bias).

Using the chain rule:

∂L/∂W = (∂L/∂output) × (∂output/∂W)

To calculate ∂L/∂output individually, we ignore the summation and the averaging factor initially and use the power rule to differentiate the squared error with respect to the output. Then, after we calculate all the gradients, we can sum them up and average them. We don't need to do that here, considering we are only doing one gradient.

As for ∂output/∂W, each output is just the sum of the region values multiplied by the weights, so its derivative with respect to a weight is the corresponding input value. So ∂L/∂W is the product of these two factors.

Now perform backpropagation:

W_new = W − η (∂L/∂W)

As for ∂L/∂b: since the bias contributes equally to all elements of the output, the gradient of the loss with respect to the bias is the sum of the gradients of the loss with respect to each output element where that bias was used.

Then we can use gradient descent to adjust it:

b_new = b − η (∂L/∂b)

Great, now that we have trained a lot of filters, how can we apply them to the input image, and how does the rest of the CNN work?

Assuming that we have now trained some filters to detect features of an image, let's go back to the input of the CNN and take an example grayscale matrix with two filters.

In practical scenarios, images are typically in RGB, with three channels (here, that would be three 3x3 matrices), and have several filters. But for simplicity, we will go through a small grayscale matrix with two filters.

Each value here can represent the intensity of a pixel.

What the two filters in this example do is arbitrary: one could be for edge detection, one could be for texture detection, etc.

Perform the convolution calculations for each filter on each region (like how we broke the earlier 3x3 down into four 2x2s):
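Here is a sketch of that sliding computation in Python; the pixel values, filter weights, and biases are arbitrary placeholders chosen for illustration:

```python
import numpy as np

# 3x3 grayscale input (arbitrary pixel intensities).
image = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]], dtype=float)

# Two 2x2 filters with arbitrary weights, plus a bias for each.
filters = [np.array([[ 1, -1],
                     [ 1, -1]], dtype=float),   # could act like a vertical-edge detector
           np.array([[ 1,  1],
                     [-1, -1]], dtype=float)]   # could act like a horizontal-edge detector
biases = [0.5, -0.5]

def convolve(img, filt, bias):
    """Slide the filter over every region of the image (stride 1)."""
    out_h = img.shape[0] - filt.shape[0] + 1
    out_w = img.shape[1] - filt.shape[1] + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            region = img[i:i + filt.shape[0], j:j + filt.shape[1]]   # one 2x2 patch
            out[i, j] = np.sum(region * filt) + bias                 # element-wise product, summed, plus bias
    return out

feature_maps = [convolve(image, f, b) for f, b in zip(filters, biases)]
for fm in feature_maps:
    print(fm)
```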



Next step: Apply an Activation Function

To introduce non-linearity (a linear function's graph is a straight line, while a non-linear one is not; real-life images are highly non-linear, so we need to introduce non-linearity), we'll apply the ReLU (Rectified Linear Unit) activation function to each feature map. ReLU replaces all negative values with zero, helping the network focus on important positive activations.

ReLU is defined as:

ReLU(x) = max(0, x)

By setting negative values, which have extremely low intensity, to 0, it allows us to focus only on the important features, reducing computational load.

Applying ReLU to our example:


Next step: Apply Pooling to the Feature Maps

Pooling is a process that reduces the spatial size of the feature maps (the output matrices from the convolution). Its purpose is to make sure we only focus on the most important details of each feature map.

Let's apply Max Pooling:

Since our feature maps are already only 2x2, we can just take the single highest value. For larger feature maps, we would slide a 2x2 window over them and take each window's maximum value. Here, we just get one:
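A short sketch of these two steps in Python, applied to a made-up 2x2 feature map:

```python
import numpy as np

# A made-up 2x2 feature map from the convolution stage.
feature_map = np.array([[ 3.0, -1.5],
                        [-0.5,  2.0]])

# ReLU: replace every negative value with zero.
activated = np.maximum(feature_map, 0)

# Max pooling: since the map is only 2x2, one 2x2 window covers it,
# so the pooled output is just its single largest value.
pooled = activated.max()

print(activated)   # [[3. 0.] [0. 2.]]
print(pooled)      # 3.0
```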


Next, create a flattened vector of all the pooled values:

Next, define weights and biases for the "Fully Connected Layer":

Assume we have three classes: cat, dog, and bird. Each class has a neuron in the fully connected layer, so we need to calculate three scores based on the input vector. To do this, we'll define weights and biases for each class.

Step 2: Calculate Scores for Each Class

Now, let’s calculate the score (output) for each class neuron by applying the weights

and biases to the flattened vector.


Step 3: Apply "Softmax" to get the probabilities

Softmax turns the three scores into probabilities that sum to 1:

softmax(zᵢ) = e^(zᵢ) / Σⱼ e^(zⱼ)

Interpretation

The CNN would predict “bird” for this input, as it has the highest probability (0.47).

These probabilities reflect the network’s confidence in each class.

Next Neural Network - Transformer Architecture

This is the big one that has been exploding in the last 5 years. Transformer architectures are what produce text. While a CNN processes spatial data, transformers process sequential data such as sentences, and they process all of it at once rather than one token at a time, which makes them really fast and efficient. All of the modern LLMs, such as ChatGPT, Claude, and Gemini, depend on this architecture, though there are different variants of it.

I'll mainly focus on the GPT architecture here; however, other architectures such as the original transformer, BERT, and T5 exist.

The GPT (Generative Pre-trained Transformer) architecture was developed by OpenAI and is the architecture that ChatGPT is built on.

Training and Usage of GPT

First, we have to convert raw text into numerical form, which the model then can use in

computations.

1. Tokenization

A token can be a word, part of a word, character, and so on. For a sentence like “The

cat sat.”, the tokenization would be [“The”, “cat”, “sat”, “.”]

The next step is to convert each token to a specific ID. These are not random: imagine you have a massive dictionary in which each word has a unique integer ID.

So let's say ["The", "cat", "sat", "."] becomes:

"The" → 101

"cat" → 345

"sat" → 789

"." → 12

So, our sentence "The cat sat." becomes the sequence of IDs [101, 345, 789, 12].
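In code, this mapping is essentially a dictionary lookup; a minimal sketch using the same toy IDs (the simplistic splitting rule here is my own, and real tokenizers are far more sophisticated):

```python
# A tiny vocabulary mapping each token to its integer ID (same toy IDs as above).
vocab = {"The": 101, "cat": 345, "sat": 789, ".": 12}

def tokenize(text):
    # Extremely simplified: split on spaces and treat the final "." as its own token.
    words = text.replace(".", " .").split()
    return [vocab[w] for w in words]

print(tokenize("The cat sat."))   # [101, 345, 789, 12]
```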
1.2 Embedding each token

Imagine you have that same massive dictionary again. What that dictionary will now do is have many rows, where each row contains many columns, and each row represents one word. This way, we can represent each word as a list of numbers, that is, a vector.

The model has an embedding matrix of shape V × d, where V is the vocabulary size (the number of rows) and d is the embedding dimension, the number of columns in each row.

Let's assume a very simple embedding matrix where the only vocabulary is ["The", "cat", "sat", "."] and d is 3. Real models have hundreds of thousands of rows, with potentially thousands of dimensions, to keep every word's representation unique.

It may look something like this:

So now,

ID[101] = [0.1, 0.3, -0.4]

ID[345] = [0.5, -0.6, 0.2]



ID[789] = [0.3, 0.8, -0.9]

ID[12] = [-0.7, 0.2, 0.4]

This embedding matrix isn't just random, though. We also train it traditionally, with random initial weights, a forward pass, a loss, and gradient descent. But the objective this time is to predict which word comes next, such as "The cat sat on the ___", where ___ is "mat". The truth-label/target equivalent here is that it has to predict "mat". Training an embedding matrix makes similar words end up next to each other, or close, like "cat" and "dog", whereas words like "cat" and "quantum" will be really far apart in the embedding space, so that the model knows the relationships and context of each word.
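In code, the embedding step is just a row lookup. Here is a sketch using the toy 4 × 3 matrix above; the mapping from ID to row index is my own assumption for this illustration (in a real model the token ID is simply the row index):

```python
import numpy as np

# Toy 4 x 3 embedding matrix: one row per vocabulary word, 3 dimensions per row.
embedding_matrix = np.array([[ 0.1,  0.3, -0.4],   # "The"  (ID 101)
                             [ 0.5, -0.6,  0.2],   # "cat"  (ID 345)
                             [ 0.3,  0.8, -0.9],   # "sat"  (ID 789)
                             [-0.7,  0.2,  0.4]])  # "."    (ID 12)

# Map each token ID to its row in the matrix.
row_of = {101: 0, 345: 1, 789: 2, 12: 3}

sentence_ids = [101, 345, 789, 12]
embedded = np.array([embedding_matrix[row_of[i]] for i in sentence_ids])

print(embedded)   # each token is now a 3-dimensional vector
```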

Positional Encoding

While the transformer can see the whole sentence at once, it doesn't know which word comes first. It needs a way to differentiate "The cat sat." and "Sat the cat."

Our example has three words, at positions 0, 1, 2, and since each of them is represented by an embedding of dimension 3, the dimensions have indices 0, 1, 2 as well. This means each positional encoding will also have three values, and later we perform an operation combining the encoding with the embedding.

Formula:

PE(pos, 2i) = sin(pos / 10000^(2i/d)),  PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

The 2i and 2i+1 are not values you substitute numbers into; they just say "use the sine form when the dimension index is even and the cosine form when it is odd".
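So, for our three positions and d = 3, here is a minimal sketch that computes each encoding, a straightforward implementation of the formula above:

```python
import numpy as np

d = 3            # embedding dimension
positions = 3    # positions 0, 1, 2 as in the example

pe = np.zeros((positions, d))
for pos in range(positions):
    for k in range(d):
        angle = pos / (10000 ** ((k - k % 2) / d))   # the exponent uses 2i for both 2i and 2i+1
        pe[pos, k] = np.sin(angle) if k % 2 == 0 else np.cos(angle)

print(pe)   # one 3-value encoding per position, added to the embeddings next
```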
Now we add each of these positional vectors to the corresponding embedding vector:

Next, we pass these on to the self-attention mechanism.

In self-attention, we want to perform dot products of the entire 3-dimensional encoding space with another three 3x3 matrices, called W_Q, W_K, and W_V, which produce a Query, Key, and Value vector for each word.

Q = Query: what this word is "looking for" (its relevance to other words).

K = Key: how relevant this word is to other words.

V = Value: the content or meaning of this word to be shared with others.

These are not random either; they are also trained traditionally using backpropagation. In training W_Q, W_K, and W_V, we follow all the steps we have done so far and then assign random values to these weight matrices. Training usually has a goal, like predicting the next word in "The cat sat on ___", where "___" is "mat". This serves as the equivalent of the "truth label" for the transformer.

Since the weights start out random, the model will output a nonsensical word, perhaps something like "sun", calculate the loss, and backpropagate. This is repeated thousands or millions of times during training to build the three weight matrices. After training, these three are the same for every sentence, and their size depends on how large the encoding is: their size always matches the encoding itself.


Sophisticated modern models use multiple attention heads, such as one for grammar rules, another for something else, and so on. Here, we're only using one.

We'll assume trained 3x3 matrices for W_Q, W_K, and W_V and apply them to our example:
Additionally, do the same for every other word (which won't be shown here).

Next, we perform a dot product of each pair of Q and K vectors. Since these can get quite large, we divide them by the square root of the dimension, √d.

These are the "attention scores."



Apply Softmax to get Attention Weights:


Final step: the weighted sum of the V vectors.
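To show the full sequence of operations (project with W_Q, W_K, W_V, compute the scores, scale by √d, apply softmax, then take the weighted sum of V), here is a sketch in Python; the word vectors and weight matrices are random placeholders rather than the trained values assumed above:

```python
import numpy as np

np.random.seed(0)
d = 3                                  # dimension of each word vector
X = np.random.randn(3, d)              # 3 word vectors (embedding + positional encoding), placeholders

# Placeholder "trained" weight matrices.
W_Q, W_K, W_V = (np.random.randn(d, d) for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V    # queries, keys, values for every word

scores = Q @ K.T / np.sqrt(d)          # dot product of every Q with every K, scaled by sqrt(d)

# Softmax each row so every word's attention weights sum to 1.
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights = weights / weights.sum(axis=1, keepdims=True)

context = weights @ V                  # weighted sum of the V vectors: context-aware vectors

print(weights)   # attention weights
print(context)   # one context-aware vector per word
```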



Now, what did this achieve?

Now that the model has a context-aware vector for every word, it will try to generate the one next word that it thinks fits the response. That isn't really applicable here, since our embedding matrix and W_Q, W_K, W_V have only three dimensions, for only three words, "The", "cat", "sat". However, if we assume we redid this calculation with a more realistic embedding matrix and W_Q, W_K, W_V matrices covering more vocabulary, it could try predicting what the next word would be. Assuming our W_Q, W_K, W_V are already trained, it's highly likely that its prediction will be correct: it is likely to predict "The" "cat" "sat" "on". To predict the next word after "on", it will reconsider the entire sentence so far and do all the math from the start, like we did for "The" "cat" "sat", but now including "on". After it gets context-aware vectors for every single word once again, it will try to predict the next word, such as "the" once more, then reconsider the entire sentence, redo the math again, and then predict "mat".

*Refer to the "Extra Appendix" for the difference between training an Embedding Matrix and training the attention matrices.

Conclusion
In conclusion, Linear Algebra concepts play the bigger role in the input and output layers, through techniques such as SVD and PCA, while matrices, dot products, and vectors are still heavily used inside the hidden layers. Most of the hidden-layer work and backpropagation, however, depend on Calculus. Calculus is what trains the models; Linear Algebra is what runs them: Calculus is the fuel, Linear Algebra the car.

The outcome of this essay is my own early learning, which is directly relevant to my intended career as an ML engineer, together with my intention to teach the topics of this essay in a clear, simplified, and concise manner so that anyone interested can learn them with ease.

Bibliography:

Accel.AI. "SVD Algorithm Tutorial in Python." Accel AI Anthology, 17 Aug. 2022, https://www.accel.ai/anthology/2022/8/17/svd-algorithm-tutorial-in-python.

Brownlee, Jason. "Principal Component Analysis for Dimensionality Reduction in Python." Machine Learning Mastery, 25 Feb. 2021, https://machinelearningmastery.com/principal-components-analysis-for-dimensionality-reduction-in-python/.

Ng, Andrew, et al. "Unsupervised Feature Learning and Deep Learning Tutorial." UFLDL Tutorial, Stanford University, https://ufldl.stanford.edu/tutorial/unsupervised/PCAWhitening/.

Asq, Arman. "Understanding Self-Attention - A Step-by-Step Guide." Arman Asq Blog, 20 June 2022, https://armanasq.github.io/nlp/self-attention/.

"Architecture and Working of Transformers in Deep Learning." GeeksforGeeks, 30 Jan. 2023, https://www.geeksforgeeks.org/architecture-and-working-of-transformers-in-deep-learning/.

Brownlee, Jason. "The Transformer Model." Machine Learning Mastery, 15 June 2020, https://machinelearningmastery.com/the-transformer-model/.

Brownlee, Jason. "The Transformer Attention Mechanism." Machine Learning Mastery, 25 June 2020, https://machinelearningmastery.com/the-transformer-attention-mechanism/.

Acknowledgments:

Thanks to the "Auto-LaTeX Equations" extension in Google Docs, I was able to typeset every equation.

This was an out-of-syllabus topic using university-level concepts, so I could not learn it through formal schooling and had to teach myself these topics. Thanks to some online resources I found, I was able to teach myself the mathematical concepts required for the essay.

I would like to extend my gratitude to the resources, particularly the Math 53 multivariable calculus worksheets from UC Berkeley and the linear algebra exercises by John M. Erdman, that provided a solid foundation for understanding the advanced mathematics necessary for this essay.

Additionally, I would like to thank my Math teachers for giving me additional feedback, giving me an opportunity to refine my research question, and suggesting that I add prerequisite descriptions of Linear Algebra and Multivariable Calculus so that any peer who is still in high school and has not yet covered these topics can understand this essay.

Appendices

Appendix 1: Matrices, Arithmetic Operations on Matrices:

To start with a simple definition, a matrix is a block of numbers arranged in rows and columns. A simple 2x2 matrix has 2 rows (horizontal) and 2 columns (vertical). Matrices can extend to more rows and columns, such as a 3x3 matrix.


Now what can we actually do with these?

You can add or subtract matrices in a very simple way, provided they have the same size: just add or subtract the entries that sit in the same position.

What About Multiplication?

Matrix multiplication looks quite complex at first, but I will explain it as we go through an example.

For matrix multiplication, the two matrices don't need to be the same size: as long as each column of the second matrix has the same length as each row of the first matrix, they can be multiplied.

When you multiply two matrices, each entry of the result comes from pairing one row of the first matrix with one column of the second: multiply the first entry of the row by the first entry of the column, the second by the second, and so on, and then add those products together. That sum becomes the entry of the result in that row and column.

Now an example with two rows and two columns:
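Using two arbitrary 2x2 matrices of my own as an illustration:

\[
\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}
\begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}
=
\begin{pmatrix} 1\cdot 5 + 2\cdot 7 & 1\cdot 6 + 2\cdot 8 \\ 3\cdot 5 + 4\cdot 7 & 3\cdot 6 + 4\cdot 8 \end{pmatrix}
=
\begin{pmatrix} 19 & 22 \\ 43 & 50 \end{pmatrix}
\]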

Appendix 2: Vectors, Dot Products

A vector is simply an n×1 or 1×n matrix, which is like an arrow from the origin of an n-dimensional space Rⁿ. A vector has both a direction and a magnitude, and any two vectors have an angle between them.

A dot product is performed when multiplying two vectors to find a scalar value, which can give us the angle between them. A cross product can be performed on 3-dimensional vectors to find a special third vector that is perpendicular to both of the originals.

Cross products require cofactor expansions and determinants, which are more complex; I'll explain them along with the next topic.

Important keyword: “Orthogonal”

For two vectors to be “Orthogonal”, it means they have a dot product of zero.

Geometrically, this means they are perpendicular.

Appendix 3: Matrix Properties, Determinants

Determinants

The determinant is a number associated with a square matrix that tells us if the matrix is

invertible.

For a 2x2 matrix

A = [ a  b ; c  d ]

the determinant is calculated as:

det(A) = ad − bc

If the determinant is 0, the matrix is singular (non-invertible); otherwise, it's invertible.

Invertible Matrix and Inverse Example

A matrix is invertible if there exists another matrix such that multiplying the two gives the identity matrix. For a 2x2 matrix A with non-zero determinant, its inverse is:

A⁻¹ = (1 / (ad − bc)) [ d  −b ; −c  a ]

and multiplying A by A⁻¹ simplifies to the identity matrix. Such a matrix is invertible precisely because its determinant is non-zero.

Cofactor Expansions

When we want to find the determinant of a 3x3 matrix, we need to break it into multiple 2x2 matrices.

For each entry in the first row, we form a small 2x2 matrix, a "minor", by excluding the row and the column of the entry we are expanding on. Let's name the three minors A, B, and C.

However, cofactor expansions follow a sign pattern that goes + − + − + −, so the second cofactor term (the one for the middle entry of the first row) is subtracted.

Now we can individually take the determinant of each 2x2 minor, multiply it by its first-row entry and its sign, and add everything up to get the determinant of the 3x3 matrix.

Appendix 4: Cross Products, Magnitude of a vector

Cross products work in largely the same way as the cofactor expansion above, but we'll go over how the calculation is set up.

Essentially, we build a 3x3 determinant whose first row holds the unit vectors i, j, k, and we put the first, second, and third components of each of the two vectors under those variables, in the second and third rows. Now we can take the cofactor expansion of this determinant, and the value obtained for each variable becomes that coordinate of the cross-product vector.

Magnitude of a Vector

The magnitude of a vector tells us how long it is, that is, how far it reaches from the origin. It can be calculated in a simple way using the Pythagorean Theorem:

|v| = √(v₁² + v₂² + … + vₙ²)

This works for a vector in any n-dimensional space.

What Does Doing a Dot Product or a Cross Product actually achieve?

Dot Product: Measures how much two vectors align with each other. The result is a scalar that

represents the projection of one vector onto the other.

Cross Product: Results in a vector that is perpendicular to two original vectors and its magnitude

represents the area of the parallelogram formed by them.

We can calculate the angle between the two vectors from each product like this:

cos θ = (a · b) / (|a| |b|)  for the dot product, and  sin θ = |a × b| / (|a| |b|)  for the cross product.

Examples:
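As an illustrative example with vectors of my own choosing, take a = (1, 2, 2) and b = (2, 0, 1). Then a · b = 1·2 + 2·0 + 2·1 = 4, |a| = 3 and |b| = √5, so cos θ = 4 / (3√5) ≈ 0.596. The cross product is a × b = (2·1 − 2·0, 2·2 − 1·1, 1·0 − 2·2) = (2, 3, −4), which is perpendicular to both a and b, since its dot product with each of them is zero.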

Appendix 5: Vector Spaces

A vector space is a collection of vectors that can be added and scaled (multiplied by

scalars) while staying within the space. It forms the foundation for many linear algebra

concepts.

It needs to follow three conditions, called axioms, in order to be a vector space:

1. Closed under Addition and Scalar Multiplication:

If you add vectors in the space or multiply them by any scalar, the result must also be in the vector space.

2. Zero Vector:

The vector space must contain a zero vector.

3. Inverses exist:

Every vector in the space has an additive inverse that is also in the space.

Appendix 6: Eigenvalues and Eigenvectors

Eigenvalues and Eigenvectors

Usually, a vector has both a direction and a magnitude. When you multiply a vector by a matrix, the matrix will generally change both. However, for any given matrix there are some special vectors that keep their direction and are only scaled, stretched or shrunk. These special vectors are called eigenvectors, and their scaling factor is called the eigenvalue.

To start, we have to find the eigenvalues of a matrix. Let's take an example matrix.

Subtract from this matrix an identity matrix multiplied by lambda (multiplying a matrix by a single number such as lambda just means multiplying every entry of the matrix by it). An identity matrix has ones on the diagonal and zeros everywhere else.

Find the determinant of the result, and set it equal to zero.

Solving this gives the eigenvalues 3 and 2.

How to Find Eigenvectors?

To find the eigenvectors, we go back to the matrix where we subtracted lambda from Matrix A, and replace lambda with each eigenvalue we found:

For eigenvalue = 3:

Now we multiply this matrix by a vector with one unknown per dimension, so here v₁ and v₂.

Multiplying them out and setting the result equal to zero gives a system of equations. Our goal is not to find specific values of v₁ or v₂, but to express them in terms of each other and take the coefficients as the entries of our eigenvector.

For here, an eigenvector can't be zero in every entry, and v₁ can be any number since no equation constrains it, so we pick it as 1.

For the next eigenvalue:

Eigenvalue = 2
Appendix 7: Partial Derivatives



Definition: A partial derivative represents the rate of change of a multivariable function with

respect to one variable, keeping the other variables constant.
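For example (a function of my own choosing): if f(x, y) = x²y + 3y, then ∂f/∂x = 2xy, treating y as a constant, and ∂f/∂y = x² + 3, treating x as a constant.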



Appendix 8: Gradient of a Multivariable Function


Definition: The gradient of a multivariable function f(x, y, z) is a vector containing all of its partial derivatives with respect to each variable:

∇f = (∂f/∂x, ∂f/∂y, ∂f/∂z)

It points in the direction of the steepest increase of f.

Appendix 9: Chain Rule on Entire Functions
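As an illustrative example of the kind of chain-rule calculation used in backpropagation (the functions here are my own, not taken from the essay's worked example): if the loss is L = (y − t)² and the output is y = wx + b, then by the chain rule ∂L/∂w = (∂L/∂y)(∂y/∂w) = 2(y − t) · x, and ∂L/∂b = (∂L/∂y)(∂y/∂b) = 2(y − t) · 1 = 2(y − t).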


If you were unfamiliar with any of these topics and have read this appendix up to here, you should now have all the prerequisites needed to read and comprehend the main essay.

Extra Appendix - Two Questions

1. What is the difference between training an embedding matrix and training the attention weights, if they are both predicting next words?

An embedding matrix doesn't really predict many words, and it has no idea of the context of a sentence either. All it does is look for common patterns, things that have already been said in its training data and are similar to the sentence it needs to predict, to gather a sense of what every word means. Self-attention matrices can understand the difference between two uses of the same word, e.g., "river bank" versus "money bank", whereas the embedding matrix is only trying to gather every word in the first place for the attention matrices to use; it's trying to make itself a massive dictionary, and it wants all the related words next to each other. Since cat and dog are both animals and are often compared, they are likely to be very close to each other in the embedding matrix, whereas the word "car" would likely be further away from "cat". This is so the model can quickly look up each word for the given context and embed it more efficiently.

2. What is “learning rate"?

The learning rate is a value we set, largely by trial and error, that trains the model at a specific pace. Too high a pace means the model will attempt to train as fast as possible, which sounds good, but with such a high learning rate it is very likely to overshoot its updates and perhaps end up with an even bigger loss than before the last adjustment. It's like cramming for a test you didn't study for and then just picking random MCQ options in the test.

A slow learning rate, on the other hand, gives the model more time to comprehend each feature and usually ends with a lower loss; the lower the learning rate, the lower the final loss is likely to be, and the more accurate the model. However, a low learning rate can be extremely slow to train and introduces a much heavier computational load, as it needs many more cycles of backpropagation than a moderate learning rate.

Therefore, we eventually trial-and-error different values of the learning rate η to find the most optimal one, which minimizes the loss while still finishing training quickly and using fewer computational resources.
