Linear Algebra Primer Concepts

Linear algebra concepts like vectors and spans are essential for data science and machine learning. Vectors represent data instances as points in an n-dimensional feature space, with each feature forming a dimension. Any vector can be represented as a linear combination of basis vectors, which are vectors chosen to represent the coordinate axes. The span of a set of vectors is the set of all possible linear combinations of those vectors, representing all points that can be reached. Linearly dependent vectors do not add to the span, while linearly independent vectors extend it by making new points reachable. These linear algebra concepts provide mathematical foundations and intuitions important for data modeling.


LINEAR ALGEBRA in Data Science and AI

Shameek Bhattacharjee (Asst. Prof. WMU, Dept. of CS)

Scribe: Shourav Das (UG Student, WMU, Dept. of CS)

This material is part of NSF grant OAC-2017389


Vectors in Computer Science

Physics → magnitude and direction.
Computer Science → an array of numbers (an ordered list of numbers).
Mathematics → a way of identifying points in space; each element of a vector gives the coordinate along one axis.

Vectors in CS are represented by a lower-case bold symbol, say v:
v = [v1 … vi … vn]

v represents a single point in a Cartesian coordinate system of n dimensions. Mathematically, v ∈ R^n (each element is in R, and the vector has n elements).

Each point can be represented as a vector → an arrow connecting the origin (the tail) to the point (the tip).

In figure (a), v(1) is a vector in an n = 2 dimensional coordinate system, and in figure (b) it is in an n = 3 dimensional coordinate system.

Convention: In linear algebra, every vector's tail is fixed to the origin of the coordinate system (different from physics).

Length of the arrow = magnitude of the vector.

[Figure a: three 2D vectors v(1) = (x1, y1), v(2) = (x2, y2), v(3) = (x3, y3) in the xy plane. Figure b: v(1) drawn in a 3D coordinate system.]
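A minimal NumPy sketch of this idea (the array values are made up for illustration): a data instance is just an ordered array of numbers, and its magnitude is the length of the arrow from the origin to that point.

import numpy as np

# a single data instance as an n = 3 dimensional vector (illustrative values)
v = np.array([2.0, -1.0, 3.0])

n = v.shape[0]                 # number of elements = number of dimensions
magnitude = np.linalg.norm(v)  # length of the arrow from the origin to the point
print(n, magnitude)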
Linear algebra in data science/AI/ML

Training example #   Feature #1   Feature #2
1                    x1           y1
2                    x2           y2
3                    x3           y3

Each row is a point/vector in the feature space: v(1) = (x1, y1), v(2) = (x2, y2), v(3) = (x3, y3).

A huge dataset with n features can be represented as points in an n-dimensional Cartesian coordinate system.
A vector of n elements is an n-dimensional vector, with one dimension for each element. This gives us
(i) visual intuition
(ii) other mathematical conveniences (we will see later)

Conclusion: the coordinate axes are equivalent to attributes/features/covariates/regressors/independent variables,
and the number of vectors you get is equal to the number of training data instances.
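A small sketch of this view (illustrative numbers): each row of the data matrix is one training instance, i.e., one vector/point in feature space.

import numpy as np

# 3 training examples, 2 features each (values made up for illustration)
X = np.array([[1.0, 2.0],    # v(1) = (x1, y1)
              [3.0, 0.5],    # v(2) = (x2, y2)
              [-2.0, 2.0]])  # v(3) = (x3, y3)

num_instances, num_features = X.shape  # rows = training instances, columns = features
v1 = X[0]                              # one row = one n-dimensional vector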
Linear algebra in data science/AI/ML

Vectors do not just represent data. They also help represent our model. Many types of machine learning models represent what they have learned as vectors. All types of neural networks do this: given some data, they learn dense vector representations of that data. These representations act as categories used to recognize new data.

[Figure: data is combined with a weight vector to produce a new vector.]
Basis Vectors and Linear Combinations

Let x̂ = (1, 0) and ŷ = (0, 1) be the unit vectors (of magnitude 1) aligned with the x and y axes.

Terminology Alert #1: Such unit vectors aligned with the axes, i.e., x̂ and ŷ, are called "BASIS VECTORS".

Question: Now v(1) = (x1, y1) can be represented in terms of x̂ and ŷ → How?

Ans: v(1) = (x1 · x̂) + (y1 · ŷ)
Likewise, v(3) = (−2, 2) = (−2 · x̂) + (2 · ŷ)

Terminology Alert #2: The above representation is called a LINEAR COMBINATION.

Conclusion: Any vector can be represented as a linear combination of its basis vectors.

[Figure: v(1) = (x1, y1), v(2) = (x2, y2), and v(3) = (−2, 2) plotted in the xy plane.]
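A quick numeric check of the linear-combination idea (v(3) = (−2, 2) is taken from the slide):

import numpy as np

x_hat = np.array([1.0, 0.0])   # basis vector aligned with the x axis
y_hat = np.array([0.0, 1.0])   # basis vector aligned with the y axis

v3 = -2 * x_hat + 2 * y_hat    # linear combination: stretch the basis vectors, then add
print(v3)                      # [-2.  2.]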
Linear Combination in Data Science

Any training data instance can be represented as a linear combination.
Example: the instance v(1) = (x1 · x̂) + (y1 · ŷ)

Mathematical meaning: the linear combination is obtained by stretching the x̂ and ŷ basis vectors by the scalar values x1 and y1 (a sum of two scaled unit vectors).

Physical meaning: v(1) is composed of x1 parts of feature x̂ and y1 parts of feature ŷ.

Two alternatives to visualize multiple training data instances (the training set):
(i) Points in the feature space (the axes)
(ii) A list of vectors
Basis Vectors “Choice” could be arbitrary

Instead of the unit vectors aligned with the axes, we could have picked virtually any pair of vectors as our basis vectors (e.g., v̂ and ŵ) and represented all other points in the dataset as a linear combination of these two new basis vectors v̂ and ŵ. It will still work the same.

Note: v̂ and ŵ are not aligned with the axes of the original coordinate system.

Terminology alert!
In R^2, x̂ = (1, 0) and ŷ = (0, 1) are called the "standard basis" (they are also orthonormal, i.e., unit length and perpendicular to each other). v̂ and ŵ are orthonormal with respect to each other, but not aligned with the standard basis.

Conclusion: basis vectors are a matter of choice. One can take liberties according to the nature of the problem.
Understanding the Span
• Definition:
The set of all linear combinations (nothing but points or vectors) that you can potentially reach given a set of vectors.

• Meaning:
Given any set of vectors (say two), what is the set of points you can reach in this coordinate system? In R^2, if no constraints are given, the two standard basis vectors produce a span equivalent to an infinite 2D plane sheet. In reality, though, there are often constraints.
Illustration of Span

Question: in R^3, the standard basis vectors x̂ = (1, 0, 0) and ŷ = (0, 1, 0) will give a span equal to ?

Ans: A plane sheet cutting through the origin.
Why we need span in Data Science
• Given a set of vectors, what can you do with them?
• You can add (or subtract) them, or multiply them by a scalar.

• You're given a list of vectors, and told you can only play with these vectors.
• See all the possibilities you can make with them. The set of all things you can make is the span of those given vectors.

• That the span is a subspace (a subset) is nice → it reduces the search space, for one.

• It's always good to have objects that are closed under certain operations, and subspaces are just that: closed under vector addition and scalar multiplication.

• This isn't true for most generic sets of vectors, but it is definitely true for the span of a set of vectors. So spans have nice properties.
Special case when given vectors line up

Note: if you are given two vectors that line up, the set of all linear combinations (addition and scalar multiples) now gives you
SPAN = just the line that these two vectors lie along, not a 2D plane sheet.
Reinforcing Linear Dependence and Span

Suppose a third vector v̂ = (2, 3) is on the span of your previous two vectors x̂ and ŷ.
How? → 2 x̂ + 3 ŷ = v̂

v̂ does not add to the span (no new points can be reached).

Terminology Alert!
All such vectors v̂ are called linearly dependent on the previous two vectors x̂ and ŷ.

Linearly dependent vectors:
(i) Do not add to the span
(ii) Can be expressed as a linear combination of other vectors

[Figure: v̂ = (2, 3) shown in the xy plane, already inside the span of x̂ and ŷ.]
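A minimal check of linear dependence (assuming NumPy): stack the vectors as rows and compare the matrix rank with the number of vectors; if the rank is smaller, at least one vector is a linear combination of the others.

import numpy as np

x_hat = np.array([1.0, 0.0])
y_hat = np.array([0.0, 1.0])
v     = np.array([2.0, 3.0])           # v = 2*x_hat + 3*y_hat, so it is dependent

vectors = np.vstack([x_hat, y_hat, v])
rank = np.linalg.matrix_rank(vectors)  # rank 2 < 3 vectors -> linearly dependent
print(rank, rank < vectors.shape[0])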
Reinforcing Linear Independence and Span
Suppose a third vector v̂ is not on the span of your previous two vectors x̂ and ŷ.

v̂ adds to the span (a whole new set of points can be reached using x̂, ŷ, and v̂).

Terminology Alert!
All such vectors v̂ are called linearly independent of the previous two vectors x̂ and ŷ.

Linearly independent vectors:
(i) Add to the span
(ii) Cannot be expressed as a linear combination of other vectors
(iii) Basis vectors need to be linearly independent to span the whole vector space

[Figure: v̂ protrudes up from the xy plane, unlocking another (third) dimension.]
SPAN and Linear Dependence in Data Science

If a vector is redundant and can be expressed as a combination of the first two, i.e., it is linearly dependent →
I can ignore this new variable while doing analysis.
This is a form of reduction while making sense of big data with lots of points.

If a vector is not in the span and it expands the span of the previous two vectors (adds a dimension), this third vector is called linearly independent w.r.t. the previous two vectors (because I cannot ignore this third vector).
Linear Transformations

• A linear transformation is essentially a function in linear algebra.

• It takes in an input vector x and produces an output vector y.

• x → (linear transformation) → y

• Geometry: the input vector moves over to its corresponding output → a notion of bending the vector space.
Transformation Contd..

• In linear algebra, the transformation of the vector space is linear.

• Meaning:

1. the origin remains the same before and after the transformation

2. the grid lines of the vector space remain parallel and evenly spaced after the transformation
Matrices

• Matrix → a way of packing information.

• i.e., taking in a vector (x, y) = (5, 7),
• I want to get an output vector (a, b).

• Find a matrix M = [m11 m12; m21 m22] such that
  x·m11 + y·m12 = a
  x·m21 + y·m22 = b

• Equivalently, M applied to (5, 7) is a combination of its columns:
  5·[m11; m21] + 7·[m12; m22] = [5·m11 + 7·m12; 5·m21 + 7·m22]
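A small NumPy sketch of this column view (the matrix entries are arbitrary, just to make the check concrete): multiplying M by v is the same as combining the columns of M with the entries of v as weights.

import numpy as np

M = np.array([[2.0, -1.0],   # [[m11, m12],
              [0.5,  3.0]])  #  [m21, m22]]  (illustrative values)
v = np.array([5.0, 7.0])     # x = 5, y = 7

out = M @ v                             # [a, b]
col_combo = 5 * M[:, 0] + 7 * M[:, 1]   # 5*(first column) + 7*(second column)
print(np.allclose(out, col_combo))      # True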
Matrices for Transformation

• First column of the matrix M → where the first basis vector lands after the transformation.

• Second column of the matrix M → where the second basis vector lands after the transformation.

• Interpretation 1:
• Matrices can be seen as transformations of the basis vectors.
Matrices
• Apart from interpreting matrices as linear transformations, there is another very important aspect.

• Matrices are a compact way of storing data containing multiple features (the columns) and a huge number of training examples (the rows).
Determinant of Matrix
The determinant of a matrix A (det A)
quantifies the factor by which area (or volume) changes (increases or decreases) under the linear transformation specified by the matrix A.

det A = 0 → the transformation squishes the vectors onto a line or a point (in 2D), i.e., onto a region with no area/volume.

det A is negative → the orientation of the space is flipped.

det A is positive → the transformation preserves orientation and does not squish vectors onto a line, a point, or any lower dimension compared to the input space.
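A quick numeric illustration (the matrices are made up): the unit square spanned by the basis vectors has area 1, and |det A| is the area of its image under A.

import numpy as np

A = np.array([[3.0, 1.0],
              [0.0, 2.0]])
print(np.linalg.det(A))    # 6.0 -> areas are scaled by a factor of 6

B = np.array([[1.0, 2.0],
              [2.0, 4.0]]) # second column = 2 * first column
print(np.linalg.det(B))    # 0.0 -> the plane is squished onto a line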
Understanding Det(A) = 0
Since a matrix is a transformation, it causes an input vector to land on some output vector.

If the determinant of the matrix (transformation) is zero, it means that the transformed output space has no area/volume.

In other words, when the determinant of a matrix is zero, the set of outputs collapses onto a lower-dimensional region: a line, a point, or (in 3D) a plane.
Inverse of Matrices
Say x̂ is a vector of variables, and A corresponds to some linear transformation that bends space.
We are looking for a vector x̂ (nothing but a point) which, after transformation by the matrix A, lands on a pre-specified vector v̂:

A x = v
A⁻¹ A x = A⁻¹ v
x = A⁻¹ v     (since A⁻¹ A = I)

This is playing the transformation in reverse with v̂ to see where it lands; wherever it lands is x̂.

When det(A) = 0, there is no inverse.
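A minimal sketch of "playing the transformation in reverse" (illustrative A and v): in practice np.linalg.solve is preferred over forming A⁻¹ explicitly.

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
v = np.array([5.0, 10.0])

x = np.linalg.solve(A, v)              # the point that lands on v after applying A
print(np.allclose(A @ x, v))           # True

x_via_inverse = np.linalg.inv(A) @ v   # same answer: x = A^-1 v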
Some Interpretations of Matrix Transformations
A x = v
Suppose you apply inputs x̂ to a system and observe an output v̂. The inherent nature of the system transforms x̂ into v̂. In such a case, we can solve for A from Ax = v; A may tell how much each input feature contributes to the observed output; a transfer function.

Suppose you know the output v̂ of a system and know how the system behaves, as specified by A, but there are some uncertainties. Playing the transformation in reverse with A⁻¹ and v̂, one can get an approximate idea of the values of the features next time.
Rank of a matrix
• Solutions are less likely to exist when the transformation squishes points onto a lower dimension.

• This interesting aspect has some fancy terminology, known as RANK.

• When the output of a matrix transformation is a line (i.e., one-dimensional), we say that this matrix transformation has RANK = 1.

• Similarly, if the output of a matrix transformation is a 2D plane, then its RANK = 2, and so on.

• In general, the term RANK means → the number of dimensions in the "output" of a matrix transformation.
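A short check with NumPy (illustrative matrices): a full-rank 2x2 matrix keeps the output two-dimensional, while a matrix whose columns line up collapses it to a line.

import numpy as np

full = np.array([[1.0, 0.0],
                 [0.0, 1.0]])
collapsed = np.array([[1.0, 2.0],
                      [2.0, 4.0]])        # columns are multiples of each other

print(np.linalg.matrix_rank(full))        # 2 -> output is a 2D plane
print(np.linalg.matrix_rank(collapsed))   # 1 -> output is a line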
Rank and Column Spaces

• The set of all possible outputs of a matrix transformation is known as → the column space.

• Remember that the columns of a matrix (transformation) tell you → where your basis vectors land after the transformation is applied.

• The span of those transformed basis vectors gives all possible outputs.

• So the column space is the span of the columns of your matrix.
Null Spaces

• The set of vectors that land on the origin (the zero vector) → the null space.

A x = v ⇒ A x = 0
When v happens to be the zero vector (0, 0), the null space gives you all possible solutions of the equation.
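A small sketch of finding the null space numerically (illustrative A): the rows of Vᵀ from the SVD that correspond to (near-)zero singular values span the null space.

import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 4.0]])          # rank 1, so Ax = 0 has non-trivial solutions

# null space via the SVD: right-singular vectors with (near-)zero singular values
_, s, Vt = np.linalg.svd(A)
null_basis = Vt[s < 1e-10].T        # each column spans part of the null space

x = null_basis[:, 0]
print(A @ x)                        # ~ [0, 0]: x lands on the origin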
Dot products
The dot product of two vectors (v̂ · ŵ) =
(length of the "projection" of the 2nd vector ŵ onto the first vector v̂) × (length of the first vector v̂).

Note: the order does not matter, i.e., it does not matter which vector is projected onto which.

When two vectors are generally pointing in the same direction, their dot product is positive.

[Figure: ŵ projected onto v̂.]
Dot products contd..

When two vectors are generally pointing in opposite directions, their dot product is negative.
Dot products

When two vectors are perpendicular (this can be viewed as pointing in neither the same nor the opposite direction), their dot product is zero.

What can dot products tell us? → They indicate correlation.
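A quick numeric illustration of the three cases (made-up vectors):

import numpy as np

v = np.array([2.0, 1.0])

print(np.dot(v, np.array([3.0, 2.0])))    # > 0: roughly the same direction
print(np.dot(v, np.array([-2.0, -1.0])))  # < 0: roughly the opposite direction
print(np.dot(v, np.array([-1.0, 2.0])))   # = 0: perpendicular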
DUALITY of DOT products and Matrix vector Multiplication
• Consider linear transformations that take in vectors in multiple dimensions (say 2D or above) and produce an output in 1D (a single number on the real number line, i.e., from vectors to numbers).

• This is the same as multiplying a 1x2 matrix by a 2x1 vector, which gives a single number (much like matrix-vector multiplication).

• 1x2 matrices are analogous to 2D vectors → DUALITY.

• The dot product is therefore equivalent to matrix-vector multiplication.

Duality contd..

• The dual of a vector → the linear transformation that it encodes.

• The dual of a (matrix) linear transformation into a one-dimensional space → a certain vector in that one-dimensional space.

• So vectors can be viewed as an embodiment of a linear transformation, and not merely as a single data point in a coordinate system.
CROSS PRODUCTS
Unlike dot products, in cross products the order matters: v̂ × ŵ = − ŵ × v̂.

If v is on the right of w (counter-clockwise rotation from v to w) → the area is positive.
It is negative otherwise.
CROSS PRODUCTS

If v is on the left of w (clockwise rotation from v to w) → the area is negative.

So, v̂ × ŵ = − ŵ × v̂.
Compute Cross Product
For the 2D cross product v̂ × ŵ, we write the coordinates of v̂ and ŵ as the first and second columns of a matrix, respectively. Then we just compute the determinant.

NOTE: Here the determinant represents the signed area of the parallelogram spanned by v̂ and ŵ, i.e., the factor by which the unit area is changed.
Compute Cross Product contd..

For a 3D cross product v̂ × ŵ, the second and third columns of the matrix contain the coordinates of v̂ and ŵ respectively, and the first column contains the basis vectors. Then we just compute the determinant.
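A small numeric sketch of both cases (made-up vectors): in 2D the cross product equals the determinant of the matrix with v and w as columns; in 3D np.cross returns the full vector.

import numpy as np

# 2D: signed area of the parallelogram spanned by v and w
v = np.array([3.0, 1.0])
w = np.array([1.0, 2.0])
area = np.linalg.det(np.column_stack([v, w]))  # 3*2 - 1*1 = 5
print(area, np.cross(v, w))                    # both give 5.0

# 3D: the result is itself a vector, and swapping the order flips the sign
a = np.array([1.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0])
print(np.cross(a, b), np.cross(b, a))          # [0 0 1], [0 0 -1]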
Cramer’s Rule
• A convenient method to solve a linear system of equations for just one single variable, without having to solve the whole system of equations.

• Let’s consider the following system of equations:
  a1·x + b1·y = c1
  a2·x + b2·y = c2

Let D be the determinant of the coefficient matrix, and Dx be the determinant formed by replacing the x column with the constant column.

• Using Cramer’s rule:
  x = Dx / D = det([c1 b1; c2 b2]) / det([a1 b1; a2 b2]),   D ≠ 0
Cramer’s Rule contd..
• Similarly, when solving for y, the y column is replaced with the constant column:
  y = Dy / D = det([a1 c1; a2 c2]) / det([a1 b1; a2 b2]),   D ≠ 0
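A minimal sketch of Cramer’s rule with NumPy (the coefficients are made up for illustration):

import numpy as np

# system:  2x + 1y = 5
#          1x + 3y = 10
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
c = np.array([5.0, 10.0])

D  = np.linalg.det(A)
Dx = np.linalg.det(np.column_stack([c, A[:, 1]]))  # replace the x column with c
Dy = np.linalg.det(np.column_stack([A[:, 0], c]))  # replace the y column with c

x, y = Dx / D, Dy / D
print(x, y, np.linalg.solve(A, c))   # Cramer's rule agrees with solve()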
Change of Basis
• A vector sitting in 2D space can be described with coordinates. We can think of each of the numbers as a scalar that stretches or squishes the basis vectors.

• If î and ĵ are the basis vectors, the first coordinate scales î and the second coordinate scales ĵ.

Question: What if we used different basis vectors, in a different grid?
Change of Basis contd..
• Space does not have a particular grid system. So someone might draw their own grid in the space, with a fixed origin.

• A vector in one grid (coordinate system) looks different in another grid (coordinate system), depending on the choice of the basis vectors.
• Now the question is: how do we translate between coordinate systems?
Change of Basis contd..
• Let’s say Mike has a different coordinate system than ours. To translate a vector from Mike’s coordinate system to our coordinate system, we scale each of his basis vectors (written in our coordinates) by the corresponding coordinates of the vector in Mike’s system and add them together.

[Figure: the same vector drawn in Mike’s coordinates and in our coordinates.]
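A small sketch of this translation (Mike’s basis vectors here are a hypothetical choice): put Mike’s basis vectors, expressed in our coordinates, as the columns of a matrix B; then B maps Mike’s coordinates to ours, and solving with B maps ours back to Mike’s.

import numpy as np

# Mike's basis vectors written in our coordinates (illustrative choice)
b1 = np.array([2.0, 1.0])
b2 = np.array([-1.0, 1.0])
B = np.column_stack([b1, b2])

v_mike = np.array([3.0, 2.0])        # coordinates of a vector in Mike's system
v_ours = B @ v_mike                  # = 3*b1 + 2*b2, the same vector in our coordinates

back = np.linalg.solve(B, v_ours)    # translate our coordinates back to Mike's
print(v_ours, back)                  # [4. 5.] [3. 2.]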
Eigenvector and Eigenvalue
• Let A be a square matrix. Then a nonzero vector v is an eigenvector of A if there exists a scalar λ such that

  A v = λ v

• The scalar λ is known as the eigenvalue corresponding to that eigenvector.
Eigenvector and Eigenvalue contd..

• During a transformation, eigenvectors remain on their own span.

• The matrix can only stretch or squish these vectors, like a scalar does.

• The factor by which an eigenvector gets stretched or squished is called its corresponding eigenvalue.
Eigenvector and Eigenvalue contd..
• Question: Can eigenvalues be negative?
☞ Yes, eigenvalues can be negative. An eigenvector with an eigenvalue of −1/2 (the yellow vector in the figure) means that the vector gets flipped and squished by a factor of 1/2.
NOTE: Although the vector gets flipped and squished by a factor of 1/2, it stays on the same line (its own span) without getting rotated off of it.

[Figure: the yellow eigenvector flipped and squished by 1/2.]

Eigendecomposition
• When we break mathematical objects into their constituent parts or find their properties, we can understand them better. For example, we understand the true nature of an integer when we decompose it into prime factors.
• Similarly, when we decompose matrices, we learn about their functional properties, which are not evident when we represent them as an array of elements.
• One of the most widely used matrix decompositions is the eigendecomposition, in which we decompose a matrix into a set of eigenvectors and eigenvalues.
Eigendecomposition contd..
• Suppose that a square matrix A has n linearly independent eigenvectors {v(1), . . . , v(n)} with corresponding eigenvalues {λ1, . . . , λn}.

• We may concatenate all of the eigenvectors to form a matrix V with one eigenvector per column: V = [v(1), . . . , v(n)].

• Similarly, we can concatenate the eigenvalues to form a vector λ = [λ1, . . . , λn]^T.

• If Λ = diag(λ) is the diagonal matrix with the eigenvalues on its diagonal, then the eigendecomposition of A is given by:

  A = V Λ V^(−1)
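A quick numerical check of A = V Λ V⁻¹ (illustrative matrix):

import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

eigenvalues, V = np.linalg.eig(A)     # columns of V are the eigenvectors
Lam = np.diag(eigenvalues)            # Λ: eigenvalues on the diagonal

A_rebuilt = V @ Lam @ np.linalg.inv(V)
print(np.allclose(A, A_rebuilt))      # True

# each eigenpair satisfies A v = λ v
v0 = V[:, 0]
print(np.allclose(A @ v0, eigenvalues[0] * v0))   # True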
Eigendecomposition contd..
• What does Eigendecomposition tell us about a matrix?
• A matrix is singular if and only if any of the eigenvalues are zero.
• If eigenvalues are all positive, then the matrix is called positive definite.
• If eigenvalues are all positive or zero-valued, then the matrix is called
positive semidefinite.
• If eigenvalues are all negative, then the matrix is called negative definite.
• If eigenvalues are all negative or zero-valued, then the matrix is called
negative semidefinite.

• [source: Deep Learning]


Singular Value Decomposition
• Eigendecomposition works only if a matrix is square. So when a matrix is not square, we use the singular value decomposition.
• The singular value decomposition is a commonly used method for decomposing a matrix into three other matrices.
• In other words, the singular value decomposition is the factorization of an n ✕ m matrix A as the product A = UΣV^T, where U and V are orthogonal matrices and Σ is a diagonal matrix (NOT necessarily a square matrix).
• The diagonal entries, σ1 ≥ σ2 ≥ … ≥ σm ≥ 0, are called the singular values of A. The columns of U are called the left-singular vectors and the columns of V are called the right-singular vectors of A.
Singular Value Decomposition contd..
• We can actually interpret the singular value decomposition of A in terms of the eigendecomposition of functions of A. The left-singular vectors of A are the eigenvectors of AA^T. The right-singular vectors of A are the eigenvectors of A^T A. The non-zero singular values of A are the square roots of the eigenvalues of A^T A. The same is true for AA^T.

• One of the most useful features of the singular value decomposition is that we can use it to partially generalize matrix inversion to non-square matrices.
[source: Deep Learning]
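A short sketch with NumPy (a made-up non-square matrix): np.linalg.svd returns U, the singular values, and V^T, and np.linalg.pinv gives the pseudo-inverse that partially generalizes inversion to non-square matrices.

import numpy as np

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0]])          # 2 x 3, not square

U, s, Vt = np.linalg.svd(A)              # A = U @ Sigma @ Vt
Sigma = np.zeros_like(A)
np.fill_diagonal(Sigma, s)
print(np.allclose(A, U @ Sigma @ Vt))    # True

A_pinv = np.linalg.pinv(A)               # pseudo-inverse built from the SVD
print(np.allclose(A @ A_pinv @ A, A))    # True: generalizes matrix inversion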
Helpful Resources and References

This document uses some snapshot geometric pictures from the YouTube channel of 3blue1brown for the geometry of linear algebra:
https://www.youtube.com/channel/UCYO_jab_esuFRV4b17AJtAw
Please check it out for other geometric interpretations beyond AI and data science.

Check out the linear algebra materials by Prof. Zico Kolter for mathematical formulas and proofs:
https://www.cs.cmu.edu/~zkolter/course/15-884/linalg-review.pdf
