Linear Algebra and Feature Selection - Course Notes
Ivan Manov
Course Notes
Table of Contents
Abstract
1.5 The Transpose of Vectors and Matrices, the Identity Matrix
4.11 Analysis of the Training and Testing Times for the Classifier and its Accuracy
Abstract

Many courses teach specific frameworks rather than starting with the fundamentals, which leaves you with knowledge gaps and a lack of full understanding. This course gives you an opportunity to build a strong foundation that would allow you to grasp complex ML and AI topics.

The linear algebra part of the course covers vectors, matrices, identity matrices, the linear span of vectors, and more. These are used in operations such as transposing matrices, taking linear combinations of vectors, and calculating eigenvectors and eigenvalues, all preparing you for the dimensionality reduction part of the course.

Dimensionality reduction is widely used in statistics, data analysis, and machine learning. This isn't surprising, as the ability to determine the most informative parts of a dataset helps with a variety of problems – slow training time, the possibility of multicollinearity, the curse of dimensionality, and more. Dimensionality reduction techniques help you avoid all these issues by selecting the parts of the data which carry the most information.

The course focuses on two of the most popular techniques: Principal Components Analysis (PCA) and Linear Discriminant Analysis (LDA). These methods transform the data you work with and create new features that carry most of the relevant information.
Linear algebra describes the concepts behind the machine learning algorithms for dimensionality reduction. It builds upon vectors and matrices, linear equations, eigenvalues, and eigenvectors. The course is intended for someone who wants to understand the math on which the algorithms are built, rather than someone who applies them as a black box.
Some of the most important skills required for achieving this goal:
• calculating and interpreting the covariance matrix
• comparing the performance of PCA and LDA for classification with SVMs
A quadratic equation is a second-order equation with a single unknown variable. "Order", in this case, refers to the highest power of the unknown variable:

ax² + bx + c = 0

The letters a, b, and c are constant coefficients, while x is the unknown variable. Depending on the coefficients:
• the equation can have two distinct solutions, also called "roots of the equation"
• the equation can have one repeated solution
• the equation can have no real solutions
The discriminant is the quantity that tells us how many real solutions the equation has. The formula for the discriminant of this equation looks like this:

D = b² − 4ac

The discriminant is just a number, computed by subtracting four times a times c from the square of the coefficient b. Based on it, we decide on the number of real solutions.
When D > 0, the equation has two distinct roots:

x₁,₂ = (−b ± √D) / (2a)

When D = 0, the two roots coincide and the single solution is:

x = −b / (2a)

When D < 0, there are no real solutions.
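As a quick illustration, here is a minimal Python sketch of these rules (an addition to these notes, not part of the original course material):

import math

def solve_quadratic(a, b, c):
    # Return the real roots of a*x**2 + b*x + c = 0, assuming a != 0
    D = b ** 2 - 4 * a * c                  # the discriminant
    if D > 0:                               # two distinct real roots
        return ((-b + math.sqrt(D)) / (2 * a),
                (-b - math.sqrt(D)) / (2 * a))
    if D == 0:                              # one repeated root
        return (-b / (2 * a),)
    return ()                               # no real roots

print(solve_quadratic(1, -3, 2))            # (2.0, 1.0)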
1.3 Vectors
Vectors can be represented graphically as arrows whose length is the magnitude of the vector, and whose direction is whichever way the arrow points.
Vector types:
• row vectors, e.g. [2 1]
• column vectors, e.g. [2 1]ᵀ – the same entries stacked vertically
(Figure: a two-dimensional vector drawn as an arrow in the xy-plane.)
Vector operations:
• addition
• subtraction
• dot product

For two three-dimensional column vectors p = [p₁ p₂ p₃]ᵀ and q = [q₁ q₂ q₃]ᵀ, the dot product is:

p · q = p₁ × q₁ + p₂ × q₂ + p₃ × q₃
Calculating a dot product
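A minimal NumPy sketch of these vector operations (added for illustration; the course code may differ):

import numpy as np

p = np.array([1, 2, 3])
q = np.array([4, 5, 6])

print(p + q)           # element-wise addition:    [5 7 9]
print(p - q)           # element-wise subtraction: [-3 -3 -3]
print(np.dot(p, q))    # dot product: 1*4 + 2*5 + 3*6 = 32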
1.4 Matrices
A matrix is a rectangular collection of numbers ordered in rows and columns.
(Figure: an example of a matrix.)
Matrix operations:
• addition (the shape of matrix A must be the same as the shape of matrix B)
• subtraction (the shape of matrix A must be the same as the shape of matrix B)
• multiplication (the number of columns of A must equal the number of rows of B)

Matrix multiplication – taking the dot product between each row vector in A and each column vector in B. Depending on the shapes, the result can be a matrix or a vector.
1.5 The Transpose of Vectors and Matrices, the Identity Matrix
Transpose of vectors – transforming row vectors into column ones and vice versa.
Transpose of matrices – transforming the rows of a matrix into columns and its columns into rows.
A = [1  2  4
     5 12  6]

Aᵀ = [1  5
      2 12
      4  6]

Transposing a matrix
Note: a non-square matrix changes its shape when we apply the transpose operation to it.
Identity matrix – a matrix that has ones on the diagonal and zeros elsewhere
I₂ = [1 0
      0 1]

The 2 × 2 identity matrix
Multiplying any m × n matrix A by the n-dimensional identity matrix gives us A again. That's why it is called the identity matrix:

Aₘₓₙ × Iₙ = Aₘₓₙ
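A short NumPy sketch of these matrix operations (an illustration added to these notes, using the example matrix A from above and an arbitrary matrix B):

import numpy as np

A = np.array([[1,  2, 4],
              [5, 12, 6]])      # a 2 x 3 matrix

B = np.array([[1, 0],
              [0, 1],
              [1, 1]])          # a 3 x 2 matrix

print(A @ B)                    # matrix multiplication: a 2 x 2 result
print(A.T)                      # transpose: a 3 x 2 matrix
print(A @ np.eye(3))            # multiplying by I_3 gives A back (as floats)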
A linear combination of vectors is an expression of the form:

λ₁v₁ + λ₂v₂ + ⋯ + λₙvₙ

v – vectors
λ – real numbers (scalars)
Linear span – the set of all possible linear combinations of a given set of vectors.
Each of the standard basis vectors has a single value one and zeros elsewhere. The index of e indicates the position on which the value one is located. If we're in a two-dimensional space, then e₁ is the vector [1 0]ᵀ and e₂ is [0 1]ᵀ.

For example, the vector α = [−1 2]ᵀ can be written in terms of e₁ = [1 0]ᵀ and e₂ = [0 1]ᵀ as:

α = −1 × e₁ + 2 × e₂
Linear independence
A set of vectors consists of linearly independent vectors when none of them is in the linear span of the rest of the vectors in the set. "Independent" means that no vector in the set is a multiple of another. "Linearly" comes from the fact that we form linear combinations with the rest of the vectors in the set.
µ₁ × v₁ + µ₂ × v₂ + ⋯ + µₙ × vₙ = 0

The vectors v₁, …, vₙ are linearly independent when this equation holds only for µ₁ = µ₂ = ⋯ = µₙ = 0.

Linear dependence

µ₁ × v₁ + µ₂ × v₂ + ⋯ + µₙ × vₙ = 0

The vectors are linearly dependent when the same equation has a solution in which at least one coefficient µᵢ is different from zero, i.e. at least one vector can be written as a linear combination of the others.
Vector space – a set of vectors that can be added and subtracted together, as well as multiplied by real numbers (scalars), with the results remaining in the set.
Basis of a vector space - a set of vectors whose number equals the dimension
of that space, or in other words - a set of vectors that are linearly independent of each
other, and their linear span is the entirety of the vector space.
A basis is a set of vectors that generates the whole vector space. Any vector in the vector space is in the span of the basis, which means that in order to obtain any vector from a certain vector space, it is sufficient to know its basis vectors and take a suitable linear combination of them. You can think of a basis of a vector space as a coordinate system for that space.
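To make these definitions concrete, here is a small NumPy sketch (our own illustration) that checks linear independence through the matrix rank and expresses a vector in a given basis:

import numpy as np

e1 = np.array([1, 0])
e2 = np.array([0, 1])
alpha = np.array([-1, 2])

M = np.column_stack([e1, e2])               # candidate basis vectors as columns

# Full rank (= number of vectors) means the vectors are linearly independent
print(np.linalg.matrix_rank(M))             # 2 -> e1 and e2 form a basis of the plane

# Coordinates of alpha in that basis: solve M @ coeffs = alpha
print(np.linalg.solve(M, alpha))            # [-1.  2.] -> alpha = -1*e1 + 2*e2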
Determinant of a matrix – a number that you can obtain from any square matrix. It tells us whether the matrix is invertible or not. For a 2 × 2 matrix, it is computed with the following mathematical equation:

det [a b] = a × d − b × c
    [c d]
A matrix is invertible if it is a square matrix and its determinant does not equal zero. On the other hand, if a matrix is non-square, meaning the number of its rows does not match the number of its columns, or if its determinant is zero, then it is non-invertible.
H = [2 3
     2 2]

det(H) = 2 × 2 − 2 × 3 = −2

Calculating the determinant of a matrix
Calculating the inverse matrix with the help of the adjugate matrix:

H⁻¹ = (1 / det(H)) × [ 2 −3
                      −2  2]

H⁻¹ = [−1  3/2
        1  −1]
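NumPy can reproduce both results directly (a hedged sketch added to these notes):

import numpy as np

H = np.array([[2, 3],
              [2, 2]])

print(np.linalg.det(H))    # -2.0 (up to floating-point rounding)
print(np.linalg.inv(H))    # [[-1.   1.5]
                           #  [ 1.  -1. ]]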
Linear equations
𝑎𝑥 + 𝑏 = 𝑐
a, b, c – constant coefficients
x – variable to be determined
x = (c − b) / a
System of equations
𝑎1 𝑥1 + 𝑎2 𝑥2 + 𝑎3 𝑥3 = 𝑦1
𝑏1 𝑥1 + 𝑏2 𝑥2 + 𝑏3 𝑥3 = 𝑦2
𝑐1 𝑥1 + 𝑐2 𝑥2 + 𝑐3 𝑥3 = 𝑦3
In matrix form, the system can be written as:

Ax = y

If A is invertible, we can multiply both sides by A⁻¹ on the left. Since A⁻¹A = I and Ix = x, the solution is:

x = A⁻¹y
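A minimal NumPy illustration of solving such a system (an added sketch with made-up coefficients; in practice np.linalg.solve is preferred over forming the inverse explicitly):

import numpy as np

A = np.array([[2.0, 1.0, 1.0],
              [1.0, 3.0, 2.0],
              [1.0, 0.0, 0.0]])
y = np.array([4.0, 5.0, 6.0])

x = np.linalg.solve(A, y)        # solves Ax = y without computing the inverse
print(x)                         # the solution vector
print(np.allclose(A @ x, y))     # True -> x indeed solves the system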
Another way to solve such a system, which also works when A is not invertible, is the Gauss method.
The Gauss method is a tool for solving systems of linear equations. Also known as Gaussian elimination, it aims to eliminate as many as possible of the unknown variables in each row of the matrix representing the system.
Augmented matrix – a combination of the matrix 𝐴 and the vector y from the
equation 𝐴𝑥 = y.
Augmented matrix
The goal is to find the vector x of unknown variables x₁, x₂, and x₃. The best scenario after the Gauss method is applied is to obtain one of the three unknowns directly from one of the rows of the system, with the other two eliminated. Then, after we have found one of the three variables, we substitute it into one of the other two rows to find a second unknown. In the end, we use the third row to find the third and final unknown variable.
[ 1 −2  5  25 |  8 ]   r2 → r2 + r1        [ 1 −2  5  25 | 8 ]
[−1  2 −5 −25 | −8 ]   r3 → r3 − 2 × r1 →  [ 0  0  0   0 | 0 ]
[ 2 −4 10  50 | 16 ]   r4 → r4 − 3 × r1    [ 0  0  0   0 | 0 ]
[ 3 −6 15  75 | 24 ]                       [ 0  0  0   0 | 0 ]

Only one non-trivial equation remains:

x₁ − 2 × x₂ + 5 × x₃ + 25 × x₄ = 8
Cases we can deal with regarding the solution of the equation Ax = b:
• the equation has a unique solution – a single vector that solves the equation
• the equation has a general solution – several (or infinitely many) vectors that solve the equation
• the equation has no solution
All these cases are determined after performing the Gauss method.
With the Gauss method's help, we can determine which of these cases a given system of equations falls into, as well as find the unknown variables in the equations and estimate the solution vector.
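To make the procedure concrete, here is a rough Python sketch of the forward-elimination step, applied to the augmented matrix from the example above (an illustration added to these notes; the course's own implementation may differ):

import numpy as np

def gauss_eliminate(aug):
    # Forward elimination on an augmented matrix [A | y]; a simple sketch
    # that skips columns without a usable pivot.
    M = aug.astype(float)
    rows, cols = M.shape
    pivot_row = 0
    for col in range(cols - 1):                    # the last column is the right-hand side
        candidates = np.nonzero(np.abs(M[pivot_row:, col]) > 1e-12)[0]
        if candidates.size == 0:
            continue                               # no pivot in this column
        swap = pivot_row + candidates[0]
        M[[pivot_row, swap]] = M[[swap, pivot_row]]
        for r in range(pivot_row + 1, rows):       # eliminate entries below the pivot
            M[r] -= (M[r, col] / M[pivot_row, col]) * M[pivot_row]
        pivot_row += 1
        if pivot_row == rows:
            break
    return M

aug = np.array([[ 1, -2,  5,  25,  8],
                [-1,  2, -5, -25, -8],
                [ 2, -4, 10,  50, 16],
                [ 3, -6, 15,  75, 24]])
print(gauss_eliminate(aug))    # only the first row survives; the rest turn to zeros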
Eigenvector – a vector that does not change its direction when we apply a certain linear transformation which, in turn, is the multiplication by its matrix.
Any linear transformation performed on a vector is described via a matrix. A rotation by an angle θ, for example, is described by the rotation matrix:
[cos(θ)  −sin(θ)
 sin(θ)   cos(θ)]
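A tiny NumPy check of this behaviour (added illustration):

import numpy as np

theta = np.pi / 2                     # rotate by 90 degrees
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

v = np.array([1.0, 0.0])
print(R @ v)                          # approximately [0, 1]: rotated, not scaled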
Eigenvalue - the scalar factor by which the matrix scales the eigenvector.
A = [a b
     c d]

To find the eigenvalues of a matrix, we solve the characteristic equation:

det(A − λI) = 0

A – the matrix A
I – the identity matrix
λ – an eigenvalue of the matrix A
Having found the eigenvalues, we obtain the corresponding eigenvectors from the equation:

Av = λv,   or equivalently   (A − λI)v = 0

To find each eigenvector:
• substitute the corresponding eigenvalue λ into (A − λI)v = 0
• apply the Gauss method using the same approach as with the equation Ax = b
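In practice, NumPy computes both at once; a minimal sketch with a simple diagonal matrix (our own example):

import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)    # columns of `eigenvectors` are the eigenvectors
print(eigenvalues)                              # [2. 3.]
print(eigenvectors)                             # [[1. 0.]
                                                #  [0. 1.]]

v = eigenvectors[:, 0]
print(np.allclose(A @ v, eigenvalues[0] * v))   # True: the defining property Av = λv holds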
Definitions:
Feature selection – reducing the number of input features used by various types of machine learning models in a way that leaves the predictive ability of the data intact. It is part of the broader stage known as "feature engineering".
Feature engineering – the stage which prepares the input dataset and makes it suitable for the machine learning algorithm.
Features – the columns of the dataset, e.g.:

              (feature 1)   (feature 2)   (feature 3)
House 1            87             3             6
House 2            94             4             7

Observations (samples) – the rows of the dataset (here, the individual houses).
High-dimensional data often lead to various problems which can impact the performance of the machine learning algorithm, making it unreliable. A dataset with many features usually means that the volume of the feature space is large, while the data points in it are few and far apart, which is problematic for the algorithm. In such cases, we could try to remove the irrelevant features without losing essential information from the dataset.
Reducing the number of features can boost the training accuracy of the model, and hence its predictive power. By applying feature selection, we examine the key features that carry the essential information, which is particularly useful when the volume of the feature space is large, while the data points in it are few and far apart.
Feature extraction – the process of transforming existing features into new ones
When applying feature selection, working with a subset of the original data means we'll have fewer features compared to the original dataset, hence the dimension of the space will be lower. In feature extraction, although the number of the newly constructed features might be the same as the original, we usually use only the most significant ones. In particular, evaluating the retained variance in the dataset helps us select those features that preserve the most information about our data.
The curse of dimensionality describes a set of problems that arise when working with high-dimensional data. The term was introduced by the mathematician Richard Bellman and refers to the difficulties we face when optimizing a function with many input variables. That said, the error can increase when adding extra features to the model.
New features can also be constructed as combinations of the original features.

Unsupervised dimensionality reduction shrinks the feature space using the data alone. The method needs no additional inputs, which makes the data analysis a straightforward process for the machine learning algorithm.
Principal Components Analysis (PCA) is an unsupervised algorithm used for dimensionality reduction. It is one of the oldest and most widely used techniques of this kind. Its goal is to reduce the number of features in a dataset to a smaller subset which preserves most of the information in the data.
(Figure: a three-dimensional dataset.)
The PCA algorithm constructs new axes, called principal components (PCs),
whose number equals the number of variables in our initial dataset. They are not the
same as the original features - these principal components capture the most variance
of the data in each direction. The goal is to capture the most valuable information in
the data.
(Figure: a two-dimensional dataset.)
These axes are ordered by the amount of retained variance in the data. Therefore, the first few components carry the most information about our data. That is why we can keep only the first few of them and discard the rest.

(Figure: the variance carried by principal components 1, 2, and 3.)
Retaining at least 80% of the original variance usually means we've kept the important information in the data. When performing PCA, we are normally interested in calculating and plotting metrics such as the variance captured by each separate component, as well as the cumulative variance. Plotting these can help us decide how many of the principal components we are going to use and which ones to discard. After we make this decision and project our standardized data onto the new axes (the PCs), we lower the dimension of our data, so dimensionality reduction occurs.
The principal components are constructed in a way that makes each next component perpendicular to the previous ones. We want axes that separate the points clearly, possibly distinguishing the groups of points better than the original axes do. PCA will find these axes as the directions along which the data varies the most.

Note: when the data we work with has only two explanatory variables, we can decide to take both components produced by PCA. That won't lower the dimension, but the new axes may still describe the structure of the data better than the original ones.

When our data is higher dimensional, after all the PCs have been constructed by PCA, if we want to lower the dimension, we need to project our data points onto a subset of them. This subset will carry enough of the variance in the data. The dimension is lower because we have chosen only the first few PCs, as mentioned before. This way, our new axes are fewer than the original ones.
Here, we concentrate on the main steps describing the process, rather than analysing
the code. The code itself can be fully observed in the course content.
California_Real_Estate_CSV.csv
The dataset contains 8 variables, 6 of which are numerical. PCA works with numerical data only, so we keep those six variables. The NaN values which appear after loading the data in Python must be discarded.
Before applying PCA, the data must be standardized so that each feature has:
Mean = 0
Standard deviation = 1
This way, every feature contributes comparably to each principal component.
Covariance matrix – a matrix that shows the covariance between any pair of variables in the dataset. If our data has n dimensions, the associated covariance matrix will have shape n × n. The covariance matrix is symmetric:

Cov(X₁, X₂) = Cov(X₂, X₁)

Each entry is computed as:

Cov(Xᵢ, Xₖ) = 1 / (N − 1) × Σₘ₌₁ᴺ (Xᵢ,ₘ − X̄ᵢ)(Xₖ,ₘ − X̄ₖ)

where N is the number of observations and X̄ᵢ, X̄ₖ are the means of features Xᵢ and Xₖ.
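A short NumPy sketch of building such a matrix from toy data (illustrative only):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))    # 100 observations of 3 standardized features

C = np.cov(X, rowvar=False)          # 3 x 3 covariance matrix (features are columns)
print(C.shape)                       # (3, 3)
print(np.allclose(C, C.T))           # True -> the covariance matrix is symmetric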
Eigendecomposition
𝑑𝑒𝑡(𝐶 − µ𝐼) = 0
C – covariance matrix
I – identity matrix
Having obtained the eigenvalues, we solve the linear system Cx = µx for every eigenvalue µ and find the eigenvector x for the corresponding eigenvalue. After determining all eigenvector–eigenvalue pairs, we can calculate the variance carried by each principal component from the eigenvalues. By ranking our eigenvectors from the largest to the smallest eigenvalue, we order the components by significance. To find the share of variance carried by a component, we divide the eigenvalue of the respective component by the sum of all eigenvalues.
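A condensed NumPy sketch of these PCA steps on toy data (our own illustration with made-up variable names; the course notebook, which uses the California real estate data, may differ; np.linalg.eigh is used here because the covariance matrix is symmetric, while the course calls np.linalg.eig):

import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 6))                  # 200 samples, 6 numerical features

X_std = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize: mean 0, std 1
C = np.cov(X_std, rowvar=False)                    # covariance matrix of the features

eigenvalues, eigenvectors = np.linalg.eigh(C)      # eigendecomposition of a symmetric matrix
order = np.argsort(eigenvalues)[::-1]              # rank eigenpairs from largest to smallest
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

explained = eigenvalues / eigenvalues.sum() * 100  # % of variance per principal component
print(np.cumsum(explained))                        # cumulative retained variance

k = 2                                              # keep the first k principal components
X_pca = X_std @ eigenvectors[:, :k]                # project the data onto the new axes
print(X_pca.shape)                                 # (200, 2)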
After performing PCA, we must interpret these components and their relationship to the original features. Since each component is a combination of the initial variables, we must find those with the biggest weights in the equation of each component, or, in other words, determine which variables influence each component the most.
The length of the overall mean and of the class mean vectors corresponds to the number of features.
➔ In this example there are two features (the Math and English exam scores), so the overall and the class mean vectors will be of length two.
Overall mean

[x̄₁
 x̄₂]

The first entry is the overall mean of the 1st feature – the Math exam scores; the second entry is the overall mean of the 2nd feature – the English exam scores.
Class means

Mean (male) = [56 53]ᵀ,  where 56 = (50 + 62) / 2 and 53 = (31 + 75) / 2
Mean (female) = [77 85]ᵀ,  where 77 = (73 + 81) / 2 and 85 = (80 + 90) / 2
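The same means can be computed in NumPy (the scores come from the example above; the code is our own illustration):

import numpy as np

male = np.array([[50, 31],               # columns: [Math, English]
                 [62, 75]])
female = np.array([[73, 80],
                   [81, 90]])

print(male.mean(axis=0))                       # class mean (male):   [56. 53.]
print(female.mean(axis=0))                     # class mean (female): [77. 85.]
print(np.vstack([male, female]).mean(axis=0))  # overall mean across all samples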
The goal of LDA is to find a linear combination of features that best separates two or more classes. The method was developed by the statistician Ronald Fisher.
When performed, LDA constructs new axes, called linear discriminants, onto which the data is projected. In LDA, the number of linear discriminants is at most c − 1, where c is the number of classes. Unlike PCA, LDA uses the data labels, constructing axes that separate the data points according to their labels rather than axes that simply retain the most overall variance.

(Figure: LDA overview.)
When applied, LDA finds the between- and within-class scatter matrices by
using the class means and the overall data mean. Then, LDA finds the eigenvalues
and the eigenvectors of the matrix obtained from multiplying the inverse of 𝑆𝑊 with
𝑆𝐵 . These eigenvectors are the linear discriminants, or the axes of our low-
dimensional space.
(µ₁ − µ₂)² / (s₁² + s₂²)  –  Fisher's Discriminant Ratio
Fisher's Discriminant Ratio represents the idea behind LDA: maximizing the distance between the class means, while minimizing the scatter within the classes. Considering the ratio, this means:
• maximizing the distance between each class mean and the centroid of the whole dataset, measured by the between-class scatter matrix S_B
• minimizing the spread of the samples within each class, measured by the within-class scatter matrix S_W

In this notation, dᵢ² is the distance between the centroid and the mean of class i (for instance, d₁² for class 1), and c is the number of classes.

With the between-class scatter matrix we measure how far away from each other the single class clusters are.
S_W = S₁ + S₂ + ⋯ + S_c   (the sum of the scatter matrices of all c classes)

Sᵢ = Σ (x − mᵢ)(x − mᵢ)ᵀ,   where the sum runs over all samples x in class Cᵢ and mᵢ is the class mean
The within-scatter matrix represents the sum of the covariance matrices for each
separate class. Each class covariance matrix shows the relations between the features
in this class.
ṽ = argmaxᵥ (vᵀ S_B v) / (vᵀ S_W v)

This ratio can also be expressed as the eigenvalue problem given by the following formula:
S_B v = µ S_W v

In other words, the linear discriminants are the eigenvectors of the matrix produced by the multiplication of the inverse within-scatter matrix with the between-scatter matrix, S_W⁻¹ S_B.
After finding the eigenpairs, we must arrange the eigenvalues from largest to smallest. The number of linear discriminants we will use when performing LDA depends on selecting the most significant of them.
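A compact NumPy sketch of these LDA steps on toy two-feature data (our own illustration; names and data are made up, and the course notebook organizes the code differently):

import numpy as np

rng = np.random.default_rng(2)
# Three classes of 30 samples each, with shifted means
X = np.vstack([rng.standard_normal((30, 2)) + mu for mu in ([0, 0], [3, 0], [0, 3])])
y = np.repeat([0, 1, 2], 30)

overall_mean = X.mean(axis=0)
S_W = np.zeros((2, 2))                             # within-class scatter
S_B = np.zeros((2, 2))                             # between-class scatter
for c in np.unique(y):
    Xc = X[y == c]
    mc = Xc.mean(axis=0)
    S_W += (Xc - mc).T @ (Xc - mc)
    diff = (mc - overall_mean).reshape(-1, 1)
    S_B += len(Xc) * (diff @ diff.T)

# Linear discriminants: eigenvectors of the inverse within-scatter times the between-scatter
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
order = np.argsort(eigvals.real)[::-1]             # largest eigenvalue first
print(eigvals.real[order])                         # at most (number of classes - 1) informative eigenvalues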
Here, we concentrate on the main steps describing the process, rather than analyzing
the code. The code itself can be fully observed in the course content.
winequality.csv
Since our data is numerical, one thing we should analyze is the range of values each column takes. In particular, the quality column, which shows how many different grades for wine quality we have, as well as what the actual grades are. Our goal will be to predict the wine grades, which means that quality will be our target column. We see that the quality ranges from 3 to 8, therefore there are 6 distinct classes.
The data consists of 12 columns overall, 11 of which are feature columns. The last one, called quality, is the column that represents the data labels. Our goal is to separate the wine samples and group them into clusters according to their quality grade.

Each row in the data represents a particular wine, while each column describes one of its measured characteristics. The quality grades have no numerical influence on the data; instead, they act as categorical class labels.
• Calculating the class mean – the class mean for each separate quality grade of wine

Our goal is to find the eigenvalues and eigenvectors of the product of the inverse of the within-scatter matrix and the between-scatter matrix, so that we can select the most significant linear discriminants and reduce the dimension.

• Calculating the covariance matrix for each separate quality grade and summing them to obtain the within-scatter matrix:

S_W = S₁ + S₂ + ⋯ + S_c

• Calculating the between-scatter matrix S_B
• Multiplying the inverse within-scatter matrix by the between-scatter matrix:

S_W⁻¹ S_B
From these eigenvectors we construct the linear discriminants – our new axes. These new axes will comprise our lower-dimensional space onto which we'll project the data points. In Python, this is a straightforward process.
• First, by using the method .eig() of the submodule linalg in NumPy, we find the eigenvalues and eigenvectors of this matrix
• Then, we divide each eigenvalue by the sum of all eigenvalues and multiply that by 100. This will give us the percentage of variance carried by each linear discriminant

To find the newly constructed axes' significance, we must add all eigenvalues and divide each of them separately by the overall sum to obtain the percentage of retained variance.
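For comparison, scikit-learn wraps all of these steps in its LinearDiscriminantAnalysis class; a hedged sketch on toy data (the course builds the steps manually, so this is only an equivalent shortcut):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)
X = np.vstack([rng.standard_normal((40, 11)) + shift for shift in (0.0, 1.5, 3.0)])
y = np.repeat([3, 5, 8], 40)                    # toy "quality" labels for three classes

lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)                 # project onto the first two discriminants
print(lda.explained_variance_ratio_ * 100)      # % of between-class variance each one retains
print(X_lda.shape)                              # (120, 2)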
We must project the data onto the discriminants that make up more than 80% of the total variance. That's where dimensionality reduction occurs – instead of the original number of features, we'll have fewer (in this case, two) new linear discriminants.

In the observed practical example, the first two linear discriminants make up around 95% of the total variance in the data. So, instead of the original eleven features, we'll have our two new linear discriminants, which correspond to the two largest eigenvalues.
• Splitting the dataset into training and testing parts and applying the standardization
• After the data is projected onto the linear discriminants in the case of LDA, and onto the principal components in the case of PCA – training a classifier (an SVM) on each projection
• Timing the classifier for the training and testing part for both LDA and PCA
• Running the functions which train and test the classifier on the projected data

Analysing the results confirms that there is a much better separation of the data classes when LDA is used.
Why do we split the initial features dataset into training and testing parts, and why do we standardize them differently?
We must first discuss the standardization tool – the Standard Scaler from scikit-learn. Its .fit() method takes each feature and computes its mean and standard deviation. These will then be used in the formula to rescale each feature. In the case of the Standard Scaler, the rescaled features end up with a mean of 0 and a standard deviation of 1. The .fit_transform() method both fits the scaler and transforms the features in one step; here, too, the resulting features have a zero mean and a unit standard deviation.
Answer: using .fit_transform() on the train data standardizes it. The method calculates the mean and standard deviation of each feature from the training data. We then use .transform() on the test data only. This means that we're applying a formula where µ is the mean of the training set for each feature, while σ is the standard deviation of the training set for each feature. Thus, we apply the statistics calculated from the training set onto the test set. The testing set is meant to serve as unseen data, to model real-life scenarios.
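A short scikit-learn sketch of this train/test standardization pattern (illustrative data and variable names):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = rng.normal(loc=10, scale=3, size=(100, 5))   # toy feature matrix

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # fit on the training set only, then transform it
X_test_std = scaler.transform(X_test)        # reuse the training mean and std on the test set

print(X_train_std.mean(axis=0).round(2))     # approximately 0 for each feature
print(X_train_std.std(axis=0).round(2))      # approximately 1 for each feature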
4.11 Analysis of the Training and Testing Times for the Classifier and its
Accuracy
Ideally, we'd like to time our code for both LDA and PCA and figure out how accurate each analysis is. To make the measurement more precise, we can run the functions for training and testing PCA and LDA a total of ten times and take the average of the results.

The results for the training times, as well as the prediction times, are very close. However, the real metric of importance is the classifier's accuracy. Before measuring it, we first construct the confusion matrix.
Confusion matrix – a square matrix that shows which samples were graded correctly and which were not. In our case, it shows the number of wines that we correctly identified as class 3, as well as the number of wines which are in class 3 but were misclassified. The same goes for the wines in classes 4 through 8 as well. We use the confusion matrix because it helps us improve our model by analysing the misclassified data and adjusting the parameters of the model accordingly. It gives a more detailed picture of the model's performance than a single accuracy measurement.
After the confusion matrix is ready, we measure the accuracy of the classifier on the test data.
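One possible way to compute both metrics with scikit-learn (the labels below are made up, and the exact calls in the course notebook may differ):

from sklearn.metrics import accuracy_score, confusion_matrix

y_test = [3, 4, 5, 5, 6, 7, 8, 6]           # true quality grades (assumed example values)
y_pred = [3, 5, 5, 5, 6, 7, 7, 6]           # the classifier's predictions

print(confusion_matrix(y_test, y_pred))     # rows: true classes, columns: predicted classes
print(accuracy_score(y_test, y_pred))       # share of correctly classified samples: 0.75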
Results:
LDA outperformed PCA by 9% in accuracy. As expected, LDA also did much better when it came to separating the different wine classes.
Authors
Aleksandar Samsiev
Ivan Manov
Email:
Email: [email protected]
Address: