
Aleksandar Samsiev

Ivan Manov

Linear Algebra and Feature Selection

Course Notes

Table of Contents

Abstract
Section 1: Linear Algebra Essentials
1.1 Why Linear Algebra?
1.2 Solving Quadratic Equations
1.3 Vectors
1.4 Matrices
1.5 The Transpose of Vectors and Matrices, the Identity Matrix
1.6 Linear Independence and Linear Span of Vectors
1.7 Basis of a Vector Space, Determinant of a Matrix and Inverse of a Matrix
1.8 Solving Equations of the Form Ax = b
1.9 The Gauss Method
1.10 Other Types of Solutions of the Equation Ax = b
1.11 Determining Linear Independence of a Random Set of Vectors
1.12 Eigenvalues and Eigenvectors
1.13 Calculating Eigenvalues
1.14 Calculating Eigenvectors
Section 2: Dimensionality Reduction Motivation
2.1 Feature Selection, Feature Extraction, and Dimensionality Reduction
2.2 The Curse of Dimensionality
Section 3: Principal Component Analysis (PCA)
3.1 An Overview of PCA
3.2 Step-by-step Explanation of PCA Through California Estates Example
3.3 The Theory Behind PCA
3.4 PCA Covariance Matrix in Jupyter – Analysis and Interpretation
Section 4: Linear Discriminant Analysis (LDA)
4.1 Overall Mean and Class Means
4.2 An Overview of LDA
4.3 LDA – Calculating Between- and Within-class Scatter Matrices
4.4 Step-by-step Explanation of Performing LDA on the Wine-quality Dataset
4.5 Calculating the Within and Between-Class Scatter Matrices
4.6 Calculating Eigenvectors and Eigenvalues for the LDA
4.7 Analysis of LDA
4.8 LDA vs PCA
4.9 Setting up the Classifier to Compare LDA and PCA
4.10 Coding the Classifier for LDA and PCA
4.11 Analysis of the Training and Testing Times for the Classifier and its Accuracy

Abstract

Linear algebra is often overlooked in data science courses, despite being of

paramount importance. Most instructors tend to focus on the practical application of

specific frameworks rather than starting with the fundamentals, which leaves you with

knowledge gaps and a lack of full understanding. This course gives you an

opportunity to build a strong foundation that would allow you to grasp complex ML

and AI topics.

The course starts by introducing basic algebra notions such as vectors,

matrices, identity matrices, the linear span of vectors, and more. These are used in

solving practical linear equations, determining linear independence of a random set

of vectors, and calculating eigenvectors and eigenvalues, all preparing you for the

machine learning part of the matter - dimensionality reduction.

The concept of dimensionality reduction is crucial in data science, statistical

analysis, and machine learning. This isn’t surprising, as the ability to determine the

important features in a dataset is essential - especially in today’s data-driven age

when one must be able to work with very large datasets.

Having hundreds or even thousands of attributes in your data could lead to a

variety of problems – slow training time, the possibility of multicollinearity, the curse

of dimensionality, or even overfitting the training data. Dimensionality reduction can

help you avoid all these issues, by selecting the parts of the data which carry

important information and disregarding the less impactful ones.



This course observes two staple techniques for dimensionality reduction –

Principal Components Analysis (PCA), and Linear Discriminant Analysis (LDA). These

methods transform the data you work with and create new features that carry most of

the variance related to a given dataset.

Keywords: linear algebra, dimensionality reduction, feature selection, PCA,

LDA

Section 1: Linear Algebra Essentials

Linear algebra describes the concepts behind the machine learning algorithms

for dimensionality reduction. It builds upon vectors and matrices, linear equations,

eigenvalues and eigenvectors, and more.

1.1 Why Linear Algebra?

Knowing linear algebra allows you to become a professional who understands

the math on which algorithms are built, rather than someone who applies them

blindly without knowing what happens behind the scenes.

Some of the most important skills required for achieving this goal are:

• basic and advanced linear algebra notions

• solving linear equations

• determining the linear independence of a set of vectors

• calculating eigenvalues and eigenvectors

• understanding the covariance matrix

• applying Principal Component Analysis

• applying Linear Discriminant Analysis

• performing dimensionality reduction in Python

• comparing the performance of PCA and LDA for classification with SVMs

1.2 Solving Quadratic Equations

By definition, a quadratic equation is an equation of second order with one

unknown variable. “Order”, in this case, refers to the highest power of the unknown

variable in the equation. In our case, that’s two.

A quadratic equation in its general form:

ax² + bx + c = 0
The letters a, b and c are constant coefficients, while x is the unknown variable.

Case scenarios regarding the number of possible solutions:

• the equation can have two distinct solutions, also called “roots of the

equation”

• the equation can have a single solution – a double root

• the equation has no existing solutions at all

Based on a number called the "discriminant", we decide on the number of solutions the equation has. The formula for the discriminant looks like this:

D = b² − 4ac

The discriminant is just a number computed by subtracting four times a times c from the square of the coefficient b.

• if the discriminant is a positive number, we have two distinct solutions

x₁,₂ = (−b ± √D) / (2a)

• if D equals zero, the quadratic equation has a single solution

x = −b / (2a)

• if the discriminant is a negative number, then there are no solutions to

the quadratic equation
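This case analysis translates directly into code. A minimal Python sketch (the function name solve_quadratic is just illustrative):

import math

def solve_quadratic(a, b, c):
    # Return the real roots of a*x**2 + b*x + c = 0, based on the discriminant.
    D = b**2 - 4*a*c
    if D > 0:                                    # two distinct solutions
        return ((-b + math.sqrt(D)) / (2*a), (-b - math.sqrt(D)) / (2*a))
    elif D == 0:                                 # a single (double) root
        return (-b / (2*a),)
    else:                                        # no real solutions
        return ()

print(solve_quadratic(1, -3, 2))   # (2.0, 1.0)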

1.3 Vectors

A vector is a one-dimensional object, characterized by magnitude and

direction, containing numbers as elements. Geometrically, vectors are denoted by

arrows whose length is the magnitude of the vector and whose direction is whichever way the vector points.

Vector types:

• row vectors, e.g. [2 1]

• column vectors, e.g. the same entries stacked vertically:
  [2]
  [1]

Geometrical representation of a vector

Algebraic operations we can perform with vectors:

• addition

• subtraction

• calculating the dot product

Dot product – a scalar number obtained by executing a specific operation on

the vector components:

p = [p₁, p₂, p₃]ᵀ        q = [q₁, q₂, q₃]ᵀ

p · q = p₁ × q₁ + p₂ × q₂ + p₃ × q₃

Calculating a dot product
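A quick NumPy check of the dot product formula, with arbitrary numbers:

import numpy as np

p = np.array([1, 2, 3])
q = np.array([4, 5, 6])

# p1*q1 + p2*q2 + p3*q3 = 4 + 10 + 18 = 32
print(np.dot(p, q))   # 32
print(p @ q)          # the @ operator computes the same dot product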

1.4 Matrices

Matrices are 2-dimensional objects, representing a collection of numbers in

rows and columns. Matrices can be considered as a collection of vectors ordered as

rows or columns.

    [a₁₁ a₁₂ a₁₃]
A = [a₂₁ a₂₂ a₂₃]
    [a₃₁ a₃₂ a₃₃]

An example of a matrix

Algebraic operations we can perform with matrices:

• addition (the shape of matrix A must be the same as the shape of matrix B)

• subtraction (the shape of matrix A must be the same as the shape of matrix B)

• multiplication (the number of columns in matrix A must match the number of rows in matrix B)

Matrix multiplication – taking the dot product between each row vector in A and each column vector in B:

A × B = [a₁₁ × b₁₁ + a₁₂ × b₂₁    a₁₁ × b₁₂ + a₁₂ × b₂₂]
        [a₂₁ × b₁₁ + a₂₂ × b₂₁    a₂₁ × b₁₂ + a₂₂ × b₂₂]

Matrix multiplication
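The same rule can be verified in NumPy with two small matrices (the values are chosen arbitrarily):

import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

# Each entry of the product is the dot product of a row of A with a column of B.
print(A @ B)
# [[1*5 + 2*7, 1*6 + 2*8],      [[19, 22],
#  [3*5 + 4*7, 3*6 + 4*8]]  =    [43, 50]]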

1.5 The Transpose of Vectors and Matrices, the Identity Matrix

Transposition - takes a matrix or a vector and transforms it into another matrix

or vector.

Transpose of vectors - transforming row vectors into column ones and vice

versa.

Transpose of matrices – transforming the row vectors in the matrix into

columns.

A = [1  2  4]        Aᵀ = [1   5]
    [5 12  6]              [2  12]
                           [4   6]

Transposing of a matrix

Aᵀ – the transpose matrix

Note: a non-square matrix changes its shape when we apply the transpose function to it.

Multiplying a scalar by a vector or a matrix - we scale vectors and matrices by

scaling each of their elements by this scalar

Identity matrix – a matrix that has ones on the diagonal and zeros elsewhere

I₂ = [1 0]
     [0 1]

2 × 2 identity matrix

If we have an m-by-n matrix A and multiply it by the n-dimensional identity matrix Iₙ, we simply obtain A once again. Similarly, multiplying the m-dimensional identity matrix by A gives us A again. That's why it is called the "identity matrix" – it acts like the number one does for the integers.

Aₘₓₙ × Iₙ = Aₘₓₙ

Iₘ × Aₘₓₙ = Aₘₓₙ
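Both the transpose and the identity property are easy to check in NumPy, reusing the 2×3 matrix from the transpose example above:

import numpy as np

A = np.array([[1, 2, 4],
              [5, 12, 6]])                 # a 2x3 (non-square) matrix

print(A.T)                                 # its transpose has shape 3x2
print(np.allclose(A @ np.eye(3), A))       # A times I_3 returns A -> True
print(np.allclose(np.eye(2) @ A, A))       # I_2 times A returns A -> True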

1.6 Linear Independence and Linear Span of Vectors

Linear combination of a set of vectors:

λ₁v₁ + λ₂v₂ + ⋯ + λₙvₙ

v – the vectors

λ – real numbers

Linear span - the set of all possible linear combinations of these vectors

Standard basis vector

Each of the standard basis vectors has a single value one and zeros elsewhere. The index of e indicates the position on which the one is located. If we're in a two-dimensional space, then e₁ is the vector [1, 0]ᵀ and e₂ is [0, 1]ᵀ.

α = [−1, 2]ᵀ        e₁ = [1, 0]ᵀ        e₂ = [0, 1]ᵀ

α = −1 × e₁ + 2 × e₂

α – a linear combination of e₁ and e₂

Linear independence

A set of vectors is linearly independent when none of them is in the linear span of the rest of the vectors in the set. "Independent" means that no vector in the set can be written as a combination of the others. "Linearly" is derived from the fact that we perform linear combinations with the vectors in the rest of the set.

µ₁ × v₁ + µ₂ × v₂ + ⋯ + µₙ × vₙ = 0 only when all coefficients µᵢ are zero

Linearly independent set of vectors

Linear dependence

A set of vectors is linearly dependent if there is a linear combination of them with non-zero coefficients that equals zero.

µ₁ × v₁ + µ₂ × v₂ + ⋯ + µₙ × vₙ = 0 for some coefficients µᵢ that are not all zero

Linearly dependent set of vectors



1.7 Basis of a Vector space, Determinant of a Matrix and Inverse of a Matrix

Vector space – a set of vectors that can be added and subtracted together, as well as multiplied by numbers, called scalars.

Basis of a vector space – a set of vectors whose number equals the dimension of that space; in other words, a set of vectors that are linearly independent of each other and whose linear span is the entire vector space.

A basis is a set of vectors that generates the vector space. Any vector in the vector space is in the span of the basis, so in order to obtain any vector from a certain vector space, it is sufficient to know its basis vectors and form a suitable linear combination of them. You can think of a basis of a vector space as the smallest set of vectors that generates it.

A determinant of a matrix - a number that you can obtain from any square

matrix. It characterizes some matrix properties – for example, whether it is invertible

or not.

Note: not all matrices can be inverted.

If a matrix can be inverted, then we express this with the following

mathematical equation:

A × A⁻¹ = I and A⁻¹ × A = I, where I is the identity matrix

A matrix is invertible if it's classified as a square matrix and its determinant does not equal zero. On the other hand, if a matrix is non-square, meaning the number of its rows does not match the number of its columns, or its determinant is zero, then it is non-invertible.

Finding the inverse of a matrix:

1. Calculating its determinant by multiplying the entries on both diagonals and, afterward, subtracting the results from each other

H = [2 3]
    [2 2]

det(H) = 2 × 2 − 2 × 3 = −2

Determinant of a matrix

2. Filling the entries in the adjugate matrix using the technique of crossing off rows and columns

3. Calculating the inverse matrix with the help of the adjugate matrix and the determinant

H⁻¹ = (1 / det(H)) × [ 2  −3]
                     [−2   2]

H⁻¹ = [−1    3/2]
      [ 1   −1  ]

Calculating the inverse matrix
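The determinant and the inverse from this example can be reproduced in NumPy:

import numpy as np

H = np.array([[2., 3.],
              [2., 2.]])

print(np.linalg.det(H))                    # -2.0 (up to floating-point error)

H_inv = np.linalg.inv(H)
print(H_inv)                               # [[-1.   1.5]
                                           #  [ 1.  -1. ]]
print(np.allclose(H @ H_inv, np.eye(2)))   # H times its inverse is the identity -> True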

1.8 Solving Equations of the Form Ax = b

Linear equations

ax + b = c

a, b, c – constant coefficients

x – the variable to be determined

x = (c − b) / a

System of equations

𝑎1 𝑥1 + 𝑎2 𝑥2 + 𝑎3 𝑥3 = 𝑦1

𝑏1 𝑥1 + 𝑏2 𝑥2 + 𝑏3 𝑥3 = 𝑦2

𝑐1 𝑥1 + 𝑐2 𝑥2 + 𝑐3 𝑥3 = 𝑦3

This system can be represented as multiplication of matrices:

𝐴𝑥 = y

    [a₁₁ a₁₂ a₁₃]        [x₁]        [y₁]
A = [a₂₁ a₂₂ a₂₃]    x = [x₂]    y = [y₂]
    [a₃₁ a₃₂ a₃₃]        [x₃]        [y₃]

Note: if the matrix A is invertible, we can multiply both sides of our equation by A inverse from the left: A⁻¹ × Ax = A⁻¹ × y. Multiplying by A inverse from the right (Ax × A⁻¹ = y × A⁻¹) does not help, because matrix multiplication is not commutative.

A⁻¹ × Ax = A⁻¹y, but A⁻¹ × A = I – the identity matrix

Ix = x

x = A⁻¹y
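A short NumPy illustration of solving Ax = y for an invertible A (the numbers are arbitrary):

import numpy as np

A = np.array([[2., 1., 1.],
              [1., 3., 2.],
              [1., 0., 0.]])
y = np.array([4., 5., 6.])

# For an invertible A, x = A^(-1) y; np.linalg.solve computes this more stably
# than forming the inverse explicitly.
x = np.linalg.solve(A, y)
print(x)
print(np.allclose(A @ x, y))   # True: the solution satisfies Ax = y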

If the matrix 𝐴 is non-invertible, we use a different approach – the Gauss

method.

1.9 The Gauss Method

The Gauss method is a tool for solving linear equations. Also known as

“Gaussian elimination”, the method involves algebraic operations, aiming to eliminate

as many as possible of the unknown variables in each row of the matrix, representing

a system of linear equations.

Augmented matrix – a combination of the matrix 𝐴 and the vector y from the

equation 𝐴𝑥 = y.

[a₁₁ a₁₂ a₁₃ | y₁]
[a₂₁ a₂₂ a₂₃ | y₂]
[a₃₁ a₃₂ a₃₃ | y₃]

Augmented matrix

The goal is to find the vector x of unknown variables x₁, x₂, and x₃. The best scenario after the Gauss method is applied is to obtain one of the three unknowns directly from one of the rows in the system, while eliminating the other two unknowns from that row. Then, after we have found one of the three variables, we substitute it into one of the other two rows so that we can find a second unknown. In the end, we use the third row to find the third and final unknown variable, using the two already known ones.

Gaussian elimination example:

[ 1  −2   5   25 |  8]                         [1  −2   5  25 |  8]
[−1   2  −5  −25 | −8]   r₂ → r₂ + r₁          [0   0   0   0 |  0]
[ 2  −4  10   50 | 16]   r₃ → r₃ − 2 × r₁  →   [0   0   0   0 |  0]
[ 3  −6  15   75 | 24]                         [3  −6  15  75 | 24]

[1  −2   5  25 |  8]                           [1  −2   5  25 |  8]
[0   0   0   0 |  0]                           [0   0   0   0 |  0]
[0   0   0   0 |  0]   r₄ → r₄ − 3 × r₁   →    [0   0   0   0 |  0]
[3  −6  15  75 | 24]                           [0   0   0   0 |  0]

Hence, any solution vector x must satisfy the single remaining equation:

x₁ − 2 × x₂ + 5 × x₃ + 25 × x₄ = 8

1.10 Other Types of Solutions of the Equation Ax = b

Cases we can encounter regarding the solution of the equation Ax = b:

• the equation has a unique solution - a single vector that solves the

equation

• the equation has a general solution - several (or infinitely many) vectors

that solve the equation

• the equation has no solutions

All these cases are determined after performing the Gauss method.

1.11 Determining Linear Independence of a Random Set of Vectors

With the Gauss method’s help, we can determine whether a given set of

vectors is linearly independent or not. For this purpose, we must:

• build the augmented matrix



• perform Gaussian elimination

• transform the augmented matrix back to linear system form

• find the unknown variables in the equations and estimate the solution

vector

• determine whether the set of vectors is linearly independent or not

based on values in the solution vector
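Instead of carrying out the elimination by hand, an equivalent check in NumPy uses the rank of the matrix built from the vectors – a shortcut rather than the step-by-step procedure above; the vectors here are arbitrary:

import numpy as np

v1 = np.array([1., 0., 2.])
v2 = np.array([0., 1., 1.])
v3 = np.array([1., 1., 3.])       # v3 = v1 + v2, so the set is linearly dependent

V = np.vstack([v1, v2, v3])       # stack the vectors as rows of a matrix

# The vectors are linearly independent exactly when the rank equals their number.
print(np.linalg.matrix_rank(V))               # 2
print(np.linalg.matrix_rank(V) == len(V))     # False -> linearly dependent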

1.12 Eigenvalues and Eigenvectors

Eigenvectors of a matrix – non-zero vectors that change only by a scalar factor when we apply a certain linear transformation, which, in this case, is multiplication by the matrix.


Any transformation performed on a vector is described via a matrix – for example, a rotation is described by the rotation matrix.

The rotation matrix:

[cos(θ)  −sin(θ)]
[sin(θ)   cos(θ)]

Eigenvalue – the scalar factor by which the matrix scales the eigenvector.

A = [a b]
    [c d]

v – an eigenvector of the matrix A

λ – an eigenvalue

Av = λv – the eigenvalue equation, from which the eigenvalues of a matrix are obtained

1.13 Calculating Eigenvalues

To obtain the eigenvalues of a matrix, we start from the equation Av = λv and solve the so-called "characteristic equation", which follows from it:

det(A − λI) = 0

A – the matrix A

I – the identity matrix

λ – an eigenvalue

The solutions of the characteristic equation represent the eigenvalues of the matrix A.

1.14 Calculating Eigenvectors

Each eigenvalue can correspond to a single eigenvector, or to many

eigenvectors. If a matrix A has an eigenvector v with an associated eigenvalue 𝜆, the

following equation holds:

Av = λv

which can also be rewritten this way:

(A − λ𝐼)v = 𝟎

To calculate the eigenvectors, we must:

• determine A − λI by replacing λ with the value of an already known eigenvalue

• obtain an eigenvector v from the resulting equation

• apply the Gauss method using the same approach as with the equation

Ax = b

• transform the obtained matrix into a linear system

• find the eigenvectors corresponding to the respective eigenvalues
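In Python, both the eigenvalues and the eigenvectors come out of a single call to np.linalg.eig (the matrix values here are arbitrary):

import numpy as np

A = np.array([[4., 1.],
              [2., 3.]])

eigenvalues, eigenvectors = np.linalg.eig(A)   # each column of `eigenvectors` is an eigenvector
print(eigenvalues)                             # e.g. [5. 2.] (the order may vary)

# Verify A v = lambda * v for the first eigenpair
v = eigenvectors[:, 0]
print(np.allclose(A @ v, eigenvalues[0] * v))  # True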



Section 2: Dimensionality Reduction Motivation

2.1 Feature Selection, Feature Extraction, and Dimensionality Reduction

Definitions:

Feature selection - the process of reducing the number of features used in

various types of machine learning models in a way that leaves the predictive ability of

the data preserved

Features - the transformed raw data after a process called “feature

engineering”

Feature engineering - the stage which prepares the input dataset and makes it

compatible with the machine learning algorithms

                 Features
              Square footage   Room count   Window count
House 1             87              3             6
House 2             94              4             7

(the rows, House 1 and House 2, are the observations, or samples)

A dataset example containing features and observations

High-dimensional data – a dataset where the number of features drastically exceeds the number of observations

High-dimensional data often leads to various problems which can impact the performance of the machine learning algorithm, making it unreliable. A set with many features usually means that the volume of the feature space is large, while the data points in it are few and far apart, and this is problematic for the algorithm. In such cases, we could try to remove the irrelevant features without losing essential information from the dataset. Reducing the features can boost the training accuracy of the model, hence its predictive power. By applying feature selection, we examine the key features that affect our model, removing the unnecessary ones as a result.

Data sparsity – cases when the volume of a high-dimensional dataset is very

large, while the data points in it are few and far apart

Feature extraction – the process of transforming existing features into new ones

– linear combinations of the originals

Dimensionality reduction = feature selection + feature extraction

When applying feature selection, working with a subset of the original data

indicates that we’ll have fewer features compared to the original dataset, hence the

dimensional space would be lower. In feature extraction, although the number of the

newly constructed features might be the same as the original, we usually use only the

most significant ones. In particular, evaluating the retained variance in the dataset

helps us select those features that preserve the most information for our data.

2.2 The Curse of Dimensionality

The phenomenon describes a set of problems that arise when working with a

high-dimensional dataset. The phrase “curse of dimensionality” was coined by the



mathematician Richard Bellman and refers to the difficulties we face when optimizing

a function with many input variables. That said, the error can increase when adding

dimensions, or similarly, features to the dataset.

Richard Bellman

Troubles related to the curse of dimensionality:

• Lack of data points in comparison to the high dimensionality of the

model

• Huge distance between the data points

• Increased complexity caused by the growing number of possible

combinations of features

• Possibility of including more noise in the training data

• Problems with the data storage

• Training taking more time – larger input datasets increase the

computational complexity leading to longer training periods



Dealing with the curse of dimensionality = applying dimensionality

reduction

To battle the issue of large and sparse datasets, we use dimensionality reduction to shrink the feature space. The method introduces no additional inputs, which makes the data analysis a more straightforward process for the machine learning algorithm.

Section 3: Principal Component Analysis (PCA)

3.1 An Overview of PCA

Principal Component Analysis, also known as PCA is a feature extraction

algorithm used for dimensionality reduction. It is one of the oldest and most widely

used dimensionality reduction techniques in unsupervised learning. PCA is used to

reduce the number of features in a dataset into a smaller subset which preserves

most of the information from the original set.

• Works mainly with numerical data

• Doesn’t require too much information about the data


Three-dimensional dataset

The PCA algorithm constructs new axes, called principal components (PCs),

whose number equals the number of variables in our initial dataset. They are not the

same as the original features - these principal components capture the most variance

of the data in each direction. The goal is to capture the most valuable information in

the data.

Two-dimensional dataset

These axes are ordered by the amount of retained variance in the data.

Therefore, the first few components carry the most information about our data. That

means we could discard several less important components, without experiencing a

significant loss of information.

Principal Components

Retaining at least 80% of the original variance usually means we've kept the

majority of important information. When we analyze the principal components, we

are normally interested in calculating and plotting metrics, such as variance captured

by each separate component, as well as cumulative variance. Plotting these can help

us decide how many of the principal components we are going to use and which

ones to discard. After we make this decision and we project our standardized data

onto the new axes (the PCs), we lower the dimension of our data, so dimensionality

reduction occurs.

PCA constructs the principal components so that they are uncorrelated, meaning there is no linear relationship between them. Therefore, the components are constructed in a way that makes each next component perpendicular to the previous ones. We want axes that separate the points clearly, possibly distinguishing the groups of points better than the original axes do. PCA will find as many of these axes as the dimension of our data.

Note: when the data we work with has only two explanatory variables, we can

decide to take both components produced by PCA. That won’t lower the dimension,

but rather give a more distinguishable view of the data.

When our data is higher dimensional, after all the PCs have been constructed

by PCA, if we want to lower the dimension, we need to project our data points onto a

subset of them. This subset will be carrying enough variance for the data. The

dimension is lower because we have chosen only the first few PCs, as mentioned before. This way, we end up with fewer axes than the original ones.

3.2 Step-by-step Explanation of PCA Through California Estates Example

The explanation consists of a practical example observed in Jupyter Notebook.

Here, we concentrate on the main steps describing the process, rather than analysing

the code. The code itself can be fully observed in the course content.

1. Exploring the data

California_Real_Estate_CSV.csv

The dataset contains 8 variables, 6 of which are numerical. PCA works with

numeric data, so we need to transform the categorical variables into numerical by

using dummy variables. The dataset needs to be preprocessed first.

2. Transformation of the data

The NaN values which appear after loading the data in Python must be

discarded.

• Standardizing the data – rescaling:

We want the algorithm to interpret all variables equally.

Mean = 0

Standard deviation = 1

• Constructing the principal components

• Analysing the variance captured by each principal component

Cumulative variance – the variance in the data captured up until a certain

component

• Choosing a subset of principal components which captures enough

variance – at least 80% of the total variance
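A minimal scikit-learn sketch of these steps. The array X stands for the already preprocessed numerical feature matrix described above (dummies created, NaN rows dropped); random numbers are used here as a stand-in:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(100, 8))   # stand-in for the preprocessed data

X_scaled = StandardScaler().fit_transform(X)         # rescale: mean 0, standard deviation 1

pca = PCA()
pca.fit(X_scaled)

print(pca.explained_variance_ratio_)                 # variance captured by each principal component
print(np.cumsum(pca.explained_variance_ratio_))      # cumulative variance -- keep enough PCs for ~80%

X_reduced = PCA(n_components=3).fit_transform(X_scaled)   # project onto, e.g., the first three PCs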



3.3 The Theory Behind PCA

Covariance matrix – a matrix that shows the covariance between any pair of variables in our initial dataset

If our data has n dimensions, the associated covariance matrix will have shape n×n.

8 dimensions => 8×8 covariance matrix

    [Cov(X₁, X₁)   Cov(X₁, X₂)   ⋯   Cov(X₁, Xₙ)]
C = [     ⋮              ⋱                ⋮     ]
    [Cov(Xₙ, X₁)   Cov(Xₙ, X₂)   ⋯   Cov(Xₙ, Xₙ)]

General form of the covariance matrix for a dataset with n variables X

The covariance between two variables is symmetric – the covariance between X₁ and X₂ is the same as the covariance between X₂ and X₁.

Cov(X₁, X₂) = Cov(X₂, X₁)

Each entry of the n×n covariance matrix is calculated for a pair of variables X and Y by the formula:

Cov(X, Y) = (1 / (N − 1)) × Σₘ₌₁ᴺ (Xₘ − X̄)(Yₘ − Ȳ)

X̄ and Ȳ – the means of the two variables respectively, with N the number of observations

The eigenvectors of the covariance matrix are the principal components.

➔ To find them, we must solve the eigenvector/eigenvalue problem -

Eigendecomposition

𝑑𝑒𝑡(𝐶 − µ𝐼) = 0
C – covariance matrix

I – identity matrix

µ - the unknown variable



Having obtained the eigenvalues, we solve the linear system 𝐂𝐱 = µ𝐱 for every

eigenvalue µ and we find the eigenvector x for the corresponding eigenvalue µ. After

determining all eigenvector- eigenvalue pairs, we can calculate the variance carried

in each principal component from the eigenvalues. By ranking our eigenvectors for

their eigenvalues from highest to lowest, we obtain the principal components in

order of significance. To compute the percentage of information accounted by each

component, we divide the eigenvalue of the respective component by the sum of all

eigenvalues.
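The whole procedure can be sketched in a few lines of NumPy; the standardized data X is simulated here, and the steps follow the description above:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))               # standardized data: observations x features

C = np.cov(X, rowvar=False)                 # the n x n covariance matrix

eigenvalues, eigenvectors = np.linalg.eig(C)
order = np.argsort(eigenvalues)[::-1]       # rank eigenpairs from highest to lowest eigenvalue
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

explained = eigenvalues / eigenvalues.sum() # share of variance carried by each principal component
print(explained)

X_projected = X @ eigenvectors[:, :2]       # project the data onto the first two principal components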

3.4 PCA Covariance Matrix in Jupyter – Analysis and Interpretation

After determining the number of principal components we will use in PCA, we must interpret these components and their relationship to the original features. Since each component is a linear combination of the initial variables, we must find the variables with the biggest weights in the equation of each component – in other words, which variables describe each of our components best.

• Loading the correlation between each component and each of our

variables

• Analysing the correlations and explaining the principal components via

the initial variables – interpreting the components

• Preparing the principal components to be the new axes

• Projecting the standardized data points onto the PCs



Section 4: Linear Discriminant Analysis (LDA)

4.1 Overall Mean and Class Means

Sample        Features                  Category
Student       Math        English       Gender
Jack          50%         31%           Male
Jessica       73%         80%           Female
James         62%         75%           Male
Georgia       81%         90%           Female

A dataset example containing students' test results

The length of the overall and the class means corresponds to the number of

features.

In this case the features are two,

➔ The overall and the class means vectors will be of length two.

Overall mean

The overall mean is a column vector of length two: the first entry is the overall mean of the 1st feature – the Math exam scores; the second entry is the overall mean of the 2nd feature – the English exam scores.

Class means

The number of class means equals the number of classes.

Mean(male) = [56]       Math score: (50 + 62) / 2 = 56
             [53]       English score: (31 + 75) / 2 = 53

Mean(female) = [77]     Math score: (73 + 81) / 2 = 77
               [85]     English score: (80 + 90) / 2 = 85
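The overall mean and the class means from this example can be reproduced with pandas:

import pandas as pd

df = pd.DataFrame({
    "Math":    [50, 73, 62, 81],
    "English": [31, 80, 75, 90],
    "Gender":  ["Male", "Female", "Male", "Female"],
}, index=["Jack", "Jessica", "James", "Georgia"])

print(df[["Math", "English"]].mean())   # overall mean: one entry per feature
print(df.groupby("Gender").mean())      # class means: Male -> [56, 53], Female -> [77, 85]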

4.2 An Overview of LDA

LDA is a supervised learning algorithm used for dimensionality reduction. The

goal of LDA is to find a linear combination of features to best separate two or more

classes. LDA aims to:

• maximize the distances between the classes

• minimize the variance (scatter) of data within each class

Ronald Fisher

When performed, LDA constructs new axes, called linear discriminants, where

the data is being projected. In LDA, the number of linear discriminants is, at most, c-1,

where c is the number of distinct classes in our data. We are interested in

constructing axes that better separate the data points according to their labels and

into well-formed clusters.

LDA Overview

4.3 LDA – Calculating Between- And Within-class Scatter Matrices

𝑆𝑊 – Within-class scatter matrix

𝑆𝐵 – Between-class scatter matrix

When applied, LDA finds the between- and within-class scatter matrices by

using the class means and the overall data mean. Then, LDA finds the eigenvalues

and the eigenvectors of the matrix obtained from multiplying the inverse of 𝑆𝑊 with

𝑆𝐵 . These eigenvectors are the linear discriminants, or the axes of our low-

dimensional space.

(µ₁ − µ₂)² / (s₁² + s₂²)  –  Fisher's Discriminant Ratio

Fisher’s Discriminant Ratio represents the idea behind LDA: maximizing the

distance between the class means, while minimizing the scatter. Considering the ratio,

this means maximizing the numerator, while minimizing the denominator.

Multi-class data – dataset containing more than two class labels

When we deal with multi-class data, we use the following approach:

• Calculating the overall mean of the data – centroid

• Measuring the distance from each class mean to the centroid

• Maximizing the distance between each class mean and the centroid

while minimizing the scatter for each category:

(d₁² + d₂² + d₃²) / (s₁² + s₂² + s₃²)

d₁ – the distance between the centroid and the mean of class 1

s₁² – the scatter of class 1

Calculating the between-class scatter matrix:

S_B = Σᵢ₌₁ᶜ Nᵢ (mᵢ − m)(mᵢ − m)ᵀ

Nᵢ – the number of points (observations) in the i-th class

mᵢ – the mean vector of the i-th class

m – the mean vector of the whole dataset

c – the number of classes

i – the class index

With the between-class scatter matrix, we measure how far apart the individual class clusters are.

Calculating the within-class scatter matrix:

S_w = Σᵢ₌₁ᶜ Sᵢ

Sᵢ – the covariance matrix for the i-th class

Sᵢ = Σ (x − mᵢ)(x − mᵢ)ᵀ, summing over all points x in class Cᵢ

The within-scatter matrix represents the sum of the covariance matrices for each

separate class. Each class covariance matrix shows the relations between the features

in this class.

Covariance matrix – a symmetric square matrix giving the covariance between

each pair of elements

Rewriting Fisher's ratio:

w̃ = argmax_v (vᵀ S_B v) / (vᵀ S_w v)

maximizing the between-class scatter S_B, and minimizing the within-class scatter S_w

w̃ – the direction which gives maximum class separability of the dataset

Maximizing this ratio leads to the generalized eigenvalue problem given by the following formula:

S_B v = µ S_w v,   i.e.   S_w⁻¹ S_B v = µ v

We need to find the eigenvalues and their corresponding eigenvectors of the matrix produced by multiplying the inverse of the within-scatter matrix with the between-scatter matrix.

After finding the eigenpairs, we must arrange the eigenvalues from largest to smallest and reorder their eigenvectors accordingly. The number of linear discriminants we will use when performing LDA depends on selecting the most important ones – those with the highest eigenvalues.
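A compact NumPy sketch of these formulas. X and y are random stand-ins for a real feature matrix and its class labels:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))               # observations x features
y = rng.integers(0, 3, size=60)            # three class labels

overall_mean = X.mean(axis=0)
n_features = X.shape[1]

S_W = np.zeros((n_features, n_features))   # within-class scatter matrix
S_B = np.zeros((n_features, n_features))   # between-class scatter matrix

for c in np.unique(y):
    X_c = X[y == c]
    mean_c = X_c.mean(axis=0)
    S_W += (X_c - mean_c).T @ (X_c - mean_c)        # sum over x in class c of (x - m_c)(x - m_c)^T
    diff = (mean_c - overall_mean).reshape(-1, 1)
    S_B += len(X_c) * (diff @ diff.T)               # N_c (m_c - m)(m_c - m)^T

# The linear discriminants are the eigenvectors of inv(S_W) @ S_B
eigenvalues, eigenvectors = np.linalg.eig(np.linalg.inv(S_W) @ S_B)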

4.4 Step-by-step Explanation of Performing LDA on the Wine-quality Dataset

The explanation consists of a practical example observed in Jupyter Notebook.

Here, we concentrate on the main steps describing the process, rather than analyzing

the code. The code itself can be fully observed in the course content.

• Exploring the data



winequality.csv

Since our data is numerical, one thing we should analyze is the range of values each column takes. In particular, the quality column shows how many different grades for wine quality we have, as well as what the actual grades are.

Our goal will be to predict the wine grades, which means that quality will be our target column.

We see that the quality ranges from 3 to 8; therefore, there are 6 distinct classes in our data.

• Transformation of the data

The data consists of 12 columns overall, 11 of which are feature columns. The

last one, called quality, is the column that represents the data labels. Our goal is to

separate the wine samples from the data and group them into clusters, according to

their quality grade.



Each row in the data represents a particular wine, while each column describes certain characteristics. We have acidity levels, pH levels, as well as alcohol content, and plenty more.

• Standardizing the data – rescaling

The quality grades have no numerical influence on the data; instead, they simply serve as labels, which is why we don't standardize them.

• Calculating the class mean – the class mean for each separate quality

grade of wine

• Reading and interpreting the results

4.5 Calculating the Within and Between-Class Scatter Matrices

We calculate the within- and between-class scatter matrices using Python in order to find the eigenvalues and eigenvectors of the product of the inverse within-scatter matrix and the between-scatter matrix, so that we can perform feature selection and lower the dimension of our dataset.



• Calculating the covariance matrix for each separate quality grade and

then summing those individual class covariance matrices up, resulting in

a within-class scatter matrix based on the formula:

S_w = Σᵢ₌₁ᶜ Sᵢ

Within-class scatter matrix

• Calculating the between-class scatter matrix

1) Finding the overall mean

2) Finding the class means

3) Applying the formula:

S_B = Σᵢ₌₁ᶜ Nᵢ (mᵢ − m)(mᵢ − m)ᵀ

Between-class scatter matrix

• Multiplying the inverse within-scatter matrix by the between-scatter matrix:

S_w⁻¹ S_B

4.6 Calculating Eigenvectors and Eigenvalues for the LDA

By finding the eigenvectors and ordering them by level of importance, we’ll

construct the linear discriminants - our new axes. These new axes will comprise our

lower-dimensional space onto which we’ll project the data points. In Python, this is a

straightforward process.

• First, by using the method .eig() of the submodule linalg in NumPy and

providing the matrix product as arguments, we can get the eigenvalues

and eigenvectors

• Next, we need to pair and sort them in descending order, according to

the eigenvalues - we divide each eigenvalue by the sum of all

eigenvalues and multiply that by 100. This will give us the percentage of

variance explained by the eigenvector associated with this eigenvalue
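A small self-contained sketch of this pairing and sorting step. Here M stands for the product inv(S_W) @ S_B from the previous sections; an arbitrary small matrix is used so the snippet runs on its own:

import numpy as np

M = np.array([[3.0, 1.0],
              [0.0, 2.0]])                          # stand-in for inv(S_W) @ S_B

eigenvalues, eigenvectors = np.linalg.eig(M)

order = np.argsort(eigenvalues)[::-1]               # sort the eigenpairs in descending order
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

explained = 100 * eigenvalues / eigenvalues.sum()   # % of variance explained per linear discriminant
print(explained)                                    # [60. 40.] for eigenvalues 3 and 2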

4.7 Analysis of LDA

To find the newly constructed axes’ significance, we must add all eigenvalues

and divide each of them separately by the overall sum to obtain the percentage of

retained variance.

We must project the data onto the discriminants that make up more than 80% of the total variance. That's where dimensionality reduction occurs – instead of the original number of features, we'll have fewer (in this case, two) new linear discriminants that correspond to the two highest eigenvalues.



In the observed practical example, the first two linear discriminants make up

around 95% of the total variance in the data. So, instead of the original eleven

features, we’ll have our two new linear discriminants that correspond to the two

highest eigenvalues. We’ve now performed dimensionality reduction.

4.8 LDA vs PCA

• Comparing dimensionality reduction on the same dataset performed by

both LDA and PCA

• Splitting the dataset into training and testing parts and applying LDA and PCA consecutively

• After the data is projected onto the linear discriminants in the case of

LDA, and onto the principal components in the case of PCA - training

and testing the classifier on these newly obtained datasets

• Timing the classifier for the training and testing part for both LDA and

PCA

• Running the functions which train and test the classifier on the projected

datasets from both dimensionality reduction techniques to compare the

average training and testing times, assessing whether the classifier is

faster when we perform LDA or PCA as a preliminary step

• Comparing the accuracy of the classifier by calculating a confusion

matrix for the results in both approaches



• Analysing the results

Analysing the results confirms that there is a much better separation of the data with the LDA plot in terms of the quality grade given.

4.9 Setting up the Classifier to Compare LDA and PCA

Why do we split the initial features dataset into training and testing

instead of making use of the already standardized dataset?

We first need to discuss the standardization tool – the Standard Scaler – and the three methods applied with it:

• .fit() – takes the dataset we aim to standardize as an argument and computes its mean and standard deviation. These will then be used in the formula to rescale each feature. In the case of the Standard Scaler, this rescaling transforms each feature to have a mean of 0 and a standard deviation of 1.

• .transform() – applies the rescaling formula to every feature and transforms it. In this case, the resulting features have a mean of zero and a standard deviation of 1.

• .fit_transform() – applies both methods – fit and transform

Answer: using .fit_transform() on the train data results in standardizing it. The method calculates the mean and standard deviation for each feature. We use .transform() on the test data only. This means that we're applying a formula where µ is the mean of the training set for each feature, while σ is the standard deviation of the training set for each feature. Thus, we apply the metrics calculated from the training set onto the test set. The testing set aims to serve as unseen data, to model real-life scenarios.
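In code, the distinction looks like this (random stand-in data with six classes, mimicking the six wine-quality grades):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 11))                  # stand-in for the 11 wine features
y = rng.integers(3, 9, size=300)                # stand-in quality grades from 3 to 8

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on the training data only, then transform it
X_test_scaled = scaler.transform(X_test)        # reuse the training mean and std on the test data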

4.10 Coding the Classifier for LDA and PCA

• creating an instance of the class LDA with two linear discriminants

• fitting and transforming the training data

• transforming the testing data

• creating an instance of the class PCA with two principal components

• fitting and transforming the training data

• transforming the testing data

• importing the support vector classifier

• training and testing the classifier, while timing the code

• measuring the prediction time of the classifier for both algorithms

• comparing the accuracy of the classifying model on both datasets (a code sketch of these steps follows below)
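A sketch of these steps with scikit-learn, continuing from the standardized train/test split above (the variable names X_train_scaled, X_test_scaled, y_train, y_test follow that sketch):

import time
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

lda = LDA(n_components=2)
X_train_lda = lda.fit_transform(X_train_scaled, y_train)   # LDA is supervised: it needs the labels
X_test_lda = lda.transform(X_test_scaled)

pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)            # PCA is unsupervised: no labels used
X_test_pca = pca.transform(X_test_scaled)

for name, (X_tr, X_te) in {"LDA": (X_train_lda, X_test_lda),
                           "PCA": (X_train_pca, X_test_pca)}.items():
    classifier = SVC()
    start = time.perf_counter()
    classifier.fit(X_tr, y_train)                          # train the support vector classifier
    predictions = classifier.predict(X_te)                 # predict on the projected test data
    elapsed = time.perf_counter() - start
    print(name, round(elapsed, 4), "s, accuracy:", accuracy_score(y_test, predictions))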



4.11 Analysis of the Training and Testing Times for the Classifier and its

Accuracy

Ideally, we’d like to time our code for both LDA and PCA and figure out how

accurate each analysis is. To make the comparison more precise, we can run the functions for training and testing PCA and LDA a total of ten times and take the average training and testing times for both functions.

Analysing the results:

The results for training times, as well as the predicting ones, are very close.

However, the real metric of importance is the classifier’s accuracy. But before that, we

need to emphasize another metric – the confusion matrix.

Confusion matrix - a square matrix that shows which samples were graded

correctly and which were not. In our case, it shows us the number of wines that we

correctly identified in class 3, as well as the number of wines, which are in class 3, but

we’ve misclassified. The same goes for the wines in class 4 through 8, as well. We use

the confusion matrix because it helps us improve our model by analysing the

misclassified data and adjusting the parameters of the model accordingly. It shows a

deeper level of measuring performance compared to a simple accuracy

measurement.

After the confusion matrix is ready, we’ll measure the accuracy by calling

"accuracy_score" on the correct and predicted quality grade values.
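Continuing from the classifier sketch above (y_test and predictions are assumed to come from it), both metrics are a single call each in scikit-learn:

from sklearn.metrics import confusion_matrix, accuracy_score

print(confusion_matrix(y_test, predictions))   # rows: true quality grades, columns: predicted grades
print(accuracy_score(y_test, predictions))     # share of correctly classified wines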

Results:

Accuracy for LDA: 63%



Accuracy for PCA: 54%

LDA outperformed PCA by 9 percentage points. As expected, LDA also did much better at preparing labelled data for classification.


Aleksandar Samsiev
Ivan Manov
Email: [email protected]