Lecture-03 - Vectors and Matrices

The lecture focuses on the mathematical foundations of machine learning, particularly vectors and matrices, and their application in linear models for prediction. It discusses how to represent training data using feature matrices and the process of learning weight vectors through matrix operations. Additionally, it covers concepts like inner and outer products, matrix factorization, and their relevance in recommendation systems.


Lecture 3: Vectors and Matrices in Machine Learning
SWCON253, Machine Learning
Won Hee Lee, PhD
Mathematical Foundations for Machine Learning
• Vectors and Matrices
• Least Squares and Geometry
• Least Squares and Optimization
• Subspaces, Bases, and Projections
• Introduction to Singular Value Decomposition
• Singular Value Decomposition in Machine Learning
Learning Goals
• Understand the fundamental concepts of vectors and matrices in machine learning

Learning: Given training data (𝑥𝑖 , 𝑦𝑖 ) for i = 1, 2, …, n, where n is the number of training samples,

𝑥𝑖 ∈ ℝ𝑝 is a vector of p real numbers called the feature vector
(i.e., 𝑥𝑖 is a list of numerical features for each sample),
and 𝑦𝑖 ∈ {−1, +1} or 𝑦𝑖 ∈ {0, 1} or 𝑦𝑖 ∈ ℝ is the label.

𝑥𝑖 (feature vector) can be written as

𝑥𝑖 = [𝑥𝑖1, 𝑥𝑖2, …, 𝑥𝑖𝑝]ᵀ ∈ ℝ𝑝

Goal: Learn a model that predicts a label ŷ given a feature vector 𝑥, i.e., from training data, learn how to predict the label ŷ for a new sample 𝑥0.

Linear Model

ŷ = weighted sum of the features

ŷ = 𝑤1 𝑥01 + 𝑤2 𝑥02 + … + 𝑤𝑝 𝑥0𝑝

where 𝑤1 , … , 𝑤𝑝 are the weights to be learned from data.
Use training data to find 𝑤1 , … , 𝑤𝑝 such that ŷ𝑖 ≈ 𝑦𝑖 for i = 1, …, n
(more precisely, we want the loss function L(ŷ𝑖 , 𝑦𝑖 ) to be small).

Let 𝑤 = [𝑤1, 𝑤2, …, 𝑤𝑝]ᵀ ∈ ℝ𝑝 be the weight vector.

Once we have two vectors, we can take their inner product:

⟨𝒘, 𝒙𝒊⟩ = 𝑤1 𝑥𝑖1 + 𝑤2 𝑥𝑖2 + ⋯ + 𝑤𝑝 𝑥𝑖𝑝

i.e., ⟨𝑤, 𝑥𝑖⟩ = 𝑤1 𝑥𝑖1 + 𝑤2 𝑥𝑖2 + ⋯ + 𝑤𝑝 𝑥𝑖𝑝 = ŷ𝑖 = prediction for the ith sample

We use these inner products to make predictions for samples, i.e., this linear model essentially corresponds to computing inner products between the weight vector and the feature vectors.
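
As a quick illustration, here is a minimal NumPy sketch of this prediction; the weight and feature values below are made up for illustration, not taken from the lecture:

import numpy as np

# Hypothetical weight vector and feature vector (p = 3), for illustration only
w = np.array([0.5, -1.0, 2.0])
x0 = np.array([1.0, 3.0, 0.5])

# Prediction = inner product of the weight vector and the feature vector
y_hat = np.dot(w, x0)      # equivalently: w @ x0
print(y_hat)               # 0.5*1.0 + (-1.0)*3.0 + 2.0*0.5 = -1.5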
Vector transpose

Using the vector transpose,

𝑥𝑖ᵀ = [𝑥𝑖1, 𝑥𝑖2, …, 𝑥𝑖𝑝]

For example, with p = 2 and 𝑤 = [−2, 1]ᵀ, when is ŷ = ⟨𝑤, 𝑥0⟩ > 0 and when is ŷ < 0?

Blue line = all points 𝑥0 where

⟨𝑤, 𝑥0⟩ = −2𝑥01 + 𝑥02 = 0

⟨𝑤, 𝑥0⟩ = 𝑤1 𝑥01 + 𝑤2 𝑥02 > 0
➔ −2𝑥01 + 𝑥02 > 0
➔ 𝑥01 < 𝑥02 / 2
➔ 2𝑥01 < 𝑥02
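
A small sketch to check this numerically, using the weight vector 𝑤 = [−2, 1]ᵀ from the example and a few made-up test points:

import numpy as np

w = np.array([-2.0, 1.0])            # weight vector from the slide example (p = 2)

# Made-up points to check on which side of the line -2*x1 + x2 = 0 they fall
points = np.array([[1.0, 3.0],       # 2*1 < 3  -> inner product positive
                   [1.0, 1.0],       # 2*1 > 1  -> inner product negative
                   [2.0, 4.0]])      # 2*2 = 4  -> exactly on the line

for x in points:
    print(x, np.dot(w, x))           # the sign of the inner product decides the side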
We might also consider ŷ𝑖 = 𝑤0 + 𝑤1 𝑥𝑖1 + 𝑤2 𝑥𝑖2 + … + 𝑤𝑝 𝑥𝑖𝑝.
We can handle this by letting the 1st element of 𝑥𝑖 be 1,
which can be done implicitly by defining:

𝑥𝑖 = [1, 𝑥𝑖1, 𝑥𝑖2, …, 𝑥𝑖𝑝]ᵀ ∈ ℝ𝑝+1,   𝑤 = [𝑤0, 𝑤1, 𝑤2, …, 𝑤𝑝]ᵀ ∈ ℝ𝑝+1

Now, our model is ŷ = ⟨𝑤, 𝑥𝑖⟩, same as before!

⟨𝑤, 𝑥𝑖⟩ = 𝑤0 + 𝑤1 𝑥𝑖1 + 𝑤2 𝑥𝑖2 + … + 𝑤𝑝 𝑥𝑖𝑝
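
A minimal sketch of this "prepend a 1" trick in NumPy; the feature values and weights here are hypothetical:

import numpy as np

# Hypothetical feature vector (p = 3) and weights including a bias term w0
x_i = np.array([2.0, -1.0, 0.5])
w = np.array([0.3, 0.5, -1.0, 2.0])      # [w0, w1, w2, w3]

x_aug = np.concatenate(([1.0], x_i))     # prepend the constant 1
y_hat = np.dot(w, x_aug)                 # = w0 + w1*x1 + w2*x2 + w3*x3
print(y_hat)                             # 0.3 + 0.5*2.0 + (-1.0)*(-1.0) + 2.0*0.5 = 3.3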
In summary,

let 𝑥𝑖 = [𝑥𝑖1, 𝑥𝑖2, …, 𝑥𝑖𝑝]ᵀ ∈ ℝ𝑝 be the feature vector.

Then, our linear model can be equivalently written as

ŷ = ⟨𝑤, 𝑥𝑖⟩ = 𝑤ᵀ𝑥𝑖 = [𝑤1, …, 𝑤𝑝] [𝑥𝑖1, 𝑥𝑖2, …, 𝑥𝑖𝑝]ᵀ = 𝑥𝑖ᵀ𝑤 = ⟨𝑥𝑖, 𝑤⟩

⟨𝑤, 𝑥𝑖⟩ = inner product of the two vectors
ᵀ = transpose
Matrix Representation

Ultimately, we need to use training data to learn the “best” weight vector.
To express this objective more compactly, we can use a matrix representation. Define:

Feature matrix (or design matrix)

X = [ 𝑥11 𝑥12 … 𝑥1𝑝 ]
    [ 𝑥21 𝑥22 … 𝑥2𝑝 ]   ∈ ℝ𝑛×𝑝     (rows index samples, columns index features)
    [  ⋮    ⋮        ⋮  ]
    [ 𝑥𝑛1 𝑥𝑛2 … 𝑥𝑛𝑝 ]

ℝ𝑛×𝑝 is the set of real matrices with n rows and p columns
xij = jth feature of the ith sample
ith row of X = the p features of the ith training sample = 𝑥𝑖ᵀ
jth column of X = feature j for all n samples

e.g., x21 = 1st feature of the 2nd training sample; x12 = 2nd feature of the 1st training sample
i.e., columns correspond to different features
Then, we can write our linear model using matrix representation as:

ŷ = [ŷ1, ŷ2, …, ŷ𝑛]ᵀ = X𝑤     (the linear model for all n samples in a single equation)

Computing X𝑤 means taking the inner product of each row of X with 𝑤 and storing the results in the vector ŷ:

ŷ = X𝑤
Note that dimensions should always match
Example: n = 3 training samples, p = 2 features

X = [ 1 0
      2 0
      0 3 ]

𝑥1 = [1, 0]ᵀ, 𝑥2 = [2, 0]ᵀ, 𝑥3 = [0, 3]ᵀ

If 𝑤 = [2, 4]ᵀ, then X𝑤 = [1·2 + 0·4, 2·2 + 0·4, 0·2 + 3·4]ᵀ = [2, 4, 12]ᵀ

Another perspective: X𝑤 is a weighted sum of the columns of X, where 𝑤 gives the weights:

X𝑤 = 2·[1, 2, 0]ᵀ + 4·[0, 0, 3]ᵀ = [2, 4, 0]ᵀ + [0, 0, 12]ᵀ = [2, 4, 12]ᵀ

X𝑤 = weighted sum of the columns (features) of X
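
Both views can be checked with a short NumPy sketch using the numbers from this example:

import numpy as np

X = np.array([[1, 0],
              [2, 0],
              [0, 3]])          # n = 3 samples, p = 2 features
w = np.array([2, 4])

# View 1: row-wise inner products <x_i, w>
y_hat = X @ w
print(y_hat)                    # [ 2  4 12]

# View 2: weighted sum of the columns of X
y_hat_cols = w[0] * X[:, 0] + w[1] * X[:, 1]
print(y_hat_cols)               # [ 2  4 12]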


• Using a linear model, we make our predictions as a weighted sum of the features.

• Remember that when we represent our features in a matrix X, each column corresponds to a different feature across all samples.

• So when we multiply the matrix X by a weight vector 𝑤 (X𝑤), we are literally computing a weighted sum of the columns (features), where the weights are given by 𝑤.

ŷ = X𝑤

• Thus, our predicted outputs ŷ are formed by taking this linear combination of the feature columns – that’s the core idea behind a linear model.
This does not look like a straight line, but linear models can still help!

If p = 4,

ŷ = X𝑤 implies
ŷ𝑖 = ⟨𝑤, 𝑥𝑖⟩ = 𝑤1·1 + 𝑤2·𝑥𝑖 + 𝑤3·𝑥𝑖² + 𝑤4·𝑥𝑖³
= a cubic polynomial that fits the training samples perfectly!
(Here the feature vector of sample i contains the powers of a single scalar input: [1, 𝑥𝑖, 𝑥𝑖², 𝑥𝑖³]ᵀ.)

A matrix with this special structure is called a Vandermonde matrix.
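
As a rough sketch of this idea (the data points below are made up, not from the lecture): a Vandermonde matrix built from 4 distinct scalar inputs is 4 × 4 and invertible, so the cubic passes through all 4 training samples exactly.

import numpy as np

# Made-up scalar inputs and labels (n = 4 points, p = 4 polynomial features)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 0.0, 3.0, 2.0])

# Vandermonde matrix: columns are 1, x, x^2, x^3
X = np.vander(x, N=4, increasing=True)

# With 4 points and 4 weights, X is square and invertible, so the cubic interpolates exactly
w = np.linalg.solve(X, y)
print(X @ w)        # reproduces y (up to floating-point error)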
Matrix-Matrix Multiplication

So far, we've looked at how to multiply a matrix by a vector.


But if we want to find the best weight vector 𝒘 that fits our training data, we'll also need to work with matrix-matrix multiplication.

Example: Recommender system

• Definition: A system that predicts user preferences and suggests relevant content
• Recommendation systems often use user-item interaction matrices
• Matrix operations help in predicting missing ratings
• Key technique: Matrix factorization (e.g., singular value decomposition)
User-Item Rating Matrix

Example matrix (users × items):


A matrix where each row represents a user and each column represents an item

           Parasite   John Wick   La La Land
User 1        5           ?            3
User 2        ?           4            2
User 3        1           ?            5

• ? → Missing ratings that need to be predicted


• Goal: Fill in missing values using matrix factorization
Matrix Factorization for Recommendations

Objective: Approximate the user-item matrix as the product of two matrices:

R ≈ U × Vᵀ

• User matrix (U): Represents user preferences (user × features)


• Item matrix (V): Represents item characteristics (item × features)

Dimensionality Reduction:
Instead of storing full ratings, store only latent features of users and items
Matrix-Matrix Multiplication

Suppose we decompose a 3 × 3 user-item matrix into:


• U (users × features)
• V (items × features)

[ 5 ? 3 ]     [ 0.8 0.6 ]
[ ? 4 2 ]  ≈  [ 0.7 0.5 ]  ×  [ 0.9 0.4 0.7 ]
[ 1 ? 5 ]     [ 0.3 0.9 ]     [ 0.5 0.8 0.6 ]
                   U                 Vᵀ

Multiply U × Vᵀ to reconstruct the original matrix

✓ Matrix-matrix multiplication is fundamental in recommendation systems.


✓ Matrix factorization helps predict missing ratings effectively.
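
A minimal NumPy sketch of this reconstruction, using the illustrative U and V numbers from the slide (they are not an actual fitted factorization, so the reconstructed values are only for demonstrating the multiplication):

import numpy as np

# Latent-factor matrices from the slide example (2 latent features)
U = np.array([[0.8, 0.6],    # user 1
              [0.7, 0.5],    # user 2
              [0.3, 0.9]])   # user 3

V = np.array([[0.9, 0.5],    # Parasite
              [0.4, 0.8],    # John Wick
              [0.7, 0.6]])   # La La Land

R_hat = U @ V.T              # reconstructed matrix, 3 users x 3 items
print(R_hat)

# Predicted score for a previously missing entry, e.g. user 1 / John Wick
print(R_hat[0, 1])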
Matrix-Matrix Multiplication X = 𝑈𝑉

Definition: Inner and outer products

• Inner product:

If u and v are column vectors of the same size, then uᵀv is the inner product of u and v.
This results in a single scalar value.

• Outer product:

If u and v are column vectors (not necessarily of the same size), then uvᵀ is the outer product of u and v.
This results in a matrix whose (i, j) entry is the product 𝑢𝑖𝑣𝑗.
Matrix-Matrix Multiplication X = 𝑈𝑉

Inner product representation

Outer product representation




Inner product

Example:

u = [𝑢1, 𝑢2, 𝑢3]ᵀ, v = [𝑣1, 𝑣2, 𝑣3]ᵀ

uᵀv = [𝑢1 𝑢2 𝑢3] [𝑣1, 𝑣2, 𝑣3]ᵀ = 𝑢1𝑣1 + 𝑢2𝑣2 + 𝑢3𝑣3

uᵀv = 0 ➔ u and v are orthogonal (perpendicular)

‖𝑢‖ = (𝑢ᵀ𝑢)^(1/2) = √(𝑢1² + 𝑢2² + 𝑢3²) ➔ norm (giving the length of the vector)

A vector is normalized if its norm equals 1, i.e., ‖𝑢‖ = 1

Two vectors that are both orthogonal and normalized ➔ orthonormal
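
A small NumPy sketch of these definitions, using two made-up vectors:

import numpy as np

u = np.array([1.0, 0.0, 1.0])
v = np.array([0.0, 2.0, 0.0])

print(np.dot(u, v))          # 0.0 -> u and v are orthogonal
print(np.linalg.norm(u))     # sqrt(2) -> length (norm) of u

# Normalizing u gives a unit vector; orthogonal + normalized = orthonormal
u_hat = u / np.linalg.norm(u)
print(np.linalg.norm(u_hat)) # 1.0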


Outer product

Example:

u = [𝑢1, 𝑢2, 𝑢3]ᵀ, v = [𝑣1, 𝑣2, 𝑣3]ᵀ

      [ 𝑢1 ]                [ 𝑢1𝑣1  𝑢1𝑣2  𝑢1𝑣3 ]
uvᵀ = [ 𝑢2 ] [𝑣1 𝑣2 𝑣3]  =  [ 𝑢2𝑣1  𝑢2𝑣2  𝑢2𝑣3 ]
      [ 𝑢3 ]                [ 𝑢3𝑣1  𝑢3𝑣2  𝑢3𝑣3 ]
Matrix-Matrix Multiplication as the Sum of Outer Products

Matrix-matrix multiplication can be interpreted as the sum of outer products between the columns of the first matrix
and the rows of the second matrix.

Example:

    [ 1 6 ]        [ 1 −1 ]         [ 13  5 ]
U = [ 3 4 ],   V = [ 2  1 ],   UV = [ 11  1 ]
    [ 5 2 ]                         [  9 −3 ]

Outer product representation:

     [ 1 ]            [ 6 ]           [ 1 −1 ]   [ 12 6 ]   [ 13  5 ]
UV = [ 3 ] [1 −1]  +  [ 4 ] [2 1]  =  [ 3 −3 ] + [  8 4 ] = [ 11  1 ]
     [ 5 ]            [ 2 ]           [ 5 −5 ]   [  4 2 ]   [  9 −3 ]
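
This interpretation is easy to verify numerically; here is a minimal NumPy sketch using the matrices from this example:

import numpy as np

U = np.array([[1, 6],
              [3, 4],
              [5, 2]])
V = np.array([[1, -1],
              [2,  1]])

# Standard matrix-matrix product
print(U @ V)

# Same result as the sum of outer products of U's columns with V's rows
outer_sum = np.outer(U[:, 0], V[0, :]) + np.outer(U[:, 1], V[1, :])
print(outer_sum)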
Further Readings
• Any linear algebra book should be fine!

• Machine Learning (기계학습, by Il-Seok Oh), Section 2.1, has been posted to e-campus


• Skip 2.1.3 “Perceptron” part for now

• Mathematics for Machine Learning (MathML)


• Chapter 2: Linear Algebra
