FL LectureNotes
Abstract
∗AJ is currently Associate Professor for Machine Learning at Aalto University (Finland). This work has been partially funded by the Academy of Finland (decision number 331197) and the European Union (grant number 952410).
Contents
1 Lecture - “Welcome and Intro”
1.1 Learning Goals
1.2 Introduction
1.3 Prerequisites
1.4 Related Courses
1.5 Main Goal of the Course
1.6 Outline of the Course
1.7 Assignments
1.8 Student Project
1.9 Schedule
1.10 Ground Rules
3.3.1 Computational Aspects of GTVMin
3.3.2 Statistical Aspects of GTVMin
3.4 Assignment
7.1 Learning Goals
7.2 Measuring (Dis-)Similarity Between Datasets
7.3 Graph Learning Methods
7.4 Assignment
Glossary
Lists of Symbols
a∈A This statement indicates that the object a is an element of the set A.
A⊆B A is a subset of B.
{0, 1} The binary set that consists of the two real numbers 0 and 1.
Matrices and Vectors
Il×d A generalized identity matrix with l rows and d columns. The entries of Il×d ∈ Rl×d are equal to 1 along the main diagonal and equal to 0 otherwise.
XT The transpose of a matrix X ∈ Rm×d. A square real-valued matrix X ∈ Rm×m is called symmetric if X = XT.
0 = (0, . . . , 0)T A vector of zero entries.
(vT, wT)T The vector of length d + d′ obtained by concatenating the entries of vector v ∈ Rd with the entries of w ∈ Rd′.
Probability Theory
Machine Learning
X The feature space X is the set of all possible values that the features x of a data point can take on.
x(r) The feature vector of the rth data point within a dataset.
xj(r) The jth feature of the rth data point within a dataset.
x(r) , y (r) The features and label of the rth data point.
λ A regularization parameter that controls the amount of regularization.
w = (w1, . . . , wd)T A parameter vector whose entries are parameters of a model. These parameters could be feature weights in linear maps, the weights in ANNs, or the thresholds used for splits in decision trees.
ϕ(·) A feature map ϕ : X → X′ : x ↦ x′ := ϕ(x) ∈ X′.
Federated Learning
x(i,r) The features of the r-th data point in the local dataset D(i) .
y (i,r) The label of the r-th data point in the local dataset D(i) .
1 Lecture - “Welcome and Intro”
Welcome to the course CS-E4740 Federated Learning. This course can be
completed fully remotely. Any on-site event will be recorded and made available
to students via this YouTube channel. The basic variant (5 credits) of this
course consists of lectures (schedule here) and corresponding coding assign-
ments (schedule here). We test your completion of the coding assignments
via quizzes (implemented on the MyCourses page). You can upgrade the
course to an extended variant (10 credits) by completing a student project
(see Section 1.8).
1.2 Introduction
via co-morbidity networks [11]. Social science uses notions of acquaintance to relate data collected from befriended individuals [12].
Federated learning (FL) is an umbrella term for distributed optimization
techniques to train machine learning (ML) models from decentralized collec-
tions of local datasets [13–17]. These methods carry out computations, such
as gradient steps (see Lecture 4), for ML model training at the location of
data generation. This design philosophy differs from the naive application of ML techniques, which first collects all local datasets at a single location (computer) and then feeds the pooled data into a conventional ML method such as linear regression.
The distributed training of ML models, at locations close to the actual
data generation, can be beneficial for several reasons [18]:
phones that can communicate via radio links. This parallel computer allows us to speed up computational tasks such as the computation of gradients required to train ML models (see Lecture 4).
1.3 Prerequisites
algebraic and geometric structure of Rd . By algebraic structure, we mean
the (real) vector space obtained from the elements (“vectors”) in Rd along
with the usual definitions of vector addition and multiplication by scalars
in R [23, 24]. We will make heavy use of concepts from linear algebra to
represent and manipulate data and ML models.
The metric structure of Rd will be used to study the (convergence) be-
haviour of FL algorithms. In particular, we will study FL algorithms that
are obtained as fixed-point iterations of some non-linear operator on Rd
which depends on the data (distribution) and ML models used within an FL system. A prime example of such a non-linear operator is the gradient step
of gradient-based methods (see Lecture 4). The computational properties
(such as convergence speed) of these FL algorithms can then be characterized
via the contraction properties of the underlying operator [25].
A main tool for the design of FL algorithms are variants of gradient descent (GD). These gradient-based methods are based on approximating a
differentiable function f (x) locally by a linear function given by the gradient
∇f (x). We therefore expect some familiarity with multivariable calculus [5].
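These gradient-based ideas can be sketched in a few lines of Python; the quadratic objective, step size, and iteration count below are illustrative choices, not taken from the course:

```python
import numpy as np

def gradient_descent(grad, w0, alpha=0.1, num_steps=100):
    """Iterate the basic gradient step w <- w - alpha * grad(w)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(num_steps):
        w = w - alpha * grad(w)
    return w

# Example: f(w) = ||w - c||^2 has gradient 2*(w - c) and minimizer c.
c = np.array([1.0, -2.0])
w_hat = gradient_descent(lambda w: 2 * (w - c), w0=np.zeros(2))
```

For this smooth convex example, the iterates contract towards the minimizer c at a fixed rate determined by the step size.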
In what follows we briefly explain how this course CS-E4740 relates to selected
courses at Aalto University.
models for local datasets. This coupling is required to adaptively pool local datasets to obtain a sufficiently large training set for the personalized ML model.
• ELEC-E5424 - Convex Optimization. This course teaches advanced
optimization theory for the important class of convex optimization
problems [28]. Convex optimization theory and methods can be used
for the study and design of FL algorithms.
V := {1, . . . , n}.

min_{w(i)} Σ_{i∈V} Li(w(i)) + λ Σ_{{i,i′}∈E} Ai,i′ d(w(i), w(i′)).   (1.1)
The optimization variables w(i) in (1.1) are the local model parameters at the nodes i ∈ V of an empirical graph. The objective function in (1.1) consists of two components: The first component is the sum, over all nodes i ∈ V, of the loss values Li(w(i)) incurred by the local model parameters at each node i. The second component is the sum of the variations of local model parameters across the edges {i, i′} of the empirical graph.
regularized empirical risk minimization (RERM) which we refer to as GTVMin. GTVMin uses the variation of personalized model parameters across edges in the empirical graph as a regularizer. We will see that
GTVMin couples the training of tailored (or “personalized”) ML models
such that well-connected nodes (clusters) in the empirical graph will
obtain similar trained models. Lecture 4 discusses variations of gradient
descent as our main algorithmic toolbox for solving GTVMin. Lecture
5 shows how FL algorithms can be obtained in a principled fashion
by applying optimization methods, such as gradient-based methods,
to GTVMin. We will obtain FL algorithms that can be implemented
as iterative message passing methods for the distributed training of
tailored (“personalized”) models. Lecture 6 derives some main flavours
of FL as special cases of GTVMin. The usefulness of GTVMin crucially
depends on the choice for the weighted edges in the empirical graph.
Lecture 7 discusses graph learning methods that determine a useful
empirical graph via different notions of statistical similarity between
local datasets.
1.7 Assignments
The course will consist of assignments, each covering the topics of a correspond-
ing lecture. Each assignment requires you to implement the concepts discussed
in the corresponding lecture using Python. After solving the assignment, you
can answer MyCourses quizzes.
1.8 Student Project
You can extend the basic variant (which is worth 5 credits) to 10 credits by
completing a student project and peer review. This project requires you to
formulate an application of your choice as a FL problem using the concepts
from this course. You then have to solve this FL problem using the FL
algorithms taught in this course. The main deliverable will be a project
report which must follow the structure indicated in the template. You will
then peer-review the reports of your fellow students by answering a detailed
questionnaire.
1.9 Schedule
The course lectures are held on Mondays and Wednesdays at 16.15, from 28-Feb-2024 until 30-Apr-2024. You can find the detailed schedule and lecture halls following this link. As the course can be completed fully remotely, we will
record each lecture and add the recording to the YouTube playlist here in a
timely fashion.
After each lecture, we will release the corresponding assignment at this
site. You will then have at least one week to work on the assignment before
we open the corresponding quiz on the MyCourses page of the course (click
me).
Note that as a student following this course, you must act according to the
Code of Conduct of Aalto University. In particular, the main ground rules
for this course are:
2 Lecture - “ML Basics”
This lecture covers basic ML techniques that are instrumental for FL. This lecture is significantly more extensive than the following lectures. However, it should be relatively easy to follow, as it mainly refreshes prerequisite knowledge.
• be familiar with the concept of data points (their features and labels),
model and loss function,
During this course we will focus mainly on one specific choice for the data
points. In particular, we will consider data points that represent the daily
weather condition around a weather station of the Finnish Meteorological
Institute (FMI). We denote a specific data point by z. It is characterized by
the following features:
• latitude lat and longitude lon of the weather station, e.g., lat := 60.37788,
lon := 22.0964,
As our notation indicates (using the symbol “∈” instead of “:=”), there
might be several different solutions to the optimization problem (2.1). Unless
specified otherwise, ĥ can be used to denote any hypothesis in H that has
minimum average loss over D.
Many machine learning (ML) methods employ a parameterized model H where each hypothesis h ∈ H is defined by a parameter vector w ∈ Rd. A prominent instance of such a parameterized model is the linear model [4, Sec. 3.1],

H(d) := { h(x) := wT x : w ∈ Rd }.   (2.2)
Note that (2.3) amounts to finding the minimum of a smooth and convex function

f (w) = (1/m) ( wT XT Xw − 2yT Xw + yT y )   (2.4)

with the feature matrix

X := (x(1), . . . , x(m))T   (2.5)

and the label vector

y := (y(1), . . . , y(m))T   (2.6)

of the training set D.

ŵ(LR) ∈ argmin_{w∈Rd} wT Qw + wT q,  with Q := (1/m)XT X and q := (−2/m)XT y.   (2.7)
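As a sketch of how the quadratic form in (2.7) can be solved numerically (synthetic data; rearranging the zero-gradient condition 2Qw + q = 0 is a standard fact about convex quadratics and assumes Q is invertible):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 50, 3
X = rng.normal(size=(m, d))          # feature matrix, rows are x^(r)
w_true = np.array([1.0, -0.5, 2.0])
y = X @ w_true                        # noiseless labels, for illustration only

# Quadratic form of (2.7): Q = (1/m) X^T X, q = (-2/m) X^T y.
Q = (X.T @ X) / m
q = (-2 / m) * (X.T @ y)

# Zero gradient: 2 Q w + q = 0  =>  w = -(1/2) Q^{-1} q.
w_hat = np.linalg.solve(Q, -q / 2)
```

With noiseless labels the minimizer recovers w_true exactly (up to floating-point error).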
To train a ML model H means to solve ERM (2.1) (or (2.3) for linear
regression); the dataset D is therefore referred to as a training set. The trained
[Figure: the convex quadratic objective wT Qw + wT q, with minimizer ŵ(LR) marked on the horizontal axis w.]
which are hopefully increasingly accurate approximations to a solution ĥ of
(2.1). The computational complexity of such a ML method can be measured
by the number of iterations required to guarantee some prescribed level of
approximation.
For a parameterized model and a smooth loss function, we can solve (2.3) by gradient-based methods: Starting from initial parameters w(0), we iterate the gradient step

w(k) = w(k−1) + (2α/m) Σ_{r=1}^m x(r) ( y(r) − (w(k−1))T x(r) ).   (2.8)
How much computation do we need for one iteration of (2.8)? How many iterations do we need? We will try to answer the latter question in Lecture 4. The first question can be answered more easily for typical computational infrastructure (e.g., “Python running on a commercial laptop”). Indeed, a naive evaluation of (2.8) requires around m arithmetic operations (addition, multiplication).
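The per-iteration cost can be made concrete with a small NumPy sketch of the gradient step (2.8); the synthetic data and the step size α are illustrative choices:

```python
import numpy as np

def gradient_step(w, X, y, alpha):
    """One iteration of (2.8): w + (2*alpha/m) * sum_r x^(r) (y^(r) - w^T x^(r)).

    The two matrix-vector products below need on the order of m*d
    arithmetic operations."""
    m = X.shape[0]
    residual = y - X @ w              # the m residuals y^(r) - w^T x^(r)
    return w + (2 * alpha / m) * (X.T @ residual)

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
w_true = np.array([0.5, -1.0])
y = X @ w_true

w = np.zeros(2)
for _ in range(500):
    w = gradient_step(w, X, y, alpha=0.1)
```

For this well-conditioned toy problem the iterates converge to the ERM solution.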
It is instructive to consider the special case of a linear model which does not use any feature, i.e., h(x) = w. For this extreme case, the ERM (2.3) has a simple closed-form solution:

ŵ = (1/m) Σ_{r=1}^m x(r).   (2.9)

Thus, for this special case of the linear model, computing (2.9) amounts to summing m numbers x(1), . . . , x(m). It seems reasonable to assume that the amount of computation required to compute (2.9) is proportional to m.
2.4 Statistical Aspects of ERM
to an arbitrary data point with label y and features x that is not contained in the training set. What can we say about the resulting prediction error y − h(ŵ)(x) in general? In other words, how well does h(ŵ) generalize beyond the training set?
Maybe the most widely used approach to study generalization of ML
methods is via a probabilistic perspective. Here, we interpret each data point
as a realization of an i.i.d. RV with probability distribution p(x, y). Under
this i.i.d. assumption, we can evaluate the overall performance of a hypothesis
h ∈ H via the expected loss (or risk)
A simple calculation reveals the expected squared error loss of a given linear hypothesis h(x) = xT ŵ as

E{(y − h(x))²} = ∥w − ŵ∥₂² + σ².   (2.12)
The first component of the RHS in (2.12) is the estimation error ∥w − ŵ∥₂² of a ML method that reads in the training set and delivers an estimate ŵ (e.g., via (2.3)) for the parameters of a linear hypothesis.
We next study the estimation error w − ŵ incurred by the specific estimate ŵ = ŵ(LR) (2.7) delivered by linear regression methods. To this end, we first use the probabilistic model (2.11) to decompose the label vector y in (2.6) as

y = Xw + n,  with n := (ε(1), . . . , ε(m))T.   (2.13)
ŵ(LR) ∈ argmin_{w∈Rd} wT Qw + wT q′ + wT e   (2.14)

∥ŵ(LR) − w∥₂ ≤ ∥e∥₂ / λmin(Q).   (2.16)
Note that the matrix Q ∈ Rd×d is psd and therefore its eigenvalues are all real-valued and non-negative [24]. Moreover, since we assume Q is invertible, they are strictly positive and, in turn, λmin(Q) > 0.
¹Can you think of sufficient conditions on the feature matrix of the training set that ensure Q = (1/m)XT X is invertible?
[Figure: the quadratic objectives wT Qw + wT (q′ + e) and wT Qw + wT q′, with their respective minimizers ŵ(LR) and w.]
1. gather a dataset and choose a model H
2. split dataset into a training set D(train) and a validation set D(val)
We can also use the performance of human experts as a baseline. If we want to develop a ML method that detects certain types of skin cancer from images of the skin, a benchmark might be the current classification accuracy achieved by experienced dermatologists [30].
We can diagnose a ML method by comparing the training error Et with
the validation error Ev and (if available) the benchmark E (ref) .
than the baseline. There can be several reasons for this to happen. First, it might be that the hypothesis space is too small, i.e., it does not include a hypothesis that provides a good approximation for the relation between features and label of a data point. One remedy to this situation is to use a larger hypothesis space, e.g., by including more features in a linear model, using higher polynomial degrees in polynomial regression, or using deeper decision trees or deep ANNs (deep nets). Second, besides the model being too small, another reason for a large training error could be that the optimization algorithm used to solve ERM (2.17) is not working properly (see Lecture 4).
Whenever the data points behave differently than the realizations of i.i.d. RVs, or if the size of the training set or validation set is too small, the interpretation (and comparison) of the training error and the validation error of a learnt hypothesis becomes more difficult. As an extreme case, the validation set might consist of data points for
which every hypothesis incurs small average loss. Here, we might try
to increase the size of the validation set by collecting more labeled
data points or by using data augmentation (see Section 2.6). If the
size of training set and validation set are large but we still obtain
Et ≫ Ev , one should verify if data points in these sets conform to the
i.i.d. assumption. There are principled statistical tests for the validity of
the i.i.d. assumption for a given dataset (see [31] and references therein).
2.6 Regularization
• collect more data points, possibly via data augmentation (see Fig. 3),
• add a penalty term λR(h) to the average loss in ERM (2.1) (see Fig. 3),
regularizer R(h) := ∥w∥₂² for a linear hypothesis h(x) := wT x. Thus, ridge
The objective function in (2.18) is also obtained if we replace each data point (x, y) ∈ D by a sufficiently large number of i.i.d. realizations of
[Fig. 3: data augmentation enlarges the original training set D (in the feature x vs. label y plane, with hypothesis h(x)); the regularized objective is (1/m) Σ_{r=1}^m L((x(r), y(r)), h) + λR(h).]
We can rewrite (2.18) as

ŵ(ridge) ∈ argmin_{w∈Rd} wT Qw + wT q.
Thus, like linear regression (2.7), also ridge regression minimizes a convex
quadratic function. A main difference between linear regression (2.7) and
ridge regression (for λ > 0) is that the matrix Q in (2.20) is guaranteed to
be invertible for any training set D. In contrast, the matrix Q in (2.7) for
linear regression might be singular for some training sets.2
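A small numerical check of this invertibility claim, assuming (as the comparison with (2.7) suggests) that the ridge matrix Q in (2.20) adds λI to (1/m)XT X; the extreme all-zero feature matrix is the case from footnote 2:

```python
import numpy as np

# Extreme case from footnote 2: all features are zero, so X^T X is singular.
m, d = 10, 3
X = np.zeros((m, d))
lam = 0.5

Q_lr = (X.T @ X) / m                        # matrix Q of linear regression (2.7)
Q_ridge = (X.T @ X) / m + lam * np.eye(d)   # assumed ridge matrix: add lambda*I

lr_singular = np.linalg.matrix_rank(Q_lr) < d
ridge_invertible = np.linalg.matrix_rank(Q_ridge) == d
```

Adding λI (with λ > 0) shifts every eigenvalue up by λ, so the ridge matrix is invertible no matter how degenerate the training set is.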
2.7 Assignment
1. generate numpy arrays, whose r-th row holds the features x(r) and label
y (r) , respectively, of the r-th data point in the csv file.
²Consider the extreme case where all features of each data point in the training set D are zero.
2. Split the dataset into a training set and validation set. The size of the
training set should be 100.
5. Train and validate a linear model for different choices for the maximal
polynomial degree used in the previous feature augmentation step.
6. Using a fixed value for the polynomial degree for the feature augmen-
tation step, train and validate a linear model using ridge regression
(2.18) via the Ridge class. For each choice of λ in (2.18), determine the
resulting training error and validation error.
3 Lecture - “FL Design Principle”
Lecture 2 reviewed ML methods that use numeric arrays to store data and model parameters. We have also discussed ERM as a design principle for practical ML systems. This lecture will extend these concepts to FL applications.
Section 3.2 introduces empirical graphs to store collections of local datasets
and corresponding parameters of local models. Section 3.3 presents our main
design principle for FL systems. This principle uses the variation of local
model parameters across the edges of an empirical graph for the coupling (or
regularization) of the individual local models.
[Figure: two nodes of an empirical graph, carrying local datasets and model parameters (D(i), w(i)) and (D(i′), w(i′)), connected by an edge with weight Ai,i′.]
Here, x(i,r) and y (i,r) denote, respectively, the features and the label of the
rth data point in the local dataset D(i) . Note that the size mi of the local
dataset might vary between different nodes i ∈ V.
It is convenient to collect the feature vectors x(i,r) and labels y(i,r) into a feature matrix X(i) and label vector y(i), respectively,

X(i) := (x(i,1), . . . , x(i,mi))T,  and y(i) := (y(i,1), . . . , y(i,mi))T.   (3.2)
The local dataset D(i) can then be represented compactly by the matrix
X(i) ∈ Rmi ×d and the vector y(i) ∈ Rmi .
Besides its local dataset D(i), each node i ∈ G also carries a local model H(i). Within this course, we focus on local models that are parametrized by local model parameters w(i) ∈ Rd, for i = 1, . . . , n. The usefulness of a
specific choice for the local model parameters w(i) is measured by a local loss function Li(w(i)), for i = 1, . . . , n.
Since the matrix L is psd, all its eigenvalues are real-valued and non-negative.
We denote its increasingly ordered eigenvalues by
0 ≤ λ1 ≤ λ2 . . . ≤ λn . (3.5)
According to (3.4), we can measure the total variation of local model parameters by stacking them into a single vector w ∈ Rnd and computing the quadratic form wT Lw.
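A small NumPy sketch of this quadratic form (the toy graph and parameter vectors are illustrative; the Laplacian is built as L = D − A and the stacked form uses the Kronecker product L ⊗ I, as in (3.15)):

```python
import numpy as np

# Toy empirical graph with n = 3 nodes and unit-weight edges {1,2}, {2,3}.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
L = np.diag(A.sum(axis=1)) - A        # graph Laplacian L = D - A

d = 2
W = np.array([[1.0, 0.0],             # w^(1)
              [1.0, 0.0],             # w^(2)
              [3.0, 4.0]])            # w^(3)
w = W.reshape(-1)                     # stacked vector in R^{n d}

# Total variation via the quadratic form w^T (L kron I_d) w ...
tv = w @ np.kron(L, np.eye(d)) @ w

# ... equals the sum of squared parameter differences over the edges.
tv_direct = (np.linalg.norm(W[0] - W[1]) ** 2
             + np.linalg.norm(W[1] - W[2]) ** 2)
```

Identical parameter vectors at all nodes make this quadratic form zero, which is exactly the eigenvector property discussed next.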
One immediate consequence of (3.4) is that any collection of identical local model parameters, w(i) = w(i′), results in an eigenvector

c = ( (w(1))T, . . . , (w(n))T )T.   (3.6)
Here, m = (1/n) Σ_{i=1}^n w(i) is the average of all local model parameters. The quantity Σ_{i=1}^n ∥w(i) − m∥₂² has a geometric interpretation: It is the squared
[Figure: two example empirical graphs with node sets {1, 2, 3, 4}.]

{ (aT, . . . , aT)T : a ∈ Rd } ⊆ Rdn.   (3.8)
The edge weights are chosen as A′i,i′ = Ai,i′ for any edge {i, i′} ∈ E and A′i,i′ = 0 otherwise.
Note that the undirected edges E of an empirical graph encode a symmetric
notion of similarity between local datasets: If the local dataset D(i) at node i
′
is similar to the local dataset D(i ) at node i′ , i.e., {i, i′ } ∈ E, then also the
′
local dataset D(i ) is similar to the local dataset D(i) .
3.3 Generalized Total Variation Minimization
Consider data with empirical graph G whose nodes i ∈ V carry local datasets D(i) and local models parametrized by the vectors w(i). To learn these parameter vectors, we try to minimize their local loss and, at the same time, enforce a small total variation. The optimal balance is obtained by solving the following optimization problem, which we refer to as generalized total variation (GTV) minimization,
{ŵ(i)}_{i∈V} ∈ argmin_{{w(i)}} Σ_{i∈V} Li(w(i)) + λ Σ_{i,i′∈V} Ai,i′ ∥w(i) − w(i′)∥₂²   (GTVMin).   (3.9)
Note that GTVMin is an instance of RERM: The regularizer is the total variation of local model parameters over the weighted edges Ai,i′ of the empirical graph. Clearly, the empirical graph is an important design choice for GTVMin-based methods. This choice can be guided by computational aspects and statistical aspects of GTVMin-based FL systems.
Some application domains allow us to leverage domain expertise to guess a useful choice for the empirical graph. If local datasets are generated at different geographic locations, we might use nearest-neighbor graphs based on geodesic distances between data generators (e.g., FMI weather stations). Lecture 7 will also discuss graph learning methods that determine the edge weights Ai,i′ in a fully data-driven fashion.
Let us now consider the special case of GTVMin with local models being
a linear model. For each node i ∈ V of the empirical graph, we want to learn
T
the parameters w(i) of a linear hypothesis h(i) (x) := w(i) x. We measure
the quality of the weights via the average squared error loss

Li(w(i)) := (1/mi) Σ_{r=1}^{mi} ( y(i,r) − (w(i))T x(i,r) )²
(3.2)
= (1/mi) ∥y(i) − X(i) w(i)∥₂².   (3.10)
{ŵ(i)}_{i∈V} ∈ argmin_{{w(i)}} Σ_{i∈V} (1/mi) ∥y(i) − X(i) w(i)∥₂² + λ Σ_{i,i′∈V} Ai,i′ ∥w(i) − w(i′)∥₂².   (3.11)
{ŵ(i)}_{i=1}^n ∈ argmin_{{w(i)}} Σ_{i=1}^n (1/mi) ∥y(i) − X(i) w(i)∥₂² + λ wT Lw,  with w = stack{w(i)}_{i∈V}.   (3.12)
with Q(i) := (1/mi) (X(i))T X(i), and q(i) := (−2/mi) (X(i))T y(i).
Thus, like linear regression (2.7) and ridge regression (2.20), also GTVMin
(3.12) (for local linear models H(i) ) minimizes a convex quadratic function,
min_{w} wT Qw + qT w.   (3.14)
Here, we used the psd matrix

Q := diag( Q(1), Q(2), . . . , Q(n) ) + λ L ⊗ I,  with Q(i) := (1/mi) (X(i))T X(i),   (3.15)

and the vector

q := ( (q(1))T, . . . , (q(n))T )T,  with q(i) := (−2/mi) (X(i))T y(i).   (3.16)
prox_{L,ρ}(w) := argmin_{w′} L(w′) + (ρ/2) ∥w − w′∥₂²,  for some ρ > 0.
Some authors refer to functions L for which proxL,ρ (w) can be computed
easily as simple or proximable [32]. GTVMin with proximable loss functions
can be solved quite efficiently via proximal algorithms [33].
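As a hypothetical example of a proximable loss, the squared error L(w′) := ∥w′ − a∥₂² admits a closed-form proximal operator, obtained by setting the gradient of the prox objective to zero:

```python
import numpy as np

def prox_squared_error(w, a, rho):
    """prox_{L,rho}(w) for the loss L(w') = ||w' - a||_2^2.

    Zero gradient: 2(w' - a) + rho*(w' - w) = 0
                => w' = (2a + rho*w) / (2 + rho).
    """
    return (2 * a + rho * w) / (2 + rho)

w = np.array([0.0, 0.0])
a = np.array([4.0, -2.0])
p = prox_squared_error(w, a, rho=2.0)
```

For this choice the prox point is a convex combination of w and the loss minimizer a, which is why such losses count as "simple".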
Besides influencing the choice of optimization method, the design choices underlying GTVMin also determine the amount of computation needed by a given optimization method. For example, using an empirical graph with relatively few edges (“sparse graphs”) typically results in a smaller computational complexity. Indeed, Lecture 5 discusses GTVMin-based algorithms requiring
an amount of computation that is proportional to the number of edges in the
empirical graph.
Let us now consider the computational aspects of GTVMin (3.11) to train local linear models. As discussed above, this instance is equivalent to solving (3.14). Any solution ŵ of (3.14) is characterized by the zero-gradient condition

Qŵ = −(1/2)q.   (3.17)
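A minimal sketch of solving the zero-gradient condition (3.17) for a two-node empirical graph with local linear models (all data synthetic; Q and q are assembled from the blocks in (3.15) and (3.16)):

```python
import numpy as np

# Two nodes joined by a single unit-weight edge; local linear models with d = 2.
rng = np.random.default_rng(2)
d, m = 2, 30
lam = 0.1

X1, X2 = rng.normal(size=(m, d)), rng.normal(size=(m, d))
w1_true, w2_true = np.array([1.0, 0.0]), np.array([1.2, 0.1])
y1, y2 = X1 @ w1_true, X2 @ w2_true

# Blocks of (3.15) and (3.16).
Q1, Q2 = X1.T @ X1 / m, X2.T @ X2 / m
q = np.concatenate([(-2 / m) * X1.T @ y1, (-2 / m) * X2.T @ y2])

Lap = np.array([[1., -1.], [-1., 1.]])    # Laplacian of the 2-node graph
Q = np.block([[Q1, np.zeros((d, d))],
              [np.zeros((d, d)), Q2]]) + lam * np.kron(Lap, np.eye(d))

# Zero-gradient condition (3.17): Q w = -(1/2) q.
w_hat = np.linalg.solve(Q, -q / 2)
w1_hat, w2_hat = w_hat[:d], w_hat[d:]
```

For λ > 0 the two estimates are pulled towards each other, as the coupling term in GTVMin intends.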
The empirical graph should contain a sufficient number of edges between nodes
that carry statistically similar local datasets. This allows regularization
techniques to adaptively pool local datasets into clusters of (approximately)
homogeneous data (see Section 6.3).
Statistically, we want the loss function to favour local model parameters that result in a robust and accurate trained local model, with model parameters ŵ(i), for each node i ∈ V.
3.4 Assignment
connect each FMI station i to its nearest neighbors i′. All edges {i, i′} ∈ E have the same edge weight Ai,i′ = 1.
For each station i ∈ V, you need to learn the single parameter w(i) ∈ R of a hypothesis h(x) = w(i) that predicts the temperature. We measure the quality of a hypothesis by the average squared error loss Li(w(i)) = (1/mi) Σ_{r=1}^{mi} ( y(i,r) − w(i) )². You should learn the parameters w(i) via balancing
4 Lecture - “Gradient Methods”
Lecture 3 introduced GTVMin as a central design principle for FL methods.
Several important instances of GTVMin amount to minimizing a smooth
objective function over (a subset of) the parameter space Rd . This lecture
discusses gradient-based methods, a widely used family of iterative algorithms for minimizing a smooth function. These methods share a core idea:
approximate the objective function locally using its gradient at the current
choice for the model parameters. Lecture 5 discusses FL algorithms obtained from a direct application of gradient-based methods to solving GTVMin.
(2.3) of linear regression. A gradient step updates a current choice for w(curr)
along the opposite direction of the gradient ∇f (w) at the current choice,
The gradient step (4.1) involves the factor α, which is referred to as the step size or learning rate.
The usefulness of gradient-based methods depends crucially on the difficulty of evaluating the gradient. Evaluating the gradient of a given function has been made convenient by modern software libraries (such as PyTorch) that provide quite efficient methods for computing the gradient (autograd/backprop/...). However, besides the actual computation of the gradient, it might be challenging to gather the required data points which define the objective function (empirical risk).
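Independently of which library computes gradients, a finite-difference comparison is a cheap sanity check. A sketch for the linear regression empirical risk f(w) = (1/m)∥y − Xw∥₂² (pure NumPy; this is a numerical check, not the autograd machinery mentioned above):

```python
import numpy as np

rng = np.random.default_rng(3)
m, d = 20, 4
X, y = rng.normal(size=(m, d)), rng.normal(size=m)
w = rng.normal(size=d)

def f(w):
    """Empirical risk of linear regression."""
    return np.sum((y - X @ w) ** 2) / m

# Analytic gradient of the empirical risk.
grad_analytic = (-2 / m) * X.T @ (y - X @ w)

# Central finite differences, one coordinate at a time.
eps = 1e-6
grad_fd = np.zeros(d)
for j in range(d):
    e = np.zeros(d)
    e[j] = eps
    grad_fd[j] = (f(w + e) - f(w - e)) / (2 * eps)
```

Agreement between the two vectors (up to discretization and rounding error) indicates the analytic gradient is implemented correctly.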
Algorithm 1 summarizes the most basic instance of gradient-based methods.
4.3 Hyperparameter of gradient-based methods
4.5 Constraints
4.6 Assignment
5 Lecture - “FL Algorithms”
This lecture applies the gradient-based methods from Lecture 4 to solve
GTVMin from Lecture 3. The resulting FL algorithms can be implemented
by message passing over the edges of the empirical graph.
• be able to derive the gradient for GTVMin with local linear models.
5.4 Assignment
6 Lecture - “FL Main Flavors”
Lecture 3 discussed GTVMin as a main design principle for FL algorithms
that have been obtained in Lecture 5 by applying some of the gradient-based
methods from Lecture 4. This lecture discusses some important special cases of GTVMin that are obtained for specific choices of the underlying empirical graph.
After this lecture, you should know about the following main flavours of FL:
• centralized FL
• clustered FL
• horizontal FL
• vertical FL
6.2 Centralized FL
6.3 Clustered FL
Many applications generate local datasets which do not carry sufficient statistical power to guide the learning of model parameters w(i) (see Section ??). As a case in point, consider a local dataset D(i) of the form (3.1), with feature vectors x(r) ∈ Rd and mi ≪ d. We would like to learn the parameter vector w(i) of a linear hypothesis h(x) = xT w(i).
6.4 Horizontal FL
6.5 Vertical FL
6.6 Assignment
7 Lecture - “Graph Learning”
Lecture 3 discussed GTVMin as a main design principle for FL algorithms.
The computational and statistical properties of these algorithms crucially
depend on the choice for the empirical graph. In some applications, domain
expertise can guide the choice for the empirical graph. However, it might
be useful to learn the empirical graph in a data-driven fashion. This lecture
discusses some of these graph learning techniques.
The discrepancy (or lack of similarity) between local datasets D(i) and D(i′) could then be defined via the Euclidean distance

d(i,i′) := ∥w(i) − w(i′)∥₂,
min_{Ai,i′} Σ_{i,i′} Ai,i′ d(i,i′).   (7.1)
The constraints (7.2) require that each node i is connected with other nodes using total edge weight Σ_{i′≠i} Ai,i′ = dmax. We can interpret the parameter
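One simple way to obtain edge weights that respect such a budget (an illustrative rule, not necessarily the course's method) is to connect each node to the dmax neighbors with smallest discrepancy, using unit weights:

```python
import numpy as np

# Discrepancies d^(i,i') between n = 4 local datasets (symmetric, zero diagonal).
D = np.array([[0.0, 0.1, 0.9, 0.8],
              [0.1, 0.0, 0.7, 0.9],
              [0.9, 0.7, 0.0, 0.2],
              [0.8, 0.9, 0.2, 0.0]])
n, dmax = D.shape[0], 1

A = np.zeros((n, n))
for i in range(n):
    # Connect node i to its dmax nearest neighbors (excluding itself).
    order = np.argsort(D[i])
    neighbors = [j for j in order if j != i][:dmax]
    for j in neighbors:
        A[i, j] = A[j, i] = 1.0     # symmetrize the edge set
```

Small discrepancies produce edges, so statistically similar local datasets end up coupled in the resulting empirical graph.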
learning principle,
7.4 Assignment
8 Lecture - “Trustworthy FL”
This lecture discusses some key requirements for trustworthy AI that have been
put forward by the European Union. We will also see how these requirements
might guide the design choices for GTVMin. Our focus will be on three design criteria: robustness, privacy protection, and explainability. This lecture discusses the robustness and explainability of the basic linear regression that we encountered in Lecture 2. We will see that regularization techniques allow us to navigate robustness-explainability-accuracy trade-offs.
We can use the upper bound (2.16) to study the effect of perturbing
features and labels of data points.
9 Lecture - “Privacy-Protection in FL”
FL is inherently based on sharing information. Without any information
sharing between the owners (or generators) of local datasets, FL is not
possible.
9.2 Assignment
10 Lecture - “Data and Model Poisoning in FL”
This lecture discusses the robustness of FL systems against data poisoning, a specific type of cyber attack.
10.4 Assignment
Glossary
activation function Each artificial neuron within an ANN consists of an
activation function that maps the inputs of the neuron to a single output
value. In general, an activation function is a non-linear map of the
weighted sum of neuron inputs (this weighted sum is the activation of
the neuron).
Bayes risk We use the term Bayes risk as a synonym for the risk or expected
loss of a hypothesis. Some authors reserve the term Bayes risk for the
risk of a hypothesis that achieves minimum risk, such a hypothesis being
referred to as a Bayes estimator [29].
bias Consider some unknown quantity w̄, e.g., the true weight in a linear model y = w̄x + e relating feature and label of a data point. We might use an ML method (e.g., based on ERM) to compute an estimate ŵ for w̄ based on a set of data points that are realizations of RVs. The (squared) bias incurred by the estimate ŵ is typically defined as B² := (E{ŵ} − w̄)². We extend this definition to vector-valued quantities using the squared Euclidean norm B² := ∥E{ŵ} − w̄∥₂².
covariance matrix E{ (x − E{x})(x − E{x})T }.
data point A data point is any object that conveys information [36]. Data
points might be students, radio signals, trees, forests, images, RVs, real
numbers or proteins. We characterize data points using two types of
properties. One type of property is referred to as a feature. Features
are properties of a data point that can be measured or computed in an
automated fashion. Another type of property is referred to as labels.
The label of a data point represents some higher-level fact (or quantity
of interest). In contrast to features, determining the label of a data point
typically requires human experts (domain experts). Roughly speaking,
ML aims at predicting the label of a data point based solely on its
features.
dataset With a slight abuse of notation, we use the terms “dataset” or “set of data points” to refer to an indexed list of data points z(1), z(2), . . .. Thus, there is a first data point z(1), a second data point z(2), and so on. Strictly speaking, a dataset is a list and not a set [39]. By using indexed lists of data points we avoid some of the challenges arising from the concept of an abstract set.
differentiable A function f : Rd → R is differentiable if it has a gradient
∇f (x) everywhere (for every x ∈ Rd ) [5]. 1, 8, 9, 13, 21, 22
empirical risk The empirical risk of a given hypothesis on a given set of
data points is the average loss of the hypothesis computed over all data
points in that set. 2, 6, 12, 15, 22
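The definition above can be sketched in a few lines; the dataset and the hypothesis below (a linear map with weight 2, scored with the squared error loss) are hypothetical choices for illustration only.

```python
import numpy as np

# Hypothetical dataset: features X and labels y of four data points.
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

def h(x, w=2.0):
    """A linear hypothesis map h(x) = w * x (weight chosen for illustration)."""
    return w * x

def empirical_risk(X, y, h):
    """Average (squared error) loss of the hypothesis h over all data points."""
    return float(np.mean((y - h(X)) ** 2))

risk = empirical_risk(X, y, h)
```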
estimation error Consider data points with feature vectors x and label y.
In some applications we can model the relation between features and
label of a data point as y = h̄(x) + ε. Here we used some true hypothesis
h̄ and a noise term ε which might represent modelling or labelling errors.
The estimation error incurred by a ML method that learns a hypothesis
ĥ, e.g., using ERM, is defined as ĥ − h̄. For a parametrized hypothesis
space, consisting of hypothesis maps that are determined by a parameter
vector w, we define the estimation error in terms of parameter vectors
as ∆w = ŵ − w. 7, 8
feature A feature of a data point is one of its properties that can be measured
or computed in an automated fashion. For example, if a data point is a
bitmap image, then we could use the red-green-blue intensities of its
pixels as features. Some widely used synonyms for the term feature
are “covariate”, “explanatory variable”, “independent variable”, “input
(variable)”, “predictor (variable)” or “regressor” [42–44]. However, this
book makes consistent use of the term feature for low-level properties
of data points that can be measured easily. 1–7, 9–16, 19–21
feature map A map that transforms the original features of a data point
into new features. The so-obtained new features might be preferable
over the original features for several reasons. For example, the shape of
datasets might become simpler in the new feature space, allowing the
use of linear models in the new features. Another reason could be that
the number of new features is much smaller, which is preferable in terms
of avoiding overfitting. The special case of feature maps that deliver two
numeric features is particularly useful for data visualization. Indeed,
we can then depict data points in a scatterplot by using these two
features as the coordinates of a data point. 13
federated learning (FL) Federated learning is an umbrella term for ML
methods that train models in a collaborative fashion using decentralized
data and computation. 1–9, 16, 17
gradient-based method Gradient-based methods are iterative algorithms
for finding the minimum (or maximum) of a differentiable objective func-
tion of the model parameters. These algorithms construct a sequence
of approximations to an optimal choice for model parameters that re-
sults in a minimum objective function value. As their name indicates,
gradient-based methods use the gradients of the objective function eval-
uated during previous iterations to construct new (hopefully) improved
model parameters. 1–5, 8, 11, 22
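A minimal sketch of a gradient-based method, assuming a simple hand-picked objective f(w) = (w − 3)^2 that is not taken from the notes:

```python
# Minimal gradient descent sketch on a hand-picked objective (illustrative
# only): f(w) = (w - 3)^2, whose gradient is f'(w) = 2 * (w - 3).
def grad(w):
    return 2.0 * (w - 3.0)

w = 0.0          # initial choice for the model parameter
step_size = 0.1  # controls how far each iteration moves against the gradient
for _ in range(200):
    w = w - step_size * grad(w)  # basic GD update
```

After enough iterations, w approaches the minimizer w = 3 of the objective.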
design choice of the hypothesis space should take into account available
computational resources and statistical aspects. If the computational
infrastructure allows for efficient matrix operations, and there is an
(approximately) linear relation between features and label, a useful choice
for the hypothesis space might be the linear model. 1, 4, 5, 10–14, 16,
20, 23, 24
label A higher-level fact or quantity of interest associated with a data point.
If a data point is an image, its label might be the fact that it shows a cat
(or not). Some widely used synonyms for the term label are “response
variable”, “output variable” or “target” [42–44]. 1–3, 5, 6, 9–14, 16, 19,
21
label space Consider a ML application that involves data points charac-
terized by features and labels. The label space is constituted by all
potential values that the label of a data point can take on. Regres-
sion methods, aiming at predicting numeric labels, often use the label
space Y = R. Binary classification methods use a label space that
consists of two different elements, e.g., Y = {−1, 1}, Y = {0, 1} or
Y = {“cat image”, “no cat image”}. 9
law of large numbers The law of large numbers refers to the convergence
of the average of an increasing (large) number of i.i.d. RVs to the mean
(or expectation) of their common probability distribution. Different
instances of the law of large numbers are obtained using different notions
of convergence. 11
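A quick simulation illustrates the law of large numbers; the distribution (a Gaussian with mean 5) and sample size are arbitrary choices for illustration.

```python
import numpy as np

# LLN sketch: the average of n i.i.d. RVs approaches their common mean as n
# grows. Distribution and sample size are hypothetical illustrative choices.
rng = np.random.default_rng(42)
mean_true = 5.0
samples = rng.normal(loc=mean_true, scale=2.0, size=100_000)

avg_large = samples.mean()               # average over many realizations
err_large = abs(avg_large - mean_true)   # deviation from the true mean
```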
or improved in each iteration. A prime example for such a parameter
is the step size used in GD. Some authors use the term learning rate
mostly as a synonym for the step size of (a variant of) GD. 1–3, 11, 22
least absolute shrinkage and selection operator (Lasso) The least ab-
solute shrinkage and selection operator (Lasso) is an instance of struc-
tural risk minimization (SRM) for learning the weights w of a linear map
h(x) = wT x. The Lasso minimizes the sum consisting of an average
squared error loss (as in linear regression) and the scaled ℓ1 norm of the
weight vector w. 20
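The Lasso objective described above can be written out directly; the data, the candidate weight vector, and the regularization parameter below are hypothetical, and the exact scaling of the two terms varies between references.

```python
import numpy as np

# Lasso objective sketch: average squared error loss plus a scaled l1 norm.
# All values here are hypothetical, and scaling conventions differ.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])     # feature vectors of three data points (rows)
y = np.array([1.0, 2.0, 3.0])  # labels
lam = 0.1                      # regularization parameter

def lasso_objective(w):
    avg_sq_loss = np.mean((y - X @ w) ** 2)
    return avg_sq_loss + lam * np.sum(np.abs(w))

obj = lasso_objective(np.array([1.0, 2.0]))  # w = (1, 2) fits y exactly here
```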
linear model We use the term linear model in a very specific sense. In
particular, a linear model is a hypothesis space which consists of all
linear maps,
H^(d) := {h(x) = w^T x : w ∈ R^d }. (10.1)
linear regression Linear regression aims at learning a linear hypothesis map
to predict a numeric label based on numeric features of a data point.
The quality of a linear hypothesis map is measured using the average
squared error loss incurred on a set of labeled data points (which we
refer to as training set). 1–4, 7, 8, 12–14
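A minimal linear regression sketch using NumPy's least-squares solver on a hypothetical (noise-free) training set:

```python
import numpy as np

# Linear regression sketch: minimize the average squared error loss on a
# hypothetical noise-free training set with a single numeric feature.
X = np.array([[1.0], [2.0], [3.0], [4.0]])  # feature matrix, one column
y = np.array([2.0, 4.0, 6.0, 8.0])          # labels generated by y = 2 * x

# np.linalg.lstsq returns the least-squares solution minimizing ||y - X w||_2.
w, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
```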
local model Consider a collection of local datasets that are assigned to the
nodes of an empirical graph. A local model H^(i) is a hypothesis space
that is assigned to a node i ∈ V. Different nodes might be assigned
different hypothesis spaces, i.e., in general H^(i) ̸= H^(i′) for different
nodes i, i′ ∈ V. 1–6, 9, 14
loss With a slight abuse of language, we use the term loss either for the loss
function itself or for its value for a specific pair of a data point and a
hypothesis. 1, 5–15, 19–23
loss function A loss function is a map

L : X × Y × H → R₊ : ((x, y), h) ↦ L ((x, y), h),

which assigns to a pair, consisting of a data point with features x and
label y and a hypothesis h ∈ H, the non-negative real number L ((x, y), h).
The loss value L ((x, y), h) quantifies the discrepancy between the true
label y and the predicted label h(x). Smaller (closer to zero) values
L ((x, y), h) mean a smaller discrepancy between predicted label and
true label of a data point. Figure 6 depicts a loss function for a given
data point, with features x and label y, as a function of the hypothesis
h ∈ H. 1–3, 5, 7–9, 13, 14, 20, 21
Figure 6: Some loss function L ((x, y), h) for a fixed data point, with feature
vector x and label y, and varying hypothesis h. ML methods try to find
(learn) a hypothesis that incurs minimum loss.
model We use the term model as a synonym for hypothesis space. 1, 3–5, 7,
9–12, 16, 24
tinuous RV x ∈ R^d [3, 50, 51]. This family is parametrized by the mean
m and covariance matrix C of x. If the covariance matrix is invertible,
the probability distribution of x is

p(x) ∝ exp(−(1/2)(x − m)^T C^{−1} (x − m)).
parameters The parameters of a ML model are tunable (learnable or ad-
justable) quantities that allow us to choose between different hypothesis
maps. For example, the linear model H := {h : h(x) = w1 x + w2 }
consists of all hypothesis maps h(x) = w1 x + w2 with a particular choice
for the parameters w1 , w2 . Another example of parameters are the
weights assigned to the connections of an ANN. 1, 3, 5–7, 9, 10
privacy leakage Consider a (ML or FL) system that processes a local dataset
D(i) and shares data, such as the predictions obtained for new data
points, with other parties. Privacy leakage arises if the shared data
carries information about a private (sensitive) feature of a data point
(which might be a human) of D(i) . The amount of privacy leakage can
be measured via mutual information using a probabilistic model for the
local dataset. 17
of a RV. The probability distribution of a binary RV y ∈ {0, 1} is fully
specified by the probabilities p(y = 0) and p(y = 1) = 1 − p(y = 0).

quadratic function A quadratic function f : R^d → R is of the form

f(w) = w^T Qw + q^T w + a,

with some matrix Q ∈ R^{d×d}, vector q ∈ R^d and scalar a ∈ R. 7, 8, 14
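For a symmetric invertible Q, the gradient of such a quadratic function is 2Qw + q, which vanishes at w* = −Q^{−1}q/2. A numerical check with hypothetical values:

```python
import numpy as np

# Stationary point of f(w) = w^T Q w + q^T w + a for symmetric invertible Q:
# the gradient 2 Q w + q vanishes at w* = -Q^{-1} q / 2. Values hypothetical.
Q = np.array([[2.0, 0.0],
              [0.0, 4.0]])   # symmetric, positive definite
q = np.array([-4.0, -8.0])
a = 1.0

w_star = -0.5 * np.linalg.solve(Q, q)  # stationary point of f
grad_at_w_star = 2.0 * Q @ w_star + q  # numerically zero at w_star
```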
compared to the average loss on the training set. 1, 4, 7, 9, 10, 12, 13,
19, 20, 22
training set might differ from its prediction errors on data points outside
the training set. Ridge regression uses the regularizer R(h) := ∥w∥_2^2
for linear hypothesis maps h^(w)(x) := w^T x [4, Ch. 3]. The least
absolute shrinkage and selection operator (Lasso) uses the regularizer
R(h) := ∥w∥_1 for linear hypothesis maps h^(w)(x) := w^T x [4, Ch. 3].
6, 8, 9, 13
ridge regression Ridge regression learns the parameter (or weight) vector w
of a linear hypothesis map h^(w)(x) = w^T x. The quality of a particular
choice for the parameter vector w is measured by the sum of two
components. The first component is the average squared error loss
incurred by h^(w) on a set of labeled data points (the training set). The
second component is the scaled squared Euclidean norm λ∥w∥_2^2 with
a regularization parameter λ > 0. It can be shown that the effect
of adding λ∥w∥_2^2 to the average squared error loss is equivalent to
replacing the original data points by an ensemble of realizations of a
RV centered around these data points. 7, 12–15, 20
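A sketch of the closed-form ridge regression solution, assuming the objective (1/m)∥y − Xw∥_2^2 + λ∥w∥_2^2 (scaling conventions vary across references); the data are hypothetical.

```python
import numpy as np

# Ridge regression sketch assuming the objective
# (1/m) * ||y - X w||_2^2 + lam * ||w||_2^2 (scaling conventions vary).
# Setting the gradient to zero gives (X^T X / m + lam * I) w = X^T y / m.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])     # hypothetical training set (rows = data points)
y = np.array([1.0, 2.0, 3.0])
m, d = X.shape
lam = 0.1

w_ridge = np.linalg.solve(X.T @ X / m + lam * np.eye(d), X.T @ y / m)
w_ls = np.linalg.solve(X.T @ X, X.T @ y)  # unregularized solution, for comparison
```

The regularization shrinks the learned weights: the Euclidean norm of w_ridge is smaller than that of the unregularized solution w_ls.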
as the realizations of i.i.d. RVs, also the loss L ((x, y), h) becomes the
realization of a RV. Using such an i.i.d. assumption allows us to define
the risk of a hypothesis as the expected loss E{L ((x, y), h)}. Note that
the risk of h depends on both the specific choice for the loss function
and the probability distribution of the data points. 1, 6, 9, 15, 19
squared error loss The squared error loss measures the prediction error of
a hypothesis h when predicting a numeric label y ∈ R from the features
x of a data point. It is defined as
L ((x, y), h) := (y − h(x))^2 , (10.2)

where ŷ := h(x) denotes the predicted label.
step size Many ML methods use iterative optimization methods (such as
gradient-based methods) to construct a sequence of increasingly accurate
hypothesis maps h(1) , h(2) , . . .. The rth iteration of such an algorithm
starts from the current hypothesis h(r) and tries to modify it to obtain
an improved hypothesis h(r+1) . Iterative algorithms often use a step
size (hyper-) parameter. The step size controls the amount by which a
single iteration can change or modify the current hypothesis. Since the
overall goal of such iterative ML methods is to learn an (approximately)
optimal hypothesis, we also refer to a step size parameter as a learning
rate. 1
perturbations of the data points in the training set. 12, 20
training error The average loss of a hypothesis when predicting the labels
of data points in a training set. We sometimes also use the term training
error for the minimum average loss incurred on the training set by any
hypothesis out of a hypothesis space. 9–12, 15, 23
training set A set of data points that is used in ERM to learn a hypothesis
ĥ. The average loss of ĥ on the training set is referred to as the training
error. The comparison between training error and validation error of ĥ
allows to diagnose ML methods and informs how to improve them (e.g.,
using a different hypothesis space or collecting more data points). 3,
5–16, 19, 20, 22, 23
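The diagnostic use of a training set alongside held-out data can be sketched as follows; the data-generating model and the split sizes are hypothetical.

```python
import numpy as np

# Sketch: compare training error and validation error of a hypothesis learned
# via ERM. The data-generating model and the split sizes are hypothetical.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=40)
y = 3.0 * x + 0.1 * rng.standard_normal(40)  # noisy linear labels

x_train, y_train = x[:30], y[:30]  # training set
x_val, y_val = x[30:], y[30:]      # validation set (held-out data points)

w_hat = (x_train @ y_train) / (x_train @ x_train)  # ERM for h(x) = w * x

train_error = np.mean((y_train - w_hat * x_train) ** 2)
val_error = np.mean((y_val - w_hat * x_val) ** 2)
```

Comparing train_error against val_error indicates whether the learned hypothesis generalizes beyond the training set.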
validation Consider a hypothesis ĥ that has been learned via ERM on some
training set D. Validation refers to the practice of trying out a hypothesis
ĥ on a validation set that consists of data points that are not contained
in the training set D. 1, 8
validation set A set of data points that have not been used as training set
in ERM to learn a hypothesis ĥ. The average loss of ĥ on the validation
set is referred to as the validation error and used to diagnose the ML
method (see [4, Sec. 6.6.]). The comparison between training error
and validation error can inform directions for improvements of the ML
method (such as using a different hypothesis space). 8, 9, 11, 12, 15, 23
weights We use the term weights synonymously for a finite set of parameters
within a model. For example, the linear model consists of all linear
maps h(x) = w^T x that read in a feature vector x = (x1 , . . . , xd )^T of a
data point. Each specific linear map is characterized by specific choices
for the parameters (weights) w = (w1 , . . . , wd )^T. 13
∇f (ŵ) = 0 ⇔ f (ŵ) = min_{w∈R^d} f (w).
References
[1] W. Rudin, Real and Complex Analysis, 3rd ed. New York: McGraw-Hill,
1987.
[2] G. H. Golub and C. F. Van Loan, Matrix Computations, 3rd ed. Balti-
more, MD: Johns Hopkins University Press, 1996.
[4] A. Jung, Machine Learning: The Basics, 1st ed. Springer Singapore,
Feb. 2022.
[8] H. Ates, A. Yetisen, F. Güder, and C. Dincer, “Wearable devices for the
detection of COVID-19,” Nature Electronics, vol. 4, no. 1, pp. 13–14, 2021.
[Online]. Available: https://doi.org/10.1038/s41928-020-00533-1
[9] H. Boyes, B. Hallaq, J. Cunningham, and T. Watson, “The
industrial internet of things (IIoT): An analysis framework,”
Computers in Industry, vol. 101, pp. 1–12, 2018. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S0166361517307285
[10] S. Cui, A. Hero, Z.-Q. Luo, and J. Moura, Eds., Big Data over Networks.
Cambridge Univ. Press, 2016.
[15] Y. Cheng, Y. Liu, T. Chen, and Q. Yang, “Federated learning for privacy-
preserving AI,” Communications of the ACM, vol. 63, no. 12, pp. 33–36,
Dec. 2020.
Information Processing Systems (NeurIPS 2020), Vancouver, Canada,
2020.
[30] A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau,
and S. Thrun, “Dermatologist-level classification of skin cancer with deep
neural networks,” Nature, vol. 542, 2017.
[34] S. Chepuri, S. Liu, G. Leus, and A. Hero, “Learning sparse graphs under
smoothness prior,” in Proc. of the IEEE Int. Conf. on Acoustics, Speech
and Signal Processing, 2017, pp. 6508–6512.
[37] X. Liu, H. Li, G. Xu, Z. Chen, X. Huang, and R. Lu, “Privacy-enhanced
federated learning against poisoning adversaries,” IEEE Transactions on
Information Forensics and Security, vol. 16, pp. 4574–4588, 2021.
[47] A. Jung and P. Nardelli, “An information-theoretic approach to person-
alized explainable machine learning,” IEEE Sig. Proc. Lett., vol. 27, pp.
825–829, 2020.
[50] R. Gray, Probability, Random Processes, and Ergodic Properties, 2nd ed.
New York: Springer, 2009.
[52] P. Billingsley, Probability and Measure, 3rd ed. New York: Wiley, 1995.