
Lectures Notes

CS-E4740 Federated Learning

Dipl.-Ing. Dr.techn. Alexander Jung∗

January 21, 2024

Abstract

This course discusses theory and algorithms for federated learning (FL) from collections of local datasets. These FL methods exploit similarities between local datasets to train local (or personalized) models collaboratively. Two core mathematical objects of this course are empirical graphs and generalized total variation minimization (GTVMin). We use empirical graphs to store and process local datasets and parameters of local models. GTVMin formulates FL as an instance of regularized empirical risk minimization. As the regularizer, we use quantitative measures for the variation of local models across the edges of the empirical graph. We can obtain practical FL systems by applying distributed optimization methods to solve GTVMin.

∗AJ is currently Associate Professor for Machine Learning at Aalto University (Finland). This work has been partially funded by the Academy of Finland (decision number 331197) and the European Union (grant number 952410).

Contents

1 Lecture - "Welcome and Intro"
1.1 Learning Goals
1.2 Introduction
1.3 Prerequisites
1.4 Related Courses
1.5 Main Goal of the Course
1.6 Outline of the Course
1.7 Assignments
1.8 Student Project
1.9 Schedule
1.10 Ground Rules

2 Lecture - "ML Basics"
2.1 Learning Goals
2.2 Three Components and a Design Principle
2.3 Computational Aspects of ERM
2.4 Statistical Aspects of ERM
2.5 Validation and Diagnosis of ML
2.6 Regularization
2.7 Assignment

3 Lecture - "FL Design Principle"
3.1 Learning Goals
3.2 Empirical Graphs and Their Laplacian
3.3 Generalized Total Variation Minimization
3.3.1 Computational Aspects of GTVMin
3.3.2 Statistical Aspects of GTVMin
3.4 Assignment

4 Lecture - "Gradient Methods"
4.1 Learning Goals
4.2 The Basic Idea of the Gradient Step
4.3 Hyperparameter of gradient-based methods
4.4 Perturbed Gradient Step
4.5 Constraints
4.6 Assignment

5 Lecture - "FL Algorithms"
5.1 Learning Goals
5.2 Gradient Step for GTVMin
5.3 Message Passing Implementation
5.4 Assignment

6 Lecture - "FL Main Flavors"
6.1 Learning Goals
6.2 Centralized FL
6.3 Clustered FL
6.4 Horizontal FL
6.5 Vertical FL
6.6 Assignment

7 Lecture - "Graph Learning"
7.1 Learning Goals
7.2 Measuring (Dis-)Similarity Between Datasets
7.3 Graph Learning Methods
7.4 Assignment

8 Lecture - "Trustworthy FL"
8.1 Learning Goals

9 Lecture - "Privacy-Protection in FL"
9.1 Learning Goals
9.2 Assignment

10 Lecture - "Data and Model Poisoning in FL"
10.1 Learning Goals
10.2 Data Poisoning
10.3 Model Poisoning
10.4 Assignment

Glossary
Lists of Symbols

Sets and Functions

a ∈ A : This statement indicates that the object a is an element of the set A.

a := b : This statement defines a to be shorthand for b.

|A| : The cardinality (number of elements) of a finite set A.

A ⊆ B : A is a subset of B.

A ⊂ B : A is a strict subset of B.

N : The set of natural numbers 1, 2, . . ..

R : The set of real numbers x [1].

R+ : The set of non-negative real numbers x ≥ 0.

R++ : The set of positive real numbers x > 0.

h(·) : A → B : a ↦ h(a) : A function (map) that accepts any element a ∈ A from a set A as input and delivers a well-defined element h(a) ∈ B of a set B. The set A is the domain of the function h and the set B is the codomain of h. ML aims at finding (or learning) a function h ("hypothesis") that reads in the features x of a data point and delivers a prediction h(x) for its label y.

{0, 1} : The binary set that consists of the two real numbers 0 and 1.

[0, 1] : The closed interval of real numbers x with 0 ≤ x ≤ 1.

argmin f(w) : The set of minimizers for a real-valued function f(w).

log a : The logarithm of the positive number a ∈ R++.
Matrices and Vectors

I_{l×d} : A generalized identity matrix with l rows and d columns. The entries of I_{l×d} ∈ R^{l×d} are equal to 1 along the main diagonal and equal to 0 otherwise.

I : A square identity matrix whose shape should be clear from the context.

R^d : The set of vectors x = (x_1, . . . , x_d)^T consisting of d real-valued entries x_1, . . . , x_d ∈ R.

x = (x_1, . . . , x_d)^T : A vector of length d. The jth entry of the vector is denoted x_j.

∥x∥_2 : The Euclidean (or "ℓ2") norm of the vector x = (x_1, . . . , x_d)^T ∈ R^d, given as ∥x∥_2 := √(∑_{j=1}^d x_j^2).

∥x∥ : Some norm of the vector x ∈ R^d [2]. Unless specified otherwise, we mean the Euclidean norm ∥x∥_2.

x^T : The transpose of a vector x that is considered a single-column matrix. The transpose is a single-row matrix (x_1, . . . , x_d).

X^T : The transpose of a matrix X ∈ R^{m×d}. A square real-valued matrix X ∈ R^{m×m} is called symmetric if X = X^T.

0 = (0, . . . , 0)^T : A vector of zero entries.

(v^T, w^T)^T : The vector of length d + d′ obtained by concatenating the entries of vector v ∈ R^d with the entries of w ∈ R^{d′}.

span{B} : The span of a matrix B ∈ R^{a×b}, which is the subspace of all linear combinations of columns of B, span{B} = {Ba : a ∈ R^b} ⊆ R^a.

S^d_+ : The set of all positive semi-definite (psd) matrices of size d × d.

det(C) : The determinant of the matrix C.
Probability Theory

E_p{f(z)} : The expectation of a function f(z) of a RV z whose probability distribution is p(z). If the probability distribution is clear from context we just write E{f(z)}.

p(x, y) : A (joint) probability distribution of a RV whose realizations are data points with features x and label y.

p(x|y) : A conditional probability distribution of a RV x given the value of another RV y [3, Sec. 3.5].

p(x; w) : A parametrized probability distribution of a RV x. The probability distribution depends on a parameter vector w. For example, p(x; w) could be a multivariate normal distribution with the parameter vector w given by the entries of the mean vector E{x} and the covariance matrix E{(x − E{x})(x − E{x})^T}.

N(µ, σ^2) : The probability distribution of a scalar normal ("Gaussian") RV x ∈ R with mean (or expectation) µ = E{x} and variance σ^2 = E{(x − µ)^2}.

N(µ, C) : The multivariate normal distribution of a vector-valued Gaussian RV x ∈ R^d with mean (or expectation) µ = E{x} and covariance matrix C = E{(x − µ)(x − µ)^T}.
Machine Learning

r : An index r = 1, 2, . . . , that enumerates data points.

m : The number of data points in (the size of) a dataset.

D : A dataset D = {z^(1), . . . , z^(m)} is a list of individual data points z^(r), for r = 1, . . . , m.

d : Number of features that characterize a data point.

x_j : The jth feature of a data point. The first feature of a given data point is denoted x_1, the second feature x_2 and so on.

x : The feature vector x = (x_1, . . . , x_d)^T of a data point whose entries are the individual features of a data point.

X : The feature space X is the set of all possible values that the features x of a data point can take on.

z : Beside the symbol x, we sometimes use z as another symbol to denote a vector whose entries are features of a data point. We need two different symbols to distinguish between "raw" or "original" and learnt features [4, Ch. 9].

x^(r) : The feature vector of the rth data point within a dataset.

x_j^(r) : The jth feature of the rth data point within a dataset.

B : A mini-batch (subset) of randomly chosen data points.

B : The size of (the number of data points in) a mini-batch.

y : The label (quantity of interest) of a data point.

y^(r) : The label of the rth data point.

(x^(r), y^(r)) : The features and label of the rth data point.

Y : The label space Y of a ML method consists of all potential label values that a data point can have. We often use label spaces that are larger than the set of different label values arising in a given dataset (e.g., a training set). We refer to ML problems (methods) using a numeric label space, such as Y = R or Y = R^3, as regression problems (methods). ML problems (methods) that use a discrete label space, such as Y = {0, 1} or Y = {"cat", "dog", "mouse"}, are referred to as classification problems (methods).

α : The learning rate (step-size) used by gradient-based methods.

h(·) : A hypothesis map that reads in features x of a data point and delivers a prediction ŷ = h(x) for its label y.

Y^X : Given two sets X and Y, we denote by Y^X the set of all possible hypothesis maps h : X → Y.

H : A hypothesis space or model used by a ML method. The hypothesis space consists of different hypothesis maps h : X → Y between which the ML method has to choose.

d_eff(H) : The effective dimension of a hypothesis space H.

B^2 : The squared bias of a learnt hypothesis ĥ delivered by a ML algorithm that is fed with data points which are modelled as realizations of RVs. If data is modelled as realizations of RVs, also the delivered hypothesis ĥ is the realization of a RV.

V : The variance of the (parameters of the) hypothesis delivered by a ML algorithm. If the input data for this algorithm is interpreted as realizations of RVs, so is the delivered hypothesis a realization of a RV.

L((x, y), h) : The loss incurred by predicting the label y of a data point using the prediction ŷ = h(x). The prediction ŷ is obtained from evaluating the hypothesis h ∈ H for the feature vector x of the data point.

E_v : The validation error of a hypothesis h, which is its average loss incurred over a validation set.

L̂(h|D) : The empirical risk or average loss incurred by the predictions of hypothesis h for the data points in the dataset D.

E_t : The training error of a hypothesis h, which is its average loss incurred over a training set.

t : A discrete-time index t = 0, 1, . . . used to enumerate a sequence of sequential events ("time instants").

t : An index that enumerates learning tasks within a multi-task learning problem.

λ : A regularization parameter that controls the amount of regularization.

λ_j(Q) : The jth eigenvalue (sorted either ascending or descending) of a psd matrix Q. We also use the shorthand λ_j if the corresponding matrix is clear from context.

σ(·) : The activation function used by an artificial neuron within an artificial neural network (ANN).

R_ŷ : A decision region within a feature space.

w : A parameter vector w = (w_1, . . . , w_d)^T whose entries are parameters of a model. These parameters could be feature weights in linear maps, the weights in ANNs or the thresholds used for splits in decision trees.

h^(w)(·) : A hypothesis map that involves tunable model parameters w_1, . . . , w_d, stacked into the vector w = (w_1, . . . , w_d)^T.

∇f(w) : The gradient of a differentiable real-valued function f : R^d → R is the vector ∇f(w) = (∂f/∂w_1, . . . , ∂f/∂w_d)^T ∈ R^d [5, Ch. 9].

ϕ(·) : A feature map ϕ : X → X′ : x ↦ x′ := ϕ(x) ∈ X′.
Federated Learning

G = (V, E) : An empirical graph whose nodes i ∈ V carry local datasets and local models.

i ∈ V : A node in the empirical graph that represents a local dataset and a corresponding local model. It might also be useful to think of node i as a small computer that can collect data and execute computations to train ML models.

D^(i) : The local dataset D^(i) at node i ∈ V of an empirical graph.

m_i : The number of data points (sample size) contained in the local dataset D^(i) at node i ∈ V.

N(i) : The neighbourhood of the node i in an empirical graph.

x^(i,r) : The features of the rth data point in the local dataset D^(i).

y^(i,r) : The label of the rth data point in the local dataset D^(i).

L^(d)((x, h(x)), h′) : The loss incurred by an "external" hypothesis h′ on a data point with features x and predicted label h(x) that is obtained from some local hypothesis.
1 Lecture - “Welcome and Intro”
Welcome to the course CS-E4740 Federated Learning. This course can be completed fully remotely. Any on-site event will be recorded and made available to students via this YouTube channel. The basic variant (5 credits) of this course consists of lectures (schedule here) and corresponding coding assignments (schedule here). We test your completion of the coding assignments via quizzes (implemented on the MyCourses page). You can upgrade the course to an extended variant (10 credits) by completing a student project (see Section 1.8).

1.1 Learning Goals

This lecture offers

• an introduction to the course topic and its position in the wider curriculum

• a discussion of the learning goals, assignments and student project

• an overview of the course schedule

1.2 Introduction

We are surrounded by devices (e.g., smartphones, wearables, sensors) that generate decentralized collections of local datasets [6–10]. An application-specific network structure relates these local datasets. For example, the high-precision management of pandemics uses contact networks to relate local datasets generated by patients. Network medicine relates data about diseases via co-morbidity networks [11]. Social science uses notions of acquaintance to relate data collected from befriended individuals [12].
Federated learning (FL) is an umbrella term for distributed optimization techniques to train machine learning (ML) models from decentralized collections of local datasets [13–17]. These methods carry out computations, such as gradient steps (see Lecture 4), for ML model training at the location of data generation. This design philosophy is different from a naive application of ML techniques, which first collects all local datasets at a single location (computer) and then feeds the pooled data into a conventional ML method like linear regression.

• Privacy. FL methods are appealing for applications involving sensitive data (such as healthcare) as they do not require the exchange of raw data but only model (parameter) updates [15, 16]. By exchanging only model updates, FL methods are considered privacy-friendly in the sense of not leaking (too much) sensitive information that is contained in the local datasets (see Lecture 9).

• Robustness. By relying on decentralized data and computation, FL methods offer robustness (to some extent) against hardware failures (such as "stragglers") and data poisoning (see Lecture 10).

• Parallel Computing. Many ML systems are based on mobile networks, consisting of humans equipped with smartphones. We can interpret a mobile network as a parallel computer constituted by smartphones that can communicate via radio links. This parallel computer allows us to speed up computational tasks such as the computation of gradients required to train ML models (see Lecture 4).

• Trading Computation against Communication. Consider a FL application where local datasets are generated by low-complexity devices at remote locations that cannot be easily accessed. The cost of communicating raw local datasets to some central unit (which then trains a single global ML model) might be much higher than the computational cost incurred by using the low-complexity devices to (partially) train ML models [19].

• Personalization. FL can be used to train personalized ML models for collections of local datasets, which might be generated by smartphones (and their users) [20]. A key challenge for ensuring personalization is the heterogeneity of local datasets [21, 22]. Indeed, the statistical properties of different local datasets might vary significantly such that they cannot be well modelled as independent and identically distributed (i.i.d.). Each local dataset induces a separate learning task that consists of learning useful parameter values for a local model. This course discusses FL methods to train personalized models via combining the information carried in decentralized and heterogeneous data (see Lecture 6).

1.3 Prerequisites

The main mathematical structure used to study and design FL algorithms is the Euclidean space R^d. We therefore expect some familiarity with the algebraic and geometric structure of R^d. By algebraic structure, we mean the (real) vector space obtained from the elements ("vectors") in R^d along with the usual definitions of vector addition and multiplication by scalars in R [23, 24]. We will make heavy use of concepts from linear algebra to represent and manipulate data and ML models.

The metric structure of R^d will be used to study the (convergence) behaviour of FL algorithms. In particular, we will study FL algorithms that are obtained as fixed-point iterations of some non-linear operator on R^d which depends on the data (distribution) and ML models used within a FL system. A prime example for such a non-linear operator is the gradient step of gradient-based methods (see Lecture 4). The computational properties (such as convergence speed) of these FL algorithms can then be characterized via the contraction properties of the underlying operator [25].

A main tool for the design of FL algorithms are variants of gradient descent (GD). These gradient-based methods are based on approximating a differentiable function f(x) locally by a linear function given by the gradient ∇f(x). We therefore expect some familiarity with multivariable calculus [5].

1.4 Related Courses

In what follows we briefly explain how this course CS-E4740 relates to selected
courses at Aalto University.

• CS-EJ3211 - Machine Learning with Python. Teaches the application of basic ML methods using the Python package (library) scikit-learn [26]. CS-E4740 couples a network of basic ML methods using regularization techniques to obtain tailored (personalized) ML models for local datasets. This coupling is required to adaptively pool local datasets into a sufficiently large training set for the personalized ML model.

• CS-E4510 - Distributed Algorithms. Teaches basic mathematical tools for the study and design of distributed algorithms that are implemented via distributed systems (computers) [27]. FL is enabled by distributed algorithms to train ML models from decentralized data (see Lecture 5).

• CS-C3240 - Machine Learning (spring 2022 edition). Teaches basic theory of ML models and methods [4]. CS-E4740 combines the components of basic ML methods, such as data representation and models, with network models. In particular, instead of a single dataset and a single model (such as a decision tree), we will study networks of local datasets and local models.

• ABL-E2606 - Data Protection. This course discusses important legal constraints ("laws"), including the European General Data Protection Regulation (GDPR), for the use of data and, in turn, for the design of trustworthy FL methods.

• MS-C2105 - Introduction to Optimization. This course teaches basic optimization theory and how to model applications as (linear, integer, and non-linear) optimization problems. CS-E4740 uses optimization theory and methods to formulate FL problems (see Lecture 3) and design FL methods (see Lecture 5).

• ELEC-E5424 - Convex Optimization. This course teaches advanced optimization theory for the important class of convex optimization problems [28]. Convex optimization theory and methods can be used for the study and design of FL algorithms.

1.5 Main Goal of the Course

The overarching goal of the course is to demonstrate how to apply concepts from graph theory and mathematical optimization to analyze and design FL algorithms. Students will learn to formulate a given FL application as an optimization problem over an undirected empirical graph G = (V, E) whose nodes i ∈ V represent individual local datasets. We refer to this graph as the empirical graph of a collection of local datasets (see Lecture 3).

This course uses only undirected empirical graphs with a finite number n of nodes, which we identify with the first n positive integers:

V := {1, . . . , n}.

An edge {i, i′} ∈ E in the empirical graph G connects two different local datasets if they have similar statistical properties. We quantify the amount of similarity by the positive edge weight A_{i,i′} > 0.

We can formalize a FL application as an optimization problem associated with an empirical graph,

min_{{w^(i)}} ∑_{i∈V} L_i(w^(i)) + λ ∑_{{i,i′}∈E} A_{i,i′} d(w^(i), w^(i′)).    (1.1)

We refer to this problem as GTV minimization (GTVMin) and devote much of the course to the discussion of its computational and statistical properties.

The optimization variables w^(i) in (1.1) are local model parameters at the nodes i ∈ V of an empirical graph. The objective function in (1.1) consists of two components: The first component is a sum over all nodes of the loss values L_i(w^(i)) incurred by local model parameters at each node i. The second component is the sum of the variations of local model parameters across the edges {i, i′} of the empirical graph.

1.6 Outline of the Course

Our course is roughly divided into three parts:

• Part I: ML Refresher. Lecture 2 introduces data, models and loss functions as three main components of ML. This lecture also explains how these components are combined within empirical risk minimization (ERM). We also discuss how regularization of ERM can be achieved via manipulating its three main components. We then explain when and how to solve regularized ERM via simple GD methods in Lecture 4. Overall, this part serves two main purposes: (i) to briefly recap basic concepts of ML in a simple centralized setting and (ii) to highlight ML techniques (such as regularization) that are particularly relevant for the design and analysis of FL methods.

• Part II: FL Theory and Methods. Lecture 3 introduces the empirical graph as our main mathematical structure for representing collections of local datasets and corresponding tailored models. The undirected and weighted edges of the empirical graph represent statistical similarities between local datasets. Lecture 3 also formulates FL as an instance of regularized empirical risk minimization (RERM) which we refer to as GTVMin. GTVMin uses the variation of personalized model parameters across edges in the empirical graph as regularizer. We will see that GTVMin couples the training of tailored (or "personalized") ML models such that well-connected nodes (clusters) in the empirical graph will obtain similar trained models. Lecture 4 discusses variations of gradient descent as our main algorithmic toolbox for solving GTVMin. Lecture 5 shows how FL algorithms can be obtained in a principled fashion by applying optimization methods, such as gradient-based methods, to GTVMin. We will obtain FL algorithms that can be implemented as iterative message passing methods for the distributed training of tailored ("personalized") models. Lecture 6 derives some main flavours of FL as special cases of GTVMin. The usefulness of GTVMin crucially depends on the choice for the weighted edges in the empirical graph. Lecture 7 discusses graph learning methods that determine a useful empirical graph via different notions of statistical similarity between local datasets.

• Part III: Trustworthy AI. Lecture 8 enumerates seven key requirements for trustworthy artificial intelligence (AI) that have been put forward by the European Union. These key requirements include the protection of privacy as well as robustness against (intentional) perturbations of data or computation. We then discuss how FL algorithms can ensure privacy protection in Lecture 9. Lecture 10 discusses how to evaluate and ensure robustness of FL methods against intentional perturbations (poisoning) of local datasets.

1.7 Assignments

The course will consist of assignments, each covering the topics of a corresponding lecture. Each assignment requires you to implement the concepts discussed in the corresponding lecture using Python. After solving the assignment, you can answer MyCourses quizzes.

1.8 Student Project

You can extend the basic variant (which is worth 5 credits) to 10 credits by
completing a student project and peer review. This project requires you to
formulate an application of your choice as a FL problem using the concepts
from this course. You then have to solve this FL problem using the FL
algorithms taught in this course. The main deliverable will be a project
report which must follow the structure indicated in the template. You will
then peer-review the reports of your fellow students by answering a detailed
questionnaire.

1.9 Schedule

The course lectures are held on Mondays and Wednesdays at 16:15, from 28-Feb-2024 until 30-Apr-2024. You can find the detailed schedule and lecture halls following this link. As the course can be completed fully remotely, we will record each lecture and add the recording to the YouTube playlist here in a timely fashion.

After each lecture, we will release the corresponding assignment at this site. You will then have at least one week to work on the assignment before we open the corresponding quiz on the MyCourses page of the course (click me).

1.10 Ground Rules

Note that as a student following this course, you must act according to the Code of Conduct of Aalto University. In particular, the main ground rules for this course are:

• BE HONEST. This course includes many tasks that require independent work, including the coding assignments, the work on student projects and the peer review of student projects. You must not use others' work inappropriately. For example, it is not allowed to copy others' solutions to coding assignments. We will randomly choose students who have to explain their solutions (and corresponding answers to quiz questions).

• BE RESPECTFUL. My personal wish is that this course provides a safe space for an enjoyable learning experience. Any form of disrespectful behaviour, including on any course-related communication platform, will be sanctioned rigorously (including reporting to university authorities).
2 Lecture - “ML Basics”
This lecture covers basic ML techniques that are instrumental for FL. This lecture is significantly more extensive content-wise than the following lectures. However, it should be relatively easy to follow as it mainly refreshes prerequisite knowledge.

2.1 Learning Goals

After this lecture, you should

• be familiar with the concepts of data points (their features and labels), model and loss function,

• be familiar with ERM as a design principle for ML systems,

• know why and how validation is performed,

• know how to regularize ERM via modifying data, model and loss.

2.2 Three Components and a Design Principle

Machine Learning (ML) revolves around learning a hypothesis map h out of a hypothesis space H that allows us to accurately predict the label of a data point solely from its features. One of the most crucial steps in applying ML methods to a given application domain is the definition or choice of what precisely a data point is. Coming up with a good choice or definition of data points is not trivial as it influences the overall performance of a ML method in many different ways.

During this course we will focus mainly on one specific choice for the data
points. In particular, we will consider data points that represent the daily
weather condition around a weather station of the Finnish Meteorological
Institute (FMI). We denote a specific data point by z. It is characterized by
the following features:

• name of the FMI weather station, e.g., “TurkuRajakari”

• latitude lat and longitude lon of the weather station, e.g., lat := 60.37788,
lon := 22.0964,

• date of the day in format DDMMYYYY, e.g., 01022022

• minimum daytime temperature.

It is convenient to stack the features into a feature vector x. The label y ∈ R of such a data point is the maximum daytime temperature.

We predict the label by the value h(x) of a hypothesis h. The prediction will typically not be perfect, i.e., h(x) ≠ y. We measure the prediction error by a loss function such as the squared error loss L(z, h) := (y − h(x))^2. It seems natural to choose (or learn) a hypothesis that incurs minimum average loss (or empirical risk) on a given set of data points D := {(x^(1), y^(1)), . . . , (x^(m), y^(m))}.

This is known as ERM,

ĥ ∈ argmin_{h∈H} (1/m) ∑_{r=1}^{m} (y^(r) − h(x^(r)))^2    (2.1)

As our notation indicates (using the symbol "∈" instead of ":="), there might be several different solutions to the optimization problem (2.1). Unless specified otherwise, ĥ can be used to denote any hypothesis in H that has minimum average loss over D.

Many machine learning (ML) methods employ a parametrized model H where each hypothesis h ∈ H is defined by a parameter vector w ∈ R^d. A prominent instance of such a parametrized model is the linear model [4, Sec. 3.1],

H^(d) := {h(x) := w^T x : w ∈ R^d}.    (2.2)

Linear regression, for example, determines the parameters of a linear model by minimizing the average squared error loss. For linear regression, ERM becomes an optimization over the parameter space R^d,

ŵ^(LR) ∈ argmin_{w∈R^d} (1/m) ∑_{r=1}^{m} (y^(r) − w^T x^(r))^2,    (2.3)

where we denote the objective function by f(w).

Note that (2.3) amounts to finding the minimum of a smooth and convex function

f(w) = (1/m) (w^T X^T X w − 2 y^T X w + y^T y)    (2.4)

with the feature matrix

X := (x^(1), . . . , x^(m))^T    (2.5)

and the label vector

y := (y^(1), . . . , y^(m))^T    (2.6)

of the training set D.

Inserting (2.4) into (2.3) allows us to formulate linear regression as

ŵ^(LR) ∈ argmin_{w∈R^d} w^T Q w + w^T q    (2.7)

with Q := (1/m) X^T X, q := −(2/m) X^T y.
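For concreteness, the following minimal numpy sketch (not part of the original notes; the synthetic data is purely illustrative) solves (2.7) via its zero-gradient condition 2Qw + q = 0:

import numpy as np

# Minimal sketch: solve ERM (2.3) for linear regression in closed form via
# the zero-gradient condition of (2.7), i.e., 2*Q*w + q = 0.
rng = np.random.default_rng(0)
m, d = 100, 5                      # number of data points and features (arbitrary)
X = rng.normal(size=(m, d))        # feature matrix (2.5), rows are x^(r)
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=m)  # synthetic labels (2.6)

Q = (X.T @ X) / m
q = -(2 / m) * (X.T @ y)
w_hat = np.linalg.solve(Q, -q / 2)  # minimizer of (2.7), assuming Q is invertible

print("training error:", np.mean((y - X @ w_hat) ** 2))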

To train a ML model H means to solve ERM (2.1) (or (2.3) for linear regression); the dataset D is therefore referred to as a training set. The trained model results in the learnt hypothesis ĥ. We obtain practical ML methods by applying optimization algorithms to solve (2.1).

Figure 1: ERM (2.1) for linear regression minimizes a convex quadratic function w^T Q w + w^T q.

Two key questions arise:

• Computational aspects. How much computation is required to solve (2.1)?

• Statistical aspects. How useful is the solution ĥ of (2.1) in practice, i.e., how accurate is the prediction ĥ(x) for the label y of an arbitrary data point with features x?

2.3 Computational Aspects of ERM

ML methods use optimization algorithms to solve (2.1) in order to learn a hypothesis ĥ. Within this course, we use optimization algorithms that are iterative methods: Starting from an initial choice h^(0), they construct a sequence

h^(0), h^(1), h^(2), . . . ,

which are hopefully increasingly accurate approximations to a solution ĥ of (2.1). The computational complexity of such a ML method can be measured by the number of iterations required to guarantee some prescribed level of approximation.
For a parameterized model and a smooth loss function, we can solve (2.3) by gradient-based methods: Starting from initial parameters w^(0), we iterate the gradient step:

w^(k) := w^(k−1) − α ∇f(w^(k−1))
       = w^(k−1) + (2α/m) ∑_{r=1}^{m} x^(r) (y^(r) − (w^(k−1))^T x^(r)).    (2.8)

How much computation do we need for one iteration of (2.8)? How many iterations do we need? We will try to answer the latter question in Lecture 4. The first question can be answered more easily for typical computational infrastructure (e.g., "Python running on a commercial laptop"). Indeed, a naive evaluation of (2.8) requires around m arithmetic operations (addition, multiplication).
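A minimal numpy sketch of the gradient step (2.8) on synthetic data (the step size, iteration count and data are arbitrary illustrative choices) might look as follows:

import numpy as np

# Minimal sketch: iterate the gradient step (2.8) for linear regression.
rng = np.random.default_rng(0)
m, d = 100, 5
X = rng.normal(size=(m, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=m)

alpha = 0.1          # learning rate (step size); must be small enough for convergence
w = np.zeros(d)      # initial parameters w^(0)
for k in range(500):
    # one gradient step (2.8): costs on the order of m*d arithmetic operations
    w = w + (2 * alpha / m) * X.T @ (y - X @ w)

print("final average squared error:", np.mean((y - X @ w) ** 2))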
It is instructive to consider the special case of a linear model which does not use any features, i.e., h(x) = w. For this extreme case, the ERM (2.3) has a simple closed-form solution:

ŵ = (1/m) ∑_{r=1}^{m} y^(r).    (2.9)

Thus, for this special case of the linear model, solving (2.3) amounts to summing the m numbers y^(1), . . . , y^(m). It seems reasonable to assume that the amount of computation required to compute (2.9) is proportional to m.

2.4 Statistical Aspects of ERM

We have formulated the training of a linear model on a given training set as ERM (2.3). But how useful is its solution ŵ for predicting the labels of data points outside the training set? Consider applying the learnt hypothesis h^(ŵ) to an arbitrary data point with label y and features x that is not contained in the training set. What can we say about the resulting prediction error y − h^(ŵ)(x) in general? In other words, how well does h^(ŵ) generalize beyond the training set?
Maybe the most widely used approach to study the generalization of ML methods is via a probabilistic perspective. Here, we interpret data points as realizations of i.i.d. RVs with common probability distribution p(x, y). Under this i.i.d. assumption, we can evaluate the overall performance of a hypothesis h ∈ H via the expected loss (or risk)

E{L((x, y), h)}.    (2.10)

One example for a probability distribution p(x, y) is obtained via relating the label y with the features x of a data point as

y = w^T x + ε with x ∼ N(0, I), ε ∼ N(0, σ^2).    (2.11)

A simple calculation reveals the expected squared error loss of a given linear hypothesis h(x) = x^T ŵ as

E{(y − h(x))^2} = ∥w − ŵ∥_2^2 + σ^2.    (2.12)

The component σ^2 can be interpreted as the intrinsic noise level of the label y. We cannot hope to find a hypothesis with expected loss smaller than this level. The first component on the RHS of (2.12) is the estimation error ∥w − ŵ∥_2^2 of a ML method that reads in the training set and delivers an estimate ŵ (e.g., via (2.3)) for the parameters of a linear hypothesis.

We next study the estimation error w − ŵ incurred by the specific estimate ŵ = ŵ^(LR) (2.7) delivered by linear regression methods. To this end, we first use the probabilistic model (2.11) to decompose the label vector y in (2.6) as

y = Xw + n, with n := (ε^(1), . . . , ε^(m))^T.    (2.13)

Inserting (2.13) into (2.7) yields

ŵ^(LR) ∈ argmin_{w∈R^d} w^T Q w + w^T q′ + w^T e    (2.14)

with Q := (1/m) X^T X, q′ := −(2/m) X^T X w, and e := −(2/m) X^T n.    (2.15)

We illustrate the objective function of (2.14) in Figure 2. This function can be interpreted as a perturbation of the convex quadratic function w^T Q w + w^T q′, which is minimized at the true parameter vector w from (2.11). In general, the minimizer ŵ^(LR) delivered by linear regression is different from w due to the perturbation term w^T e in (2.14).

Let us assume in what follows that the matrix Q = (1/m) X^T X is invertible.¹ It is then not too difficult to verify the following upper bound

∥ŵ^(LR) − w∥_2 ≤ ∥e∥_2 / λ_min(Q).    (2.16)

Here, λ_min(Q) denotes the smallest eigenvalue of the matrix Q = (1/m) X^T X ∈ R^{d×d}. Note that the matrix Q is psd and therefore its eigenvalues are all real-valued and non-negative [24]. Moreover, since we assume Q is invertible, they are strictly positive and, in turn, λ_min(Q) > 0.


¹Can you think of sufficient conditions on the feature matrix of the training set that ensure Q = (1/m) X^T X is invertible?
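The bound (2.16) can be checked numerically. The following minimal sketch (illustrative, not from the original notes) generates data according to (2.11) and compares the estimation error of linear regression with the RHS of (2.16):

import numpy as np

# Minimal sketch: empirically check the bound (2.16) under the model (2.11).
rng = np.random.default_rng(1)
m, d, sigma = 200, 5, 0.5
w_true = rng.normal(size=d)
X = rng.normal(size=(m, d))                  # features x^(r) ~ N(0, I)
n = sigma * rng.normal(size=m)               # noise vector n in (2.13)
y = X @ w_true + n                           # labels via (2.11)

Q = (X.T @ X) / m
e = -(2 / m) * (X.T @ n)                     # perturbation term from (2.15)
w_lr = np.linalg.solve(X.T @ X, X.T @ y)     # linear regression estimate (2.7)

lhs = np.linalg.norm(w_lr - w_true)
rhs = np.linalg.norm(e) / np.linalg.eigvalsh(Q).min()
print(f"estimation error {lhs:.4f} <= bound {rhs:.4f}: {lhs <= rhs}")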

Figure 2: The estimation error of linear regression is determined by the effect of a linear perturbation term w^T e on the minimizer of a convex quadratic function.

2.5 Validation and Diagnosis of ML

The above analysis of the generalization error started from postulating a probabilistic model for the generation of data points. However, this probabilistic model might be wrong and the bound (2.16) might not apply. Thus, we might want to use a more data-driven approach for assessing the usefulness of a trained model.

Validation methods try to find out if a learnt hypothesis ĥ does well outside the training set. In its most basic form, model validation amounts to computing the average loss of a learnt hypothesis ĥ on some data points that have not been included in the training set. We refer to these data points as the validation set.

The most basic workflow of ML model training and validation can be summarized as follows (a code sketch of this workflow is given after the list):

8
1. gather a dataset and choose a model H

2. split the dataset into a training set D^(train) and a validation set D^(val)

3. learn a hypothesis via solving ERM

   ĥ ∈ argmin_{h∈H} ∑_{(x,y)∈D^(train)} L((x, y), h)    (2.17)

4. compute the resulting training error

   E_t := (1/|D^(train)|) ∑_{(x,y)∈D^(train)} L((x, y), ĥ)

5. compute the validation error

   E_v := (1/|D^(val)|) ∑_{(x,y)∈D^(val)} L((x, y), ĥ)
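As indicated above, here is a minimal sketch of this workflow using scikit-learn (the synthetic dataset and the linear model are placeholder choices):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Minimal sketch of the five-step train/validate workflow above.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))                     # step 1: gather a dataset
y = X @ rng.normal(size=4) + 0.2 * rng.normal(size=300)

# step 2: split into a training set and a validation set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

# step 3: learn a hypothesis via ERM (least squares for a linear model)
h = LinearRegression().fit(X_train, y_train)

# steps 4 and 5: training error E_t and validation error E_v
E_t = mean_squared_error(y_train, h.predict(X_train))
E_v = mean_squared_error(y_val, h.predict(X_val))
print(f"E_t = {E_t:.4f}, E_v = {E_v:.4f}")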

We can diagnose an ERM-based ML method by comparing its training error with its validation error. This diagnosis is further enabled if we know a baseline E^(ref). One important source for a baseline E^(ref) are probabilistic models for the data points (see Section 2.4).

Given a probabilistic model p(x, y), we can compute the minimum achievable risk (2.10). Indeed, the minimum achievable risk is precisely the expected loss of the Bayes estimator ĥ(x) of the label y, given the features x of a data point. The Bayes estimator ĥ(x) is fully determined by the probability distribution p(x, y) [29, Chapter 4].

A further potential source for a baseline E^(ref) is an existing, but for some reason unsuitable, ML method. This existing ML method might be computationally too expensive to be used for the ML application at hand. However, we might still use its statistical properties as a benchmark.

We can also use the performance of human experts as a baseline. If we want to develop a ML method that detects certain types of skin cancer from images of the skin, a benchmark might be the current classification accuracy achieved by experienced dermatologists [30].

We can diagnose a ML method by comparing the training error E_t with the validation error E_v and (if available) the benchmark E^(ref).

• E_t ≈ E_v ≈ E^(ref): The training error is on the same level as the validation error and the baseline. There is not much to improve here since the validation error is already close to the baseline. Moreover, the training error is not much smaller than the validation error, which indicates that there is no overfitting.

• E_v ≫ E_t: The validation error is significantly larger than the training error. This is an indicator for overfitting, which can be addressed either by reducing the effective dimension of the hypothesis space or by increasing the size of the training set. We can reduce the effective dimension of the hypothesis space by using fewer features, a smaller maximum depth of decision trees or fewer layers in an ANN. As an alternative to this discrete model selection, we can also reduce the effective dimension of a hypothesis space via regularization techniques.

• E_t ≈ E_v ≫ E^(ref): The training error is on the same level as the validation error and both are significantly larger than the baseline. Since the training error is not much smaller than the validation error, the learnt hypothesis seems to not overfit the training set. However, the training error achieved by the learnt hypothesis is significantly larger than the baseline. There can be several reasons for this to happen. First, it might be that the hypothesis space is too small, i.e., it does not include a hypothesis that provides a good approximation for the relation between features and label of a data point. One remedy to this situation is to use a larger hypothesis space, e.g., by including more features in a linear model, using higher polynomial degrees in polynomial regression, or using deeper decision trees or ANNs (deep nets). Second, besides the model being too small, another reason for a large training error could be that the optimization algorithm used to solve ERM (2.17) is not working properly (see Lecture 4).

• E_t ≫ E_v: The training error is significantly larger than the validation error. The idea of ERM (2.17) is to approximate the risk (2.10) of a hypothesis by its average loss on a training set D = {(x^(r), y^(r))}_{r=1}^m. The mathematical underpinning for this approximation is the law of large numbers, which characterizes the average of (realizations of) i.i.d. RVs. The accuracy of this approximation depends on the validity of two conditions: First, the data points used for computing the average loss "should behave" like realizations of i.i.d. RVs with a common probability distribution. Second, the number of data points used for computing the average loss must be sufficiently large.

  Whenever the data points behave differently than realizations of i.i.d. RVs, or if the size of the training set or validation set is too small, the interpretation (and comparison) of the training error and the validation error of a learnt hypothesis becomes more difficult. As an extreme case, the validation set might consist of data points for which every hypothesis incurs small average loss. Here, we might try to increase the size of the validation set by collecting more labeled data points or by using data augmentation (see Section 2.6). If the sizes of the training set and validation set are large but we still obtain E_t ≫ E_v, one should verify if the data points in these sets conform to the i.i.d. assumption. There are principled statistical tests for the validity of the i.i.d. assumption for a given dataset (see [31] and references therein).

2.6 Regularization

Consider an ERM-based ML method using a hypothesis space H and dataset D (we assume all data points are used for training). A key parameter for such a ML method is the ratio d_eff(H)/|D| between the model size d_eff(H) and the number |D| of data points. The tendency of the ML method to overfit increases with the ratio d_eff(H)/|D|.

Regularization techniques reduce the ratio d_eff(H)/|D| via three (essentially equivalent) approaches:

• collect more data points, possibly via data augmentation (see Fig. 3),

• add a penalty term λR(h) to the average loss in ERM (2.1) (see Fig. 3),

• shrink the hypothesis space, e.g., by adding constraints on the model parameters such as ∥w∥_2 ≤ 10.

[4, Ch. 7] discusses the equivalence between these three perspectives on regularization in somewhat more detail.

One important example for regularization via adding a penalty term to the average loss is ridge regression. In particular, ridge regression uses the regularizer R(h) := ∥w∥_2^2 for a linear hypothesis h(x) := w^T x. Thus, ridge regression learns the weights of a linear hypothesis via solving

ŵ^(ridge) ∈ argmin_{w∈R^d} (1/m) ∑_{r=1}^{m} (y^(r) − w^T x^(r))^2 + λ ∥w∥_2^2.    (2.18)

The objective function in (2.18) is also obtained if we replace each data point (x, y) ∈ D by a sufficiently large number of i.i.d. realizations of

(x + n, y) with n ∼ N(0, λI).    (2.19)

Thus, ridge regression (2.18) is equivalent to linear regression applied to an augmented variant D′ of D. The augmentation D′ is obtained by replacing each data point (x, y) ∈ D with a sufficiently large number of noisy copies. Each copy is obtained by adding an i.i.d. realization n of a zero-mean Gaussian noise with covariance matrix λI to the features x (see (2.19)). The label of each copy is equal to y, i.e., the label is not perturbed.

Figure 3: Equivalence between data augmentation and loss penalization.

To study the computational aspects of ridge regression, let us rewrite (2.18) as

ŵ^(ridge) ∈ argmin_{w∈R^d} w^T Q w + w^T q

with Q := (1/m) X^T X + λI, q := −(2/m) X^T y.    (2.20)

Thus, like linear regression (2.7), ridge regression also minimizes a convex quadratic function. A main difference between linear regression (2.7) and ridge regression (for λ > 0) is that the matrix Q in (2.20) is guaranteed to be invertible for any training set D. In contrast, the matrix Q in (2.7) for linear regression might be singular for some training sets (consider, e.g., the extreme case where all features of each data point in the training set D are zero).
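The following minimal numpy sketch (with arbitrary synthetic data) computes the ridge solution via (2.20) in a regime with fewer data points than features, where the matrix Q of plain linear regression (2.7) would be singular:

import numpy as np

# Minimal sketch: ridge regression (2.20) in closed form. For lambda > 0 the
# matrix Q is invertible even if X^T X is singular (here, m < d).
rng = np.random.default_rng(3)
m, d, lam = 20, 50, 0.1            # fewer data points than features
X = rng.normal(size=(m, d))
y = rng.normal(size=m)

Q = (X.T @ X) / m + lam * np.eye(d)   # guaranteed invertible for lam > 0
q = -(2 / m) * (X.T @ y)
w_ridge = np.linalg.solve(Q, -q / 2)
print("norm of ridge solution:", np.linalg.norm(w_ridge))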

2.7 Assignment

Source Code. Assignment_MLBasics.py

Data File. Assignment_MLBasicsData.csv

Description. The coding assignment revolves around weather data collected by the FMI and stored in a csv file. This file contains temperature measurements at different locations in Finland. Each temperature measurement is a data point, characterized by d = 7 features x = (x_1, . . . , x_7)^T and a label y which is the temperature measurement itself. The features are (normalized) values of the latitude and longitude of the FMI station where the measurement has been taken as well as the year, month, day, hour and minute at which this measurement has been taken. Your tasks are (a sketch covering the main steps is given after the list):

1. generate numpy arrays whose rth rows hold the features x^(r) and label y^(r), respectively, of the rth data point in the csv file.

14
2. Split the dataset into a training set and validation set. The size of the
training set should be 100.

3. Train a linear model, using the LinearRegression class of the scikit-learn package, on the training set and determine the resulting training error and validation error.

4. Augment the original features by their polynomial combinations using the PolynomialFeatures class.

5. Train and validate a linear model for different choices for the maximal
polynomial degree used in the previous feature augmentation step.

6. Using a fixed value for the polynomial degree in the feature augmentation step, train and validate a linear model using ridge regression (2.18) via the Ridge class. For each choice of λ in (2.18), determine the resulting training error and validation error.
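The following minimal sketch (with placeholder arrays in place of the csv data from task 1; all parameter choices are illustrative) indicates how tasks 2-6 fit together:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

# Placeholder arrays standing in for the data built from the csv file (task 1).
rng = np.random.default_rng(4)
X = rng.normal(size=(500, 7))                  # d = 7 features per data point
y = rng.normal(size=500)                       # temperature labels

X_train, y_train = X[:100], y[:100]            # task 2: training set of size 100
X_val, y_val = X[100:], y[100:]

h_lin = LinearRegression().fit(X_train, y_train)   # task 3: plain linear model
print("task 3: E_t =", mean_squared_error(y_train, h_lin.predict(X_train)),
      "E_v =", mean_squared_error(y_val, h_lin.predict(X_val)))

for degree in [1, 2, 3]:                       # task 5: vary the polynomial degree
    poly = PolynomialFeatures(degree)          # task 4: polynomial feature augmentation
    Z_train = poly.fit_transform(X_train)
    Z_val = poly.transform(X_val)
    for lam in [0.01, 0.1, 1.0]:               # task 6: vary the ridge parameter
        h = Ridge(alpha=lam).fit(Z_train, y_train)
        E_t = mean_squared_error(y_train, h.predict(Z_train))
        E_v = mean_squared_error(y_val, h.predict(Z_val))
        print(f"degree={degree}, lambda={lam}: E_t={E_t:.3f}, E_v={E_v:.3f}")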

3 Lecture - “FL Design Principle”
Lecture 2 reviewed ML methods that use numeric arrays to store data and model parameters. We have also discussed ERM as a design principle for practical ML systems. This lecture extends these concepts to FL applications. Section 3.2 introduces empirical graphs to store collections of local datasets and corresponding parameters of local models. Section 3.3 presents our main design principle for FL systems. This principle uses the variation of local model parameters across the edges of an empirical graph for the coupling (or regularization) of the individual local models.

3.1 Learning Goals

After this lecture, you should

• be familiar with the concept of an empirical graph,

• know how connectivity is related to the spectrum of the Laplacian matrix,

• know some measures for the variation of local models,

• be familiar with the concept of GTVMin.

3.2 Empirical Graphs and Their Laplacian

Consider a FL application that involves a collection of local datasets D^(1), . . . , D^(n). Our goal is to train a personalized model H^(i) for each local dataset D^(i), with i = 1, . . . , n. We represent such a collection of local datasets and (personal) local models, along with their relations, by an empirical graph. Figure 4 depicts an example of an empirical graph.

Figure 4: Example of an empirical graph whose nodes i ∈ V carry local datasets D^(i) and local models that are parametrized by local model parameters w^(i); an edge {i, i′} carries the weight A_{i,i′}.

An empirical graph is an undirected weighted graph G = (V, E) whose nodes V := {1, . . . , n} represent local datasets D^(i), for i ∈ V. Each node i ∈ V of the empirical graph G carries a separate local dataset D^(i).

To build intuition, think of a local dataset D^(i) as a labelled dataset

D^(i) := {(x^(i,1), y^(i,1)), . . . , (x^(i,m_i), y^(i,m_i))}.    (3.1)

Here, x^(i,r) and y^(i,r) denote, respectively, the features and the label of the rth data point in the local dataset D^(i). Note that the size m_i of the local dataset might vary between different nodes i ∈ V.

It is convenient to collect the feature vectors x^(i,r) and labels y^(i,r) into a feature matrix X^(i) and label vector y^(i), respectively,

X^(i) := (x^(i,1), . . . , x^(i,m_i))^T, and y^(i) := (y^(i,1), . . . , y^(i,m_i))^T.    (3.2)

The local dataset D^(i) can then be represented compactly by the matrix X^(i) ∈ R^{m_i×d} and the vector y^(i) ∈ R^{m_i}.
Besides its local dataset D^(i), each node i ∈ V also carries a local model H^(i). Within this course, we focus on local models that are parametrized by local model parameters w^(i) ∈ R^d, for i = 1, . . . , n. The usefulness of a specific choice for the local model parameters w^(i) is measured by a local loss function L_i(w^(i)), for i = 1, . . . , n.


An undirected edge {i, i′} ∈ E between two different nodes i, i′ ∈ V couples the training of the corresponding local models H^(i), H^(i′). We quantify the strength of this coupling by a positive edge weight A_{i,i′} > 0. The coupling will be implemented by penalizing the discrepancy between local model parameters w^(i) and w^(i′) (see Section 3.3). Unless noted otherwise, we measure this discrepancy by the squared Euclidean distance ∥w^(i) − w^(i′)∥_2^2.
We can characterize the connectivity of an empirical graph via the eigenvalues and eigenvectors of its Laplacian matrix L ∈ R^{n×n}. The Laplacian matrix is defined element-wise as

L_{i,i′} := −A_{i,i′} for i ≠ i′ with {i, i′} ∈ E;  L_{i,i′} := ∑_{i″≠i} A_{i,i″} for i = i′;  L_{i,i′} := 0 else.    (3.3)

The Laplacian matrix is psd, which follows from the identity

w^T (L ⊗ I) w = ∑_{{i,i′}∈E} A_{i,i′} ∥w^(i) − w^(i′)∥_2^2

for any w := stack{w^(i)}_{i=1}^n = ((w^(1))^T, . . . , (w^(n))^T)^T ∈ R^{nd}    (3.4)

(here, I is the d × d identity matrix; for d = 1 the quadratic form reduces to w^T L w). Since the matrix L is psd, all its eigenvalues are real-valued and non-negative. We denote its increasingly ordered eigenvalues by

0 ≤ λ_1 ≤ λ_2 ≤ . . . ≤ λ_n.    (3.5)
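The following minimal numpy sketch (the toy graph, edge weights and model parameters are arbitrary illustrations) constructs the Laplacian matrix (3.3), computes its ordered eigenvalues (3.5) and numerically verifies the identity (3.4):

import numpy as np

# Minimal sketch: Laplacian (3.3) of a small empirical graph, its ordered
# eigenvalues (3.5), and the identity (3.4).
n, d = 4, 2
A = np.zeros((n, n))
for i, j, weight in [(0, 1, 1.0), (0, 2, 2.0), (2, 3, 0.5)]:  # edges with weights A_{i,i'}
    A[i, j] = A[j, i] = weight

L = np.diag(A.sum(axis=1)) - A                 # Laplacian via (3.3)
print("eigenvalues:", np.linalg.eigvalsh(L))   # lambda_2 > 0 iff the graph is connected

w = np.random.default_rng(5).normal(size=(n, d))     # local model parameters w^(i)
w_stacked = w.reshape(-1)                            # stack{w^(i)}
tv = w_stacked @ np.kron(L, np.eye(d)) @ w_stacked   # LHS of (3.4)
tv_direct = sum(A[i, j] * np.sum((w[i] - w[j]) ** 2)
                for i in range(n) for j in range(i + 1, n))  # RHS of (3.4)
print("identity (3.4) holds:", np.isclose(tv, tv_direct))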

According to (3.4), we can measure the total variation of local model parameters by stacking them into a single vector w ∈ R^{nd} and computing the quadratic form w^T (L ⊗ I) w.
One immediate consequence of (3.4) is that any collection of identical local model parameters, w^(i) = w^(i′) for all i, i′ ∈ V, results in a vector

c = ((w^(1))^T, . . . , (w^(n))^T)^T    (3.6)

that is an eigenvector with eigenvalue λ = 0. Thus, the eigenvalue λ = 0 coincides with the smallest eigenvalue λ_1 (see (3.5)).
The second eigenvalue λ2 of the Laplacian matrix provides a great deal of
information about the connectivity structure of G. Consider the case λ2 = 0,
i.e., besides c there is another eigenvector with zero eigenvalue. Then, the
graph G contains two subsets (components) of nodes that do not have any
edge between them.
On the other hand, if λ2 > 0 then G is connected. Moreover, the larger
the value of λ2 , the stronger the connectivity between the nodes in G. We
next show how to make this vague statement more precise via the identity
(3.4).
The total variation on the RHS of (3.4) is a measure for the connectivity. Indeed, if we assume that the local model parameters w^(i) are different, then adding an edge will increase the total variation. We can lower bound this total variation as

∑_{{i,i′}∈E} A_{i,i′} ∥w^(i) − w^(i′)∥_2^2 ≥ λ_2 ∑_{i=1}^{n} ∥w^(i) − m̄∥_2^2.    (3.7)

Here, m̄ = (1/n) ∑_{i=1}^{n} w^(i) is the average of all local model parameters. The quantity ∑_{i=1}^{n} ∥w^(i) − m̄∥_2^2 has a geometric interpretation: It is the squared Euclidean norm of the projection of the stacked local model parameters w := ((w^(1))^T, . . . , (w^(n))^T)^T onto the orthogonal complement of the subspace

{(a^T, . . . , a^T)^T, for some a ∈ R^d} ⊆ R^{dn}.    (3.8)

Figure 5: Left: Some empirical graph G consisting of n = 4 nodes. Right: Equivalent fully connected empirical graph G′ with the same nodes and edge weights A′_{i,i′} = A_{i,i′} for {i, i′} ∈ E and A′_{i,i′} = 0 for {i, i′} ∉ E.

It might be convenient to replace a given empirical graph G with an equivalent fully connected empirical graph G′ (see Figure 5). The graph G′ has an edge between each pair of different nodes i, i′,

E′ := {{i, i′} : i, i′ ∈ V, i ≠ i′}.

The edge weights are chosen as A′_{i,i′} = A_{i,i′} for any edge {i, i′} ∈ E and A′_{i,i′} = 0 otherwise.
Note that the undirected edges E of an empirical graph encode a symmetric notion of similarity between local datasets: If the local dataset D^(i) at node i is similar to the local dataset D^(i′) at node i′, i.e., {i, i′} ∈ E, then also the local dataset D^(i′) is similar to the local dataset D^(i).
3.3 Generalized Total Variation Minimization

Consider data with an empirical graph G whose nodes i ∈ V carry local datasets D^(i) and local models parametrized by the vectors w^(i). To learn these parameter vectors, we try to minimize their local loss and, at the same time, enforce a small total variation. The optimal balance is obtained via solving the following optimization problem, which we refer to as generalized total variation (GTV) minimization,

{ŵ^(i)}_{i∈V} ∈ argmin_{{w^(i)}} ∑_{i∈V} L_i(w^(i)) + λ ∑_{i,i′∈V} A_{i,i′} ∥w^(i) − w^(i′)∥_2^2    (GTVMin).    (3.9)
Note that GTVMin is an instance of RERM: The regularizer is the total variation of local model parameters over the weighted edges A_{i,i′} of the empirical graph. Clearly, the empirical graph is an important design choice for GTVMin-based methods. This choice can be guided by computational aspects and statistical aspects of GTVMin-based FL systems.

Some application domains allow to leverage domain expertise to guess a useful choice for the empirical graph. If local datasets are generated at different geographic locations, we might use nearest neighbor graphs based on geodesic distances between data generators (e.g., FMI weather stations). Lecture 7 will also discuss graph learning methods that determine edge weights A_{i,i′} in a fully data-driven fashion.
Let us now consider the special case of GTVMin where each local model is
a linear model. For each node i ∈ V of the empirical graph, we want to learn
the parameters w^(i) of a linear hypothesis h^(i)(x) := (w^(i))^T x. We measure
the quality of the weights via the average squared error loss
\[
L_{i}\big(\mathbf{w}^{(i)}\big) := (1/m_{i}) \sum_{r=1}^{m_{i}} \Big( y^{(i,r)} - \big(\mathbf{w}^{(i)}\big)^{T} \mathbf{x}^{(i,r)} \Big)^{2} \overset{(3.2)}{=} (1/m_{i}) \big\| \mathbf{y}^{(i)} - \mathbf{X}^{(i)} \mathbf{w}^{(i)} \big\|_{2}^{2}. \tag{3.10}
\]

Inserting (3.10) into (3.9) yields the following instance of GTVMin to
train local linear models,
\[
\big\{ \widehat{\mathbf{w}}^{(i)} \big\}_{i=1}^{n} \in \operatorname*{argmin}_{\{\mathbf{w}^{(i)}\}} \sum_{i \in \mathcal{V}} (1/m_{i}) \big\| \mathbf{y}^{(i)} - \mathbf{X}^{(i)} \mathbf{w}^{(i)} \big\|_{2}^{2} + \lambda \sum_{i,i' \in \mathcal{V}} A_{i,i'} \big\| \mathbf{w}^{(i)} - \mathbf{w}^{(i')} \big\|_{2}^{2}. \tag{3.11}
\]

The identity (3.4) allows us to rewrite the GTVMin instance (3.11) as
\[
\widehat{\mathbf{w}} = \operatorname{stack}\big\{ \widehat{\mathbf{w}}^{(i)} \big\}_{i \in \mathcal{V}} \in \operatorname*{argmin}_{\mathbf{w} = \operatorname{stack}\{\mathbf{w}^{(i)}\}_{i \in \mathcal{V}}} \sum_{i=1}^{n} (1/m_{i}) \big\| \mathbf{y}^{(i)} - \mathbf{X}^{(i)} \mathbf{w}^{(i)} \big\|_{2}^{2} + \lambda \mathbf{w}^{T} \big( \mathbf{L} \otimes \mathbf{I} \big) \mathbf{w}. \tag{3.12}
\]

Let us rewrite the objective function in (3.12) as
\[
\mathbf{w}^{T} \Bigg( \begin{pmatrix} \mathbf{Q}^{(1)} & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{Q}^{(2)} & \cdots & \mathbf{0} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{Q}^{(n)} \end{pmatrix} + \lambda \mathbf{L} \otimes \mathbf{I} \Bigg) \mathbf{w} + \Big( \big(\mathbf{q}^{(1)}\big)^{T}, \ldots, \big(\mathbf{q}^{(n)}\big)^{T} \Big) \mathbf{w}, \tag{3.13}
\]
with Q^(i) = (1/m_i) (X^(i))^T X^(i) and q^(i) := (−2/m_i) (X^(i))^T y^(i).

Thus, like linear regression (2.7) and ridge regression (2.20), GTVMin
(3.12) (for local linear models H^(i)) minimizes a convex quadratic function,
\[
\min_{\mathbf{w}} \mathbf{w}^{T} \mathbf{Q} \mathbf{w} + \mathbf{q}^{T} \mathbf{w}. \tag{3.14}
\]
Here, we used the psd matrix
\[
\mathbf{Q} := \begin{pmatrix} \mathbf{Q}^{(1)} & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{Q}^{(2)} & \cdots & \mathbf{0} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{Q}^{(n)} \end{pmatrix} + \lambda \mathbf{L} \otimes \mathbf{I} \quad \text{with } \mathbf{Q}^{(i)} := (1/m_{i}) \big(\mathbf{X}^{(i)}\big)^{T} \mathbf{X}^{(i)} \tag{3.15}
\]
and the vector
\[
\mathbf{q} := \Big( \big(\mathbf{q}^{(1)}\big)^{T}, \ldots, \big(\mathbf{q}^{(n)}\big)^{T} \Big)^{T}, \quad \text{with } \mathbf{q}^{(i)} := (-2/m_{i}) \big(\mathbf{X}^{(i)}\big)^{T} \mathbf{y}^{(i)}. \tag{3.16}
\]

3.3.1 Computational Aspects of GTVMin

Lecture 5 will apply optimization methods to solve GTVMin, resulting in
practical FL algorithms. Different instances of GTVMin favor different classes
of optimization methods. For example, using a differentiable loss function
allows us to apply gradient-based methods (see Lecture 4) to solve GTVMin.
Another important class of loss functions are those for which we can
efficiently compute the proximity operator
\[
\operatorname{prox}_{L,\rho}(\mathbf{w}) := \operatorname*{argmin}_{\mathbf{w}'} L(\mathbf{w}') + (\rho/2) \| \mathbf{w} - \mathbf{w}' \|_{2}^{2} \quad \text{for some } \rho > 0.
\]
Some authors refer to functions L for which prox_{L,ρ}(w) can be computed
easily as simple or proximable [32]. GTVMin with proximable loss functions
can be solved quite efficiently via proximal algorithms [33].
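As a concrete illustration, the average squared error loss (3.10) is proximable:
setting the gradient of L(w') + (ρ/2)||w − w'||_2^2 to zero yields a linear system
in w'. The following is a minimal numpy sketch of this closed-form proximity
operator (the toy data is a hypothetical placeholder):

```python
import numpy as np

def prox_sq_loss(w, X, y, rho):
    """Proximity operator of L(w') = (1/m) ||y - X w'||_2^2 at the point w.

    The zero-gradient condition of the prox objective reads
    ((2/m) X^T X + rho I) w' = (2/m) X^T y + rho w.
    """
    m, d = X.shape
    lhs = (2 / m) * X.T @ X + rho * np.eye(d)
    rhs = (2 / m) * X.T @ y + rho * w
    return np.linalg.solve(lhs, rhs)

# Toy usage with hypothetical data.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(20, 3)), rng.normal(size=20)
w_prox = prox_sq_loss(np.zeros(3), X, y, rho=1.0)
```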
Besides influencing the choice of optimization method, the design choices
underlying GTVMin also determine the amount of computation needed by a
given optimization method. For example, using an empirical graph with rela-
tively few edges (“sparse graphs”) typically results in a smaller computational

complexity. Indeed, Lecture 5 discusses GTVMin-based algorithms requiring
an amount of computation that is proportional to the number of edges in the
empirical graph.
Let us now consider the computational aspects of GTVMin (3.11) for
training local linear models. As discussed above, this instance is equivalent to
solving (3.14). Any solution ŵ of (3.14) is characterized by the zero-gradient
condition
\[
\mathbf{Q} \widehat{\mathbf{w}} = -(1/2) \mathbf{q}, \tag{3.17}
\]
with Q, q as defined in (3.15) and (3.16).
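The following is a minimal numpy sketch of this characterization for a small
hypothetical FL network (a chain graph with n = 3 nodes and synthetic local
datasets): it assembles Q and q according to (3.15) and (3.16) and solves the
linear system (3.17):

```python
import numpy as np
from scipy.linalg import block_diag

# Hypothetical setup: n = 3 nodes, d = 2 features, chain graph 1-2-3.
rng = np.random.default_rng(0)
n, d, lam = 3, 2, 0.5
Xs = [rng.normal(size=(5, d)) for _ in range(n)]   # local feature matrices X^(i)
ys = [rng.normal(size=5) for _ in range(n)]        # local label vectors y^(i)
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])                       # edge weights A_{i,i'}

# Laplacian matrix of the empirical graph.
L = np.diag(A.sum(axis=1)) - A

# Assemble Q and q according to (3.15) and (3.16).
Q = block_diag(*[(1 / X.shape[0]) * X.T @ X for X in Xs]) + lam * np.kron(L, np.eye(d))
q = np.concatenate([(-2 / X.shape[0]) * X.T @ y for X, y in zip(Xs, ys)])

# Zero-gradient condition (3.17): Q w = -(1/2) q.
w_hat = np.linalg.solve(Q, -0.5 * q).reshape(n, d)  # row i holds w^(i)
```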

3.3.2 Statistical Aspects of GTVMin

The empirical graph should contain a sufficient number of edges between nodes
that carry statistically similar local datasets. This allows regularization
techniques to adaptively pool local datasets into clusters of (approximately)
homogeneous data (see Section 6.3).
Statistically, we want the loss function to favour local model parameters
that result in a robust and accurate trained local model, with model
parameters ŵ^(i), for each node i ∈ V.

3.4 Assignment

Python source: Assignment_FLDesignPrinciple.py

This assignment revolves around a collection of temperature measurements
that we store in the empirical graph G^(FMI). Each node i ∈ V represents
an FMI weather station and stores a local dataset D^(i). The local dataset
contains m_i temperature measurements y^(i,1), . . . , y^(i,m_i). The edges of G^(FMI)
connect each FMI station i to its nearest neighbors i'. All edges {i,i'} ∈ E
have the same edge weight A_{i,i'} = 1.
For each station i ∈ V, you need to learn the single parameter w^(i) ∈ R
of a hypothesis h(x) = w^(i) that predicts the temperature. We measure
the quality of a hypothesis by the average squared error loss
\[
L_{i}\big(w^{(i)}\big) = (1/m_{i}) \sum_{r=1}^{m_{i}} \big( y^{(i,r)} - w^{(i)} \big)^{2}.
\]
You should learn the parameters w^(i) by balancing the local loss with
the total variation of w^(i),


\[
\big\{ \widehat{w}^{(i)} \big\}_{i=1}^{n} \in \operatorname*{argmin}_{\{w^{(i)}\}_{i=1}^{n}} \sum_{i=1}^{n} L_{i}\big(w^{(i)}\big) + \lambda \sum_{\{i,i'\} \in \mathcal{E}} \big( w^{(i)} - w^{(i')} \big)^{2}. \tag{3.18}
\]

Your tasks are

1. Reformulate (3.18) as (3.12) using a suitable choice for the features x^(i,r).

2. Based on this reformulation, characterize the solutions ŵ^(i) of (3.18) via
(3.17) (determine the matrix Q and the vector q in terms of the temperature
measurements and the Laplacian matrix L^(FMI) of the empirical graph
G^(FMI)).

4 Lecture - “Gradient Methods”
Lecture 3 introduced GTVMin as a central design principle for FL methods.
Several important instances of GTVMin amount to minimizing a smooth
objective function over (a subset of) the parameter space R^d. This lecture
discusses gradient-based methods, a widely used family of iterative algorithms
for minimizing a smooth function. These methods share a core idea:
approximate the objective function locally using its gradient at the current
choice for the model parameters. Lecture 5 discusses FL algorithms obtained
from a direct application of gradient-based methods to solving GTVMin.

4.1 Learning Goals

After this lecture, you should

• understand the effect of a gradient step for a smooth and strongly


convex objective function

• understand the role of the step size or learning rate,

• know some stopping criterion,

• be able to analyze the effect of perturbations in the gradient step,

• know about projected GD to cope with constraints on model parameters.

4.2 The Basic Idea of the Gradient Step

Gradient-based methods are iterative algorithms for finding the minimum
of a differentiable function f(w). One example of such a function is the
objective function (2.3) of linear regression. A gradient step updates a current
choice w^(curr) along the opposite direction of the gradient ∇f(w) at the
current choice,
\[
\mathbf{w}^{(\mathrm{new})} := \mathbf{w}^{(\mathrm{curr})} - \alpha \nabla f\big(\mathbf{w}^{(\mathrm{curr})}\big). \tag{4.1}
\]
The gradient step (4.1) involves a factor α, which is referred to as the step
size or learning rate.
The usefulness of gradient-based methods depends crucially on the difficulty
of evaluating the gradient. Evaluating the gradient of a given function
has been made convenient by modern software libraries (such as PyTorch) that
provide quite efficient methods for computing the gradient (autograd/back-
prop/...). However, besides the actual computation of the gradient, it might
be challenging to gather the required data points which define the objective
function (empirical risk).
Algorithm 1 summarizes the most basic instance of gradient-based methods.

Algorithm 1 A blueprint for gradient-based methods

Input: function f(w); learning rate α > 0; some stopping criterion
Initialize: set w^(0) := 0; set iteration counter r := 0
1: repeat
2:   r := r + 1 (increase iteration counter)
3:   w^(r) := w^(r−1) − α∇f(w^(r−1)) (do a gradient step (4.1))
4: until stopping criterion is met
Output: ŵ := w^(r) (hopefully f(ŵ) ≈ min_w f(w))
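A minimal Python implementation of Algorithm 1 might look as follows; the
gradient-norm stopping criterion and the toy objective (the empirical risk of
linear regression) are hypothetical choices, since Algorithm 1 leaves both open:

```python
import numpy as np

def gradient_method(grad_f, d, alpha=0.1, tol=1e-6, max_iter=1000):
    """Blueprint of Algorithm 1 for a differentiable function with gradient grad_f."""
    w = np.zeros(d)                    # initialize w^(0) := 0
    for _ in range(max_iter):          # repeat ...
        g = grad_f(w)
        if np.linalg.norm(g) < tol:    # ... until stopping criterion is met
            break
        w = w - alpha * g              # gradient step (4.1)
    return w

# Toy usage: minimize the empirical risk f(w) = (1/m) ||y - X w||_2^2.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 3)), rng.normal(size=50)
w_hat = gradient_method(lambda w: (2 / len(y)) * X.T @ (X @ w - y), d=3)
```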

4.3 Hyperparameters of Gradient-Based Methods

Note that Algorithm 1, like most other gradient-based methods, involves at
least two hyperparameters: (i) the learning rate α used for the gradient step
and (ii) a stopping criterion that is used to decide when to stop repeating the
gradient step.

4.4 Perturbed Gradient Step

4.5 Constraints

4.6 Assignment

5 Lecture - “FL Algorithms”
This lecture applies the gradient-based methods from Lecture 4 to solve
GTVMin from Lecture 3. The resulting FL algorithms can be implemented
by message passing over the edges of the empirical graph.

5.1 Learning Goals

After this lecture, you should

• be able to derive the gradient for GTVMin with local linear models.

• be able to implement the gradient step for GTVMin as message passing

5.2 Gradient Step for GTVMin

5.3 Message Passing Implementation
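As a first impression of such an implementation, here is a minimal sketch of one
gradient step for GTVMin (3.9), assuming differentiable local loss functions:
each node i updates its parameters using only its local gradient and the current
parameters w^(i') received from its neighbours. The gradient callables `grads`,
the edge weights `A`, and the factor 2 in the coupling term (which presumes that
the GTV penalty sums over undirected edges) are assumptions of this sketch.

```python
import numpy as np

def gtvmin_gradient_step(W, grads, A, alpha, lam):
    """One gradient step for GTVMin, organized as message passing.

    W     : (n, d) array; row i holds the current local parameters w^(i)
    grads : list of callables; grads[i](w) returns the gradient of L_i at w
    A     : (n, n) symmetric matrix of edge weights of the empirical graph
    """
    n = W.shape[0]
    W_new = np.empty_like(W)
    for i in range(n):
        # Messages received by node i: the parameters w^(i') of its neighbours.
        # The factor 2 assumes that the GTV penalty sums over undirected edges.
        coupling = 2 * lam * sum(A[i, j] * (W[i] - W[j])
                                 for j in range(n) if A[i, j] > 0)
        W_new[i] = W[i] - alpha * (grads[i](W[i]) + coupling)
    return W_new
```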

5.4 Assignment

6 Lecture - “FL Main Flavors”
Lecture 3 discussed GTVMin as a main design principle for the FL algorithms
that were obtained in Lecture 5 by applying some of the gradient-based
methods from Lecture 4. This lecture discusses some important special cases
of GTVMin that are obtained for specific choices of the underlying empirical
graph.

6.1 Learning Goals

After this lecture, you should know about the following main flavours of FL:

• centralized FL

• clustered FL

• horizontal FL

• vertical FL

6.2 Centralized FL

6.3 Clustered FL

Many applications generate local datasets which do not carry sufficient statis-
tical power to guide the learning of model parameters w^(i) (see Section ??). As
a case in point, consider a local dataset D^(i) of the form (3.1), with feature
vectors x^(r) ∈ R^d and sample size m_i ≪ d. We would like to learn the
parameter vector w^(i) of a linear hypothesis h(x) = x^T w^(i).

6.4 Horizontal FL

6.5 Vertical FL

6.6 Assignment

7 Lecture - “Graph Learning”
Lecture 3 discussed GTVMin as a main design principle for FL algorithms.
The computational and statistical properties of these algorithms crucially
depend on the choice for the empirical graph. In some applications, domain
expertise can guide the choice for the empirical graph. However, it might
be useful to learn the empirical graph in a data-driven fashion. This lecture
discusses some of these graph learning techniques.

7.1 Learning Goals

After this lecture, you should

• understand the role of empirical graphs as a crucial design choice for


GTVMin-based methods (computational aspects, statistical aspects)

• know some quantitative measures for the similarity between local


datasets

• be able to learn a graph from given pairwise similarities and structural
constraints (e.g., bounded node degree)

7.2 Measuring (Dis-)Similarity Between Datasets

The above informal notion of similarity between local datasets can be
made precise via a probabilistic model. Here, we interpret the local dataset
D^(i) as realizations of RVs with some parametrized probability distribution
p^(i)(D^(i); w^(i)).
The discrepancy (or lack of similarity) between local datasets D^(i) and
D^(i') could then be defined via the Euclidean distance
\[
d^{(i,i')} := \big\| \mathbf{w}^{(i)} - \mathbf{w}^{(i')} \big\|_{2}
\]
between the parameters of the probability distributions.
If local datasets consist of a single numeric measurement y^(i), we can use
the discrepancy measure d^(i,i') := |y^(i) − y^(i')| [34].

7.3 Graph Learning Methods

Assume we have constructed a useful measure d^(i,i') ∈ R_+ for the discrepancy
between local datasets D^(i), D^(i'). We can then formulate the problem of
learning the edge weights A_{i,i'} ∈ R_+ as the optimization problem
\[
\min_{A_{i,i'}} \sum_{i,i' \in \mathcal{V}} A_{i,i'} d^{(i,i')}. \tag{7.1}
\]
Unfortunately, the formulation (7.1) is not useful as it can be solved by the
trivial choice A_{i,i'} = 0 for all i, i' ∈ V. We therefore need to enforce the
presence of some edges (with positive weight) by adding constraints to (7.1).
For example, we might require
\[
A_{i,i} = 0 \text{ for all } i \in \mathcal{V} \,, \quad \sum_{i' \neq i} A_{i,i'} = d_{\max} \text{ for all } i \in \mathcal{V} \,, \quad A_{i,i'} \in [0,1] \text{ for all } i, i' \in \mathcal{V}. \tag{7.2}
\]
The constraints (7.2) require that each node i is connected to other nodes
with a total edge weight \sum_{i' \neq i} A_{i,i'} = d_{\max}. We can interpret the
parameter d_max as an effective node degree.


We combine the constraints (7.2) with (7.1) to obtain the following graph

2
learning principle,

bi,i′ ∈ argmin Ai,i′ d(i,i′ )


A (7.3)
Ai,i′

Ai,i′ ∈ [0, 1] for all i, i′ ∈ V,

Ai,i = 0 for all i ∈ V,


X
Ai,i′ = dmax for all i ∈ V.
i′ ̸=i
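For an integer effective node degree d_max, the problem (7.3) decouples over the
rows of A and is solved by assigning weight one to the d_max neighbours with the
smallest discrepancies. The following is a minimal numpy sketch; the pairwise
discrepancies are computed from hypothetical scalar measurements as in
Section 7.2:

```python
import numpy as np

def learn_graph(D, d_max):
    """Solve the graph learning problem (7.3) row by row.

    D     : (n, n) array of pairwise discrepancies d^(i,i')
    d_max : integer effective node degree
    """
    n = D.shape[0]
    A = np.zeros((n, n))
    for i in range(n):
        d = D[i].astype(float)
        d[i] = np.inf                    # enforces A_{i,i} = 0
        nearest = np.argsort(d)[:d_max]  # the d_max smallest discrepancies
        A[i, nearest] = 1.0              # weight one; all other weights zero
    return A

# Toy usage: discrepancies d^(i,i') = |y^(i) - y^(i')| from scalar measurements.
y = np.array([0.1, 0.2, 2.0, 2.1])
D = np.abs(y[:, None] - y[None, :])
A_hat = learn_graph(D, d_max=1)
```

Note that the constraints in (7.3) do not enforce symmetric edge weights; if a
symmetric empirical graph is required, one might symmetrize the learnt weights,
e.g., via max{Â_{i,i'}, Â_{i',i}}.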

7.4 Assignment

8 Lecture - “Trustworthy FL”
This lecture discusses some key requirements for trustworthy AI that have been
put forward by the European Union. We will also see how these requirements
might guide the design choices for GTVMin. Our focus will be on three
design criteria: robustness, privacy protection, and explainability. This lecture
discusses the robustness and explainability of the basic linear regression that we
encountered in Lecture 2. We will see that regularization techniques allow us to
navigate robustness-explainability-accuracy trade-offs.

8.1 Learning Goals

After this lecture, you should

• know some key requirements for trustworthy AI

• be familiar with quantitative measures of robustness and explainability

• have some intuition about how robustness, privacy, and transparency
guide the design choices for local models, loss functions, and the empirical
graph in GTVMin

We can use the upper bound (2.16) to study the effect of perturbing
features and labels of data points.

9 Lecture - “Privacy-Protection in FL”
FL is inherently based on sharing information. Without any information
sharing between the owners (or generators) of local datasets, FL is not
possible.

9.1 Learning Goals

After this lecture, you should

• be aware of threats to privacy and the need to protect it,

• know some quantitative measures for privacy leakage,

• understand the effect of GTVMin design choices on user privacy,

• be able to implement FL algorithms with privacy guarantees.

9.2 Assignment

10 Lecture - “Data and Model Poisoning in FL”
This lecture discusses the robustness of FL systems against data poisoning,
which is a specific type of cyber attack.

10.1 Learning Goals

After this lecture, you should

1. be aware of threats posed by data and model poisoning

2. understand how GTVMin design choices result in more or less safe FL


algorithms

10.2 Data Poisoning

10.3 Model Poisoning

There is a trade-off between privacy protection and robustness against model


poisoning attacks. From [35]: Without secure aggregation, privacy is lost, but
the aggregator may attempt to filter out “anomalous” contributions. Since the
weights of a model created using [...]

10.4 Assignment

Glossary
activation function Each artificial neuron within an ANN consists of an
activation function that maps the inputs of the neuron to a single output
value. In general, an activation function is a non-linear map of the
weighted sum of neuron inputs (this weighted sum is the activation of
the neuron). 13

artificial intelligence Artificial intelligence aims to develop systems that
behave rationally in the sense of maximizing a long-term reward. 8

artificial neural network An artificial neural network is a graphical (signal-


flow) representation of a map from features of a data point at its input
to a predicted label at its output. 1, 4, 5, 10, 11, 13, 16, 22

baseline A reference value or benchmark for the average loss incurred by


a hypothesis when applied to the data points generated in a specific
ML application. Such a reference value might be obtained from human
performance (e.g., error rate of dermatologists diagnosing cancer from
visual inspection of skin areas) or other ML methods (“competitors”).
9–11

Bayes estimator A hypothesis h whose Bayes risk is minimal [29]. 1, 9

Bayes risk We use the term Bayes risk as a synonym for the risk or expected
loss of a hypothesis. Some authors reserve the term Bayes risk for the
risk of a hypothesis that achieves minimum risk, such a hypothesis being
referred to as a Bayes estimator [29]. 1

bias Consider some unknown quantity w̄, e.g., the true weight in a linear
model y = w̄x + e relating the feature and label of a data point. We might
use an ML method (e.g., based on ERM) to compute an estimate ŵ
for w̄ based on a set of data points that are realizations of RVs.
The (squared) bias incurred by the estimate ŵ is typically defined
as B^2 := (E{ŵ} − w̄)^2. We extend this definition to vector-valued
quantities using the squared Euclidean norm B^2 := ||E{ŵ} − w̄||_2^2. 12

classification Classification is the task of determining a discrete-valued label


y of a data point based solely on its features x. The label y belongs
to a finite set, such as y ∈ {−1, 1} or y ∈ {1, . . . , 19}, and represents a
category to which the corresponding data point belongs. 11

computational aspects By computational aspects of a ML method, we


mainly refer to the computational resources required for its imple-
mentation. For example, if a ML method uses iterative optimization
techniques to solve ERM, then its computational aspects include (i)
how many arithmetic operations are needed to implement a single
iteration (gradient step) and (ii) how many iterations are needed to
obtain useful model parameters. One important example for an iterative
optimization technique is GD. 1, 4, 6, 12, 13

convex A set C ⊆ Rd is convex if it contains the line segment between any


two points of that set. We define a function as convex if its epigraph is
a convex set [28]. 3, 7, 8, 14, 18, 24

covariance matrix The covariance matrix of a RV x ∈ R^d is defined as
E{(x − E{x})(x − E{x})^T}. 9, 13, 15

data A (indexed) set of data points. 1, 2, 4, 7, 17

data augmentation Data augmentation methods add synthetic data points


to an existing set of data points. These synthetic data points might be
obtained by perturbations (adding noise) or transformations (rotations
of images) of the original data points. 12, 13

data point A data point is any object that conveys information [36]. Data
points might be students, radio signals, trees, forests, images, RVs, real
numbers or proteins. We characterize data points using two types of
properties. One type of property is referred to as a feature. Features
are properties of a data point that can be measured or computed in an
automated fashion. Another type of property is referred to as labels.
The label of a data point represents some higher-level fact (or quantity
of interest). In contrast to features, determining the label of a data point
typically requires human experts (domain experts). Roughly speaking,
ML aims at predicting the label of a data point based solely on its
features. 1–14, 16, 17, 19–21, 23, 24

data poisoning FL methods allow to leverage the information contained in


local datasets generated by other parties to improve the training of a
tailored model. Depending on how much we trust the other parties, FL
can be compromised by data poisoning. Data poisoning refers to the
intentional manipulation (or fabrication) of local datasets to steer the
training of a specific local model [37, 38]. 2

dataset With a slight abuse of notation we use the terms “dataset” or “set
of data points” to refer to an indexed list of data points z^(1), z^(2), . . ..
Thus, there is a first data point z^(1), a second data point z^(2) and so on.
Strictly speaking a dataset is a list and not a set [39]. By using indexed
lists of data points we avoid some of the challenges arising in the concept
of an abstract set. 2, 3, 5, 7, 9, 10, 12–15, 17

decision region Consider a hypothesis map h that delivers values from a


finite set Y. We refer to the set of features x ∈ X that result in the
same output h(x) = a as a decision region of the hypothesis h. 13

decision tree A decision tree is a flow-chart like representation of a hypoth-


esis map h. More formally, a decision tree is a directed graph which
reads in the feature vector x of a data point at its root node. The root
node then forwards the data point to one of its children nodes based on
some elementary test on the features x. If the receiving children node
is not a leaf node, i.e., it has itself children nodes, it represents another
test. Based on the test result, the data point is further pushed to one of
its neighbours. This testing and forwarding of the data point is repeated
until the data point ends up in a leaf node (having no children nodes).
The leaf nodes represent sets (decision regions) constituted by feature
vectors x that are mapped to the same function value h(x). 5, 10, 11,
13

deep net We refer to an ANN with a (relatively) large number of hidden


layers as a deep ANN or “deep net”. Deep nets are used to represent
the hypothesis spaces of deep learning methods [40]. 11

differentiable A function f : Rd → R is differentiable if it has a gradient
∇f (x) everywhere (for every x ∈ Rd ) [5]. 1, 8, 9, 13, 21, 22

discrepancy Consider a FL application with networked data represented by


an empirical graph. FL methods use a discrepancy measure to compare
hypothesis maps from local models at nodes i, i′ connected by an edge
in the empirical graph. 2, 3

edge weight Each edge {i, i'} of an empirical graph is assigned an edge
weight A_{i,i'}. 2, 3, 5, 6, 10

effective dimension The effective dimension deff (H) of an infinite hypoth-


esis space H is a measure of its size. Loosely speaking, the effective
dimension is equal to the number of “independent” tunable parameters
of the model. These parameters might be the coefficients used in a
linear map or the weights and bias terms of an ANN. 10, 11

eigenvalue We refer to a number λ ∈ R as eigenvalue of a square matrix


A ∈ Rd×d if there is a non-zero vector x ∈ Rd \ {0} such that Ax = λx.
3–5, 7, 13

eigenvector An eigenvector of a matrix A is a non-zero vector x ∈ Rd \ {0}


such that Ax = λx with some eigenvalue λ. 3, 4

empirical graph Empirical graphs represent collections of local datasets and
corresponding local models [41]. An empirical graph is an undirected
weighted graph whose nodes carry local datasets and models.
FL methods learn a local hypothesis h^(i), for each node i ∈ V, such that
it incurs a small loss on the local dataset. 1–3, 5–11, 13, 14
empirical risk The empirical risk of a given hypothesis on a given set of
data points is the average loss of the hypothesis computed over all data
points in that set. 2, 6, 12, 15, 22

empirical risk minimization Empirical risk minimization is the optimiza-


tion problem of finding the hypothesis with minimum average loss (or
empirical risk) on a given set of data points (the training set). Many
ML methods are special cases of empirical risk minimization. 1–3, 5–7, 9, 11, 12, 15,
19, 23

estimation error Consider data points with feature vectors x and label y.
In some applications we can model the relation between features and
label of a data point as y = h̄(x) + ε. Here we used some true hypothesis
h̄ and a noise term ε which might represent modelling or labelling errors.
The estimation error incurred by a ML method that learns a hypothesis
ĥ, e.g., using ERM, is defined as ĥ − h̄. For a parametrized hypothesis
space, consisting of hypothesis maps that are determined by a parameter
vector w, we define the estimation error in terms of parameter vectors
as ∆w = ŵ − w. 7, 8

Euclidean space The Euclidean space R^d of dimension d refers to the space
of all vectors x = (x_1, . . . , x_d)^T, with real-valued entries x_1, . . . , x_d ∈ R,
whose geometry is defined by the inner product x^T x' = \sum_{j=1}^{d} x_j x'_j
between any two vectors x, x' ∈ R^d [5]. 3, 7, 19

feature A feature of a data point is one of its properties that can be measured
or computed in an automated fashion. For example, if a data point is a

bitmap image, then we could use the red-green-blue intensities of its
pixels as features. Some widely used synonyms for the term feature
are “covariate”,“explanatory variable”, “independent variable”, “input
(variable)”, “predictor (variable)” or “regressor” [42–44]. However, this
book makes consistent use of the term features for low-level properties
of data points that can be measured easily. 1–7, 9–16, 19–21

feature map A map that transforms the original features of a data point
into new features. The so-obtained new features might be preferable
over the original features for several reasons. For example, the shape of
datasets might become simpler in the new feature space, allowing to
use linear models in the new features. Another reason could be that the
number of new features is much smaller, which is preferable in terms of
avoiding overfitting. The special case of feature maps that deliver two
numeric features is particularly useful for data visualization. Indeed,
we can then depict data points in a scatterplot by using these two
features as the coordinates of a data point. 13

feature matrix Consider a dataset D of m data points that are characterized
by feature vectors x^(1), . . . , x^(m). It is convenient to collect these feature
vectors into a feature matrix X := (x^(1), . . . , x^(m))^T. 2, 7


feature space The feature space of a given ML application or method is


constituted by all potential values that the feature vector of a data
point can take on. Within this book the most frequently used choice
for the feature space is the Euclidean space Rd with dimension d being
the number of individual features of a data point. 9, 10, 13

federated learning (FL) Federated learning is an umbrella term for ML
methods that train models in a collaborative fashion using decentralized
data and computation. 1–9, 16, 17

Finnish Meteorological Institute The Finnish Meteorological Institute


is a government agency responsible for gathering and reporting weather
data in Finland. 2, 6, 9, 10, 14

General Data Protection Regulation The General Data Protection Reg-


ulation (GDPR) is a law that has been passed by the European Union
(EU) and put into effect on May 25, 2018 https://gdpr.eu/tag/gdpr/.
The GDPR imposes obligations onto organizations anywhere, so long as
they target, collect or in any other way process data related to people
(i.e., personal data) in the EU. 5

generalized total variation Generalized total variation measures the changes


of vector-valued node attributes of a graph. 6, 9

gradient For a real-valued function f : R^d → R : w ↦ f(w), a vector g such
that
\[
\lim_{\mathbf{w} \to \mathbf{w}'} \frac{f(\mathbf{w}) - \big( f(\mathbf{w}') + \mathbf{g}^{T} (\mathbf{w} - \mathbf{w}') \big)}{\| \mathbf{w} - \mathbf{w}' \|} = 0
\]
is referred to as the gradient of f at w'. If such a vector exists, it is denoted
∇f(w') or ∇f(w)|_{w'} [5]. 1–5, 8, 9, 13, 21, 24

gradient descent (GD) Gradient descent is an iterative method for finding


the minimum of a differentiable function f (w). 1, 2, 4, 7, 11, 12, 18

gradient step Given a differentiable real-valued function f(w) and a vector
w', the gradient step updates w' by adding the scaled negative gradient
−α∇f(w'), i.e., w' ↦ w' − α∇f(w'). 1–5, 18
gradient-based method Gradient-based methods are iterative algorithms
for finding the minimum (or maximum) of a differentiable objective func-
tion of the model parameters. These algorithms construct a sequence
of approximations to an optimal choice for model parameters that re-
sults in a minimum objective function value. As their name indicates,
gradient-based methods use the gradients of the objective function eval-
uated during previous iterations to construct new (hopefully) improved
model parameters. 1–5, 8, 11, 22

graph A graph G = (V, E) is a pair that consists of a node set V and an


edge set E. In general, a graph is specified by a map that assigns to
each edge e ∈ E a pair of nodes [45]. One important family of graphs
(simple undirected graphs) is obtained by identifying each edge e ∈ E
with two different nodes {i, i′ }. Weighted graphs also specify numeric
weights Ae for each edge e ∈ E. 5, 6

GTV minimization GTV minimization is an instance of RERM using the


GTV of local model parameters as a regularizer. 1, 6–9

hypothesis A map (or function) h : X → Y from the feature space X to the


label space Y. Given a data point with features x we use a hypothesis
map h to estimate (or approximate) the label y using the predicted
label ŷ = h(x). ML is about learning (or finding) a hypothesis map h
such that y ≈ h(x) for any data point. 1–16, 19–23

hypothesis space Every practical ML method uses a specific hypothesis


space (or model) H. The hypothesis space of a ML method is a sub-
set of all possible maps from the feature space to label space. The

design choice of the hypothesis space should take into account available
computational resources and statistical aspects. If the computational
infrastructure allows for efficient matrix operations, and there is a (ap-
proximately) linear relation between features and label, a useful choice
for the hypothesis space might be the linear model. 1, 4, 5, 10–14, 16,
20, 23, 24

i.i.d. It can be useful to interpret data points z^(1), . . . , z^(m) as realizations of
independent and identically distributed RVs with a common probability
distribution. If these RVs are continuous, their joint probability density
function (pdf) is p(z^(1), . . . , z^(m)) = \prod_{r=1}^{m} p(z^(r)), with p(z) being the
common marginal pdf of the underlying RVs. 3, 6, 10, 11, 13, 21

i.i.d. assumption The i.i.d. assumption interprets data points of a dataset


as the realizations of i.i.d. RVs. 6, 12, 21

interpretability A ML method is interpretable for a specific user if she can
well anticipate the predictions delivered by the method. The notion of
interpretability can be made precise using quantitative measures of the
uncertainty about the predictions [47]. 12

label A higher level fact or quantity of interest associated with a data point.
If a data point is an image, its label might be the fact that it shows a cat
(or not). Some widely used synonyms for the term label are "response
variable", "output variable" or "target" [42–44]. 1–3, 5, 6, 9–14, 16, 19,
21, 23

label space Consider a ML application that involves data points charac-
terized by features and labels. The label space is constituted by all
potential values that the label of a data point can take on. Regres-
sion methods, aiming at predicting numeric labels, often use the label
space Y = R. Binary classification methods use a label space that
consists of two different elements, e.g., Y = {−1, 1}, Y = {0, 1} or
Y = {“cat image”, ”no cat image”}. 9

Laplacian matrix The geometry or structure of a graph G can be analyzed


using the properties of special matrices that are associated with G. One
such matrix is the graph Laplacian matrix L which is defined for an
undirected and weighted graph (e.g., the empirical graph of networked
data) [48, 49]. 1, 3, 4, 10

law of large numbers The law of large numbers refers to the convergence
of the average of an increasing (large) number of i.i.d. RVs to the mean
(or expectation) of their common probability distribution. Different
instances of the law of large numbers are obtained using different notions
of convergence. 11

learning rate Consider an iterative method for finding or learning a good


choice for a hypothesis. Such an iterative method repeats similar
computational (update) steps that adjust or modify the current choice
for the hypothesis to obtain an improved hypothesis. A prime example
for such an iterative learning method is GD and its variants. We refer
by learning rate to any parameter of an iterative learning method that
controls the extent by which the current hypothesis might be modified

or improved in each iteration. A prime example for such a parameter
is the step size used in GD. Some authors use the term learning rate
mostly as a synonym for the step size of (a variant of) GD 1–3, 11, 22

learning task A learning task consists of a specific choice for a collection
of data points (e.g., all images stored in a particular database), their
features and labels. 3, 12

least absolute shrinkage and selection operator (Lasso) The least ab-
solute shrinkage and selection operator (Lasso) is an instance of struc-
tural risk minimization (SRM) for learning the weights w of a linear map
h(x) = wT x. The Lasso minimizes the sum consisting of an average
squared error loss (as in linear regression) and the scaled ℓ1 norm of the
weight vector w. 20

linear model We use the term linear model in a very specific sense. In
particular, a linear model is a hypothesis space which consists of all
linear maps,
\[
\mathcal{H}^{(d)} := \big\{ h(\mathbf{x}) = \mathbf{w}^{T} \mathbf{x} \,:\, \mathbf{w} \in \mathbb{R}^{d} \big\}. \tag{10.1}
\]
Note that (10.1) defines an entire family of hypothesis spaces, which is
parametrized by the number d of features that are linearly combined
to form the prediction h(x). The design choice of d is guided by
computational aspects (smaller d means less computation), statistical
aspects (increasing d might reduce prediction error) and interpretability
(a linear model using few carefully chosen features might be considered
interpretable). 1–3, 5–7, 9–11, 15
linear regression Linear regression aims at learning a linear hypothesis map
to predict a numeric label based on numeric features of a data point.
The quality of a linear hypothesis map is measured using the average
squared error loss incurred on a set of labeled data points (which we
refer to as training set). 1–4, 7, 8, 12–14

local dataset The concept of a local dataset is in-between the concept of a


data point and a dataset. A local dataset consists of several individual
data points which are characterized by features and labels. In contrast
to a single dataset used in basic ML methods, a local dataset is also
related to other local datasets via different notions of similarities. These
similarities might arise from probabilistic models or communication
infrastructure and are encoded in the edges of an empirical graph. 1–3,
5–9, 13, 14, 16

local model Consider a collection of local datasets that are assigned to the
nodes of an empirical graph. A local model H^(i) is a hypothesis space
that is assigned to a node i ∈ V. Different nodes might be assigned
different hypothesis spaces, i.e., in general H^(i) ≠ H^(i') for different
nodes i, i' ∈ V. 1–6, 9, 14

loss With a slight abuse of language, we use the term loss either for the loss
function itself or for its value for a specific pair of a data point and a
hypothesis. 1, 5–15, 19–23

loss function A loss function is a map
\[
L : \mathcal{X} \times \mathcal{Y} \times \mathcal{H} \to \mathbb{R}_{+} : \big( (\mathbf{x}, y), h \big) \mapsto L((\mathbf{x}, y), h)
\]
which assigns to a pair consisting of a data point, with features x and label
y, and a hypothesis h ∈ H the non-negative real number L((x, y), h).
The loss value L((x, y), h) quantifies the discrepancy between the true
label y and the predicted label h(x). Smaller (closer to zero) values
L((x, y), h) mean a smaller discrepancy between the predicted label and the
true label of a data point. Figure 6 depicts a loss function for a given
data point, with features x and label y, as a function of the hypothesis
h ∈ H. 1–3, 5, 7–9, 13, 14, 20, 21

Figure 6: Some loss function L((x, y), h) for a fixed data point, with feature
vector x and label y, and varying hypothesis h. ML methods try to find
(learn) a hypothesis that incurs minimum loss.

model We use the term model as a synonym for hypothesis space. 1, 3–5, 7,
9–12, 16, 24

model parameters Model parameters are numbers that select a hypothesis


map out of a hypothesis space. 1, 2, 7, 9, 12, 13, 18

multivariate normal distribution The multivariate normal distribution
N(m, C) is an important family of probability distributions for a con-
tinuous RV x ∈ R^d [3, 50, 51]. This family is parametrized by the mean
m and the covariance matrix C of x. If the covariance matrix is invertible,
the probability distribution of x is
\[
p(\mathbf{x}) \propto \exp\Big( -(1/2) \big(\mathbf{x} - \mathbf{m}\big)^{T} \mathbf{C}^{-1} \big(\mathbf{x} - \mathbf{m}\big) \Big).
\]

mutual information The mutual information I(x; y) between two RVs x,
y defined on the same probability space is given by
\[
I(\mathbf{x}; \mathbf{y}) := \mathbb{E} \left\{ \log \frac{p(\mathbf{x}, \mathbf{y})}{p(\mathbf{x}) p(\mathbf{y})} \right\}.
\]
It is a measure for how well we can estimate y based solely on x. A
large value of I(x; y) means that y can be well predicted solely from
x. The prediction could be obtained by a hypothesis learnt by a ML
method. 17

objective function An objective function is a map that assigns each pos-
sible value of an optimization variable, such as the parameters w of a
hypothesis h^(w), to an objective value f(w). The objective value f(w)
could be the risk or the empirical risk of a hypothesis h^(w). 1, 7, 24

overfitting Consider a ML method that uses ERM to learn a hypothesis


with minimum empirical risk on a given training set. Such a method is
“overfitting” the training set if it learns a hypothesis with small empirical
risk on the training set but unacceptably large loss outside the training
set. 7, 10

parameters The parameters of a ML model are tunable (learnable or ad-
justable) quantities that allow to choose between different hypothesis
maps. For example, the linear model H := {h : h(x) = w1 x + w2 }
consists of all hypothesis maps h(x) = w1 x + w2 with a particular choice
for the parameters w1 , w2 . Another example of parameters are the
weights assigned to the connections of an ANN. 1, 3, 5–7, 9, 10

polynomial regression Polynomial regression aims at learning a polyno-
mial hypothesis map to predict a numeric label based on numeric
features of a data point. For data points characterized by a sin-
gle numeric feature, polynomial regression uses the hypothesis space
H_d^(poly) := {h(x) = \sum_{j=0}^{d-1} w_j x^j}. The quality of a polynomial hypothesis
map is measured using the average squared error loss incurred on a set
of labeled data points (which we refer to as training set). 11

positive semi-definite A symmetric matrix Q = QT ∈ Rd×d is referred to


as positive semi-definite if xT Qx ≥ 0 for every vector x ∈ Rd . 3, 7, 8,
13

prediction A prediction is an estimate or approximation for some quantity


of interest. ML revolves around learning or finding a hypothesis map h
that reads in the features x of a data point and delivers a prediction
yb := h(x) for its label y. 5, 6, 12, 15, 20

privacy leakage Consider a (ML or FL) system that processes a local dataset
D(i) and shares data, such as the predictions obtained for new data
points, with other parties. Privacy leakage arises if the shared data
carries information about a private (sensitive) feature of a data point

(which might be a human) of D(i) . The amount of privacy leakage can
be measured via mutual information using a probabilistic model for the
local dataset. 17

privacy protection Privacy protection aims at avoiding (or minimizing)
the privacy leakage occurring within data processing systems (such as
ML or FL methods). 1

probabilistic model A probabilistic model interprets data points as realiza-
tions of RVs with a joint probability distribution. This joint probability
distribution typically involves parameters which have to be manually
chosen (a design choice) or learnt via statistical inference methods [29].
1, 7–9, 13, 17, 21

probability density function (pdf) The probability density function (pdf)
p(x) of a real-valued RV x ∈ R is a particular representation of its prob-
ability distribution. If the pdf exists, it can be used to compute the
probability that x takes on a value from a (measurable) set B ⊆ R
via p(x ∈ B) = ∫_B p(x') dx' [3, Ch. 3]. The pdf of a vector-valued RV
x ∈ R^d (if it exists) allows to compute the probability that x falls into a
(measurable) region R via p(x ∈ R) = ∫_R p(x') dx'_1 . . . dx'_d [3, Ch. 3]. 10

probability distribution The data generated in some ML applications
can be reasonably well modelled as realizations of a RV. The overall
statistical properties (or intrinsic structure) of such data are then
governed by the probability distribution of this RV. We use the term
probability distribution in a highly informal manner and mean the
collection of probabilities assigned to different values or value ranges
of a RV. The probability distribution of a binary RV y ∈ {0, 1} is fully
specified by the probabilities p(y = 0) and p(y = 1) = 1 − p(y = 0). The
probability distribution of a real-valued RV x ∈ R might be specified by
a probability density function p(x) such that p(x ∈ [a, b]) ≈ p(a)|b − a|.
In the most general case, a probability distribution is defined by a
probability measure [50, 52]. 1, 2, 6, 9–11, 14, 15, 17, 21

projected GD Projected GD extends basic GD for unconstrained opti-


mization to handle constraints on the optimization variable (model
parameters). A single iteration of projected GD consists of first taking
a gradient step and then projecting the result back into a constraint set.
1

proximable Convex functions for which the proximity operator can be
computed efficiently are sometimes referred to as “proximable” or “simple”
[32]. 8

proximity operator Given a convex function f and some ρ > 0, we define the
proximity operator of f at a vector w' as
\[
\operatorname{prox}_{f,\rho}(\mathbf{w}') := \operatorname*{argmin}_{\mathbf{w} \in \mathbb{R}^{d}} f(\mathbf{w}) + (\rho/2) \| \mathbf{w} - \mathbf{w}' \|_{2}^{2}.
\]
Convex functions for which the proximity operator can be computed
efficiently are sometimes referred to as “proximable” or “simple” [32]. 8,
18

quadratic function A quadratic function f(w), reading in a vector w ∈ R^d
as its argument, is such that
\[
f(\mathbf{w}) = \mathbf{w}^{T} \mathbf{Q} \mathbf{w} + \mathbf{q}^{T} \mathbf{w} + a,
\]
with some matrix Q ∈ R^{d×d}, vector q ∈ R^d and scalar a ∈ R. 7, 8, 14

random variable (RV) A random variable is a mapping from a probability


space P to a value space [52]. The probability space, whose elements
are elementary events, is equipped with a probability measure that
assigns a probability to subsets of P. A binary random variable maps
elementary events to a set containing two different values, e.g., {−1, 1}
or {cat, no cat}. A real-valued random variable maps elementary events
to real numbers R. A vector-valued random variable maps elementary
events to the Euclidean space Rd . Probability theory uses the concept
of measurable spaces to rigorously define and study the properties of
(large) collections of random variables [50, 52]. 1–3, 6, 9–12, 15, 17–21,
24

realization Consider a RV x which maps each element (outcome, or ele-


mentary event) ω ∈ P of a probability space P to an element a of a
measurable space N [5, 52, 53]. A realization of x is any element a′ ∈ N
such that there is an element ω ′ ∈ P with x(ω ′ ) = a′ . 1, 11–13, 17, 21

regression Regression problems revolve around predicting a
numeric label solely from the features of a data point. 11

regularization Regularization techniques modify the ERM principle such


that the learnt hypothesis performs well (generalizes) beyond the train-
ing set. One specific implementation of regularization is to add a penalty
or regularization term to the objective function of ERM (which is the
average loss on the training set). This regularization term can be in-
terpreted as an estimate for the increase in the expected loss (risk)

compared to the average loss on the training set. 1, 4, 7, 9, 10, 12, 13,
19, 20, 22

regularized empirical risk minimization Synonym for SRM. 6, 8, 9

regularizer A regularizer assigns to each hypothesis h from a hypothesis space
H a quantitative measure R(h) for how much its prediction error on a
training set might differ from its prediction errors on data points outside
the training set. Ridge regression uses the regularizer R(h) := ||w||_2^2
for linear hypothesis maps h^(w)(x) := w^T x [4, Ch. 3]. The least
absolute shrinkage and selection operator (Lasso) uses the regularizer
R(h) := ||w||_1 for linear hypothesis maps h^(w)(x) := w^T x [4, Ch. 3].
6, 8, 9, 13

ridge regression Ridge regression learns the parameter (or weight) vector w
of a linear hypothesis map h^(w)(x) = w^T x. The quality of a particular
choice for the parameter vector w is measured by the sum of two
components. The first component is the average squared error loss
incurred by h^(w) on a set of labeled data points (the training set). The
second component is the scaled squared Euclidean norm λ||w||_2^2 with
a regularization parameter λ > 0. It can be shown that the effect
of adding λ||w||_2^2 to the average squared error loss is equivalent to
replacing the original data points by an ensemble of realizations of a
RV centered around these data points. 7, 12–15, 20

risk Consider a hypothesis h that is used to predict the label y of a data
point based on its features x. We measure the quality of a particular
prediction using a loss function L((x, y), h). If we interpret data points
as the realizations of i.i.d. RVs, the loss L((x, y), h) also becomes the
realization of a RV. Using such an i.i.d. assumption allows us to define the
risk of a hypothesis as the expected loss E{L((x, y), h)}. Note that
the risk of h depends on both the specific choice for the loss function
and the probability distribution of the data points. 1, 6, 9, 15, 19

scatterplot A visualization technique that depicts data points by markers


in a two-dimensional plane. 7

smooth We refer to a real-valued function as smooth if it is differentiable and
its gradient is continuous [54, 55]. In particular, a differentiable function
f(w) is referred to as β-smooth if the gradient ∇f(w) is Lipschitz
continuous with Lipschitz constant β, i.e.,
\[
\| \nabla f(\mathbf{w}) - \nabla f(\mathbf{w}') \| \leq \beta \| \mathbf{w} - \mathbf{w}' \|.
\]
1, 3, 5, 24

squared error loss The squared error loss measures the prediction error of
a hypothesis h when predicting a numeric label y ∈ R from the features
x of a data point. It is defined as
\[
L((\mathbf{x}, y), h) := \big( y - \underbrace{h(\mathbf{x})}_{= \hat{y}} \big)^{2}. \tag{10.2}
\]
2, 3, 6, 7, 10, 13, 16, 20

statistical aspects By statistical aspects of a ML method, we refer to (prop-
erties of) the probability distribution of its output given a probabilistic
model for the data fed into the method. 1, 4, 6, 12

step size Many ML methods use iterative optimization methods (such as
gradient-based methods) to construct a sequence of increasingly accurate
hypothesis maps h(1) , h(2) , . . .. The rth iteration of such an algorithm
starts from the current hypothesis h(r) and tries to modify it to obtain
an improved hypothesis h(r+1) . Iterative algorithms often use a step
size (hyper-) parameter. The step size controls the amount by which a
single iteration can change or modify the current hypothesis. Since the
overall goal of such iterative ML methods is to learn an (approximately)
optimal hypothesis, we refer to a step size parameter also as a learning
rate. 1

stopping criterion Many ML methods use iterative algorithms that con-


struct a sequence of model parameters (such as the weights of a linear
map or the weights of an ANN) that (hopefully) converge to an optimal
choice for the model parameters. In practice, given finite computational
resources, we need to stop iterating after a finite number of times. A
stopping criterion is any well-defined condition required for stopping
iterating. 1–3

strongly convex A continuously differentiable real-valued function f (x) is


strongly convex with coefficient σ if f (y) ≥ f (x) + ∇f (x)T (y − x) +
(σ/2) ∥y − x∥22 [54], [56, Sec. B.1.1.]. 1

structural risk minimization Structural risk minimization is the problem


of finding the hypothesis that optimally balances the average loss (or
empirical risk) on a training set with a regularization term. The regu-
larization term penalizes a hypothesis that is not robust against (small)

perturbations of the data points in the training set. 12, 20

training error The average loss of a hypothesis when predicting the labels
of data points in a training set. We sometimes refer by training error
also to the minimum average loss incurred on the training set by any
hypothesis out of a hypothesis space. 9–12, 15, 23

training set A set of data points that is used in ERM to learn a hypothesis
ĥ. The average loss of ĥ on the training set is referred to as the training
error. The comparison between training error and validation error of ĥ
allows to diagnose ML methods and informs how to improve them (e.g.,
using a different hypothesis space or collecting more data points). 3,
5–16, 19, 20, 22, 23

validation Consider a hypothesis ĥ that has been learnt via ERM on some
training set D. Validation refers to the practice of trying out a hypothesis
ĥ on a validation set that consists of data points that are not contained
in the training set D. 1, 8

validation error Consider a hypothesis ĥ which is obtained by ERM on a
training set. The average loss of ĥ on a validation set, which is different
from the training set, is referred to as the validation error. 9–12, 15, 23,
24

validation set A set of data points that have not been used as a training set
in ERM to learn a hypothesis ĥ. The average loss of ĥ on the validation
set is referred to as the validation error and used to diagnose the ML
method (see [4, Sec. 6.6.]). The comparison between training error
and validation error can inform directions for improvements of the ML
method (such as using a different hypothesis space). 8, 9, 11, 12, 15, 23

variance The variance of a real-valued RV x is defined as the expectation
E{(x − E{x})^2} of the squared difference between x and its expectation
E{x}. We extend this definition to vector-valued RVs x as
E{||x − E{x}||_2^2}. 12

weights We use the term weights synonymously for a finite set of parameters
within a model. For example, the linear model consists of all linear
maps h(x) = w^T x that read in a feature vector x = (x_1, . . . , x_d)^T of a
data point. Each specific linear map is characterized by specific choices
for the parameters (weights) w = (w_1, . . . , w_d)^T. 13

zero-gradient condition Consider the unconstrained optimization problem
min_{w ∈ R^d} f(w) with a smooth and convex objective function f(w). A
necessary and sufficient condition for a vector ŵ ∈ R^d to solve this
problem is that the gradient ∇f(ŵ) is the zero-vector,
\[
\nabla f\big(\widehat{\mathbf{w}}\big) = \mathbf{0} \;\Leftrightarrow\; f\big(\widehat{\mathbf{w}}\big) = \min_{\mathbf{w} \in \mathbb{R}^{d}} f(\mathbf{w}).
\]
24
References
[1] W. Rudin, Real and Complex Analysis, 3rd ed. New York: McGraw-Hill,
1987.

[2] G. H. Golub and C. F. Van Loan, Matrix Computations, 3rd ed. Balti-
more, MD: Johns Hopkins University Press, 1996.

[3] D. Bertsekas and J. Tsitsiklis, Introduction to Probability, 2nd ed. Athena


Scientific, 2008.

[4] A. Jung, Machine Learning: The Basics, 1st ed. Springer Singapore,
Feb. 2022.

[5] W. Rudin, Principles of Mathematical Analysis, 3rd ed. New York:


McGraw-Hill, 1976.

[6] M. Wollschlaeger, T. Sauter, and J. Jasperneite, “The future of industrial


communication: Automation networks in the era of the internet of things
and industry 4.0,” IEEE Industrial Electronics Magazine, vol. 11, no. 1,
pp. 17–27, 2017.

[7] M. Satyanarayanan, “The emergence of edge computing,” Computer,


vol. 50, no. 1, pp. 30–39, Jan. 2017. [Online]. Available: https:
//doi.org/10.1109/MC.2017.9

[8] H. Ates, A. Yetisen, F. Güder, and C. Dincer, “Wearable devices for the
detection of covid-19,” Nature Electronics, vol. 4, no. 1, pp. 13–14, 2021.
[Online]. Available: https://doi.org/10.1038/s41928-020-00533-1

[9] H. Boyes, B. Hallaq, J. Cunningham, and T. Watson, “The
industrial internet of things (iiot): An analysis framework,”
Computers in Industry, vol. 101, pp. 1–12, 2018. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S0166361517307285

[10] S. Cui, A. Hero, Z.-Q. Luo, and J. Moura, Eds., Big Data over Networks.
Cambridge Univ. Press, 2016.

[11] A. Barabási, N. Gulbahce, and J. Loscalzo, “Network medicine: a network-


based approach to human disease,” Nature Reviews Genetics, vol. 12,
no. 56, 2011.

[12] M. E. J. Newman, Networks: An Introduction. Oxford Univ. Press,


2010.

[13] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas,


“Communication-efficient learning of deep networks from decentralized
data,” in Proceedings of the 20th International Conference on Artificial
Intelligence and Statistics, ser. Proceedings of Machine Learning
Research, A. Singh and J. Zhu, Eds., vol. 54. Fort Lauderdale, FL,
USA: PMLR, 20–22 Apr 2017, pp. 1273–1282. [Online]. Available:
http://proceedings.mlr.press/v54/mcmahan17a.html

[14] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, “Federated learning:


Challenges, methods, and future directions,” IEEE Signal Processing
Magazine, vol. 37, no. 3, pp. 50–60, May 2020.

[15] Y. Cheng, Y. Liu, T. Chen, and Q. Yang, “Federated learning for privacy-
preserving ai,” Communications of the ACM, vol. 63, no. 12, pp. 33–36,
Dec. 2020.

[16] N. Agarwal, A. Suresh, F. Yu, S. Kumar, and H. McMahan, “cpSGD:


Communication-efficient and differentially-private distributed sgd,” in
Proc. Neural Inf. Proc. Syst. (NIPS), 2018.

[17] V. Smith, C.-K. Chiang, M. Sanjabi, and A. Talwalkar, “Federated


Multi-Task Learning,” in Advances in Neural Information Processing
Systems, vol. 30, 2017. [Online]. Available: https://proceedings.neurips.
cc/paper/2017/file/6211080fa89981f66b1a0c9d55c61d0f-Paper.pdf

[18] J. You, J. Wu, X. Jin, and M. Chowdhury, “Ship compute


or ship data? why not both?” in 18th USENIX Symposium
on Networked Systems Design and Implementation (NSDI 21).
USENIX Association, April 2021, pp. 633–651. [Online]. Available:
https://www.usenix.org/conference/nsdi21/presentation/you

[19] D. Tse and P. Viswanath, Fundamentals of Wireless Communication.


Cambridge University Press, 2005.

[20] T. Yang, G. Andrew, H. Eichner, H. Sun, W. Li, N. Kong,


D. Ramage, and F. Beaufays, “Applied federated learning: Improving
google keyboard query suggestions,” 2018. [Online]. Available:
https://arxiv.org/abs/1812.02903

[21] A. Ghosh, J. Chung, D. Yin, and K. Ramchandran, “An efficient frame-


work for clustered federated learning,” in 34th Conference on Neural

Information Processing Systems (NeurIPS 2020), Vancouver, Canada,
2020.

[22] F. Sattler, K. Müller, and W. Samek, “Clustered federated learning:


Model-agnostic distributed multitask optimization under privacy con-
straints,” IEEE Transactions on Neural Networks and Learning Systems,
2020.

[23] G. Strang, Computational Science and Engineering. Wellesley-


Cambridge Press, MA, 2007.

[24] ——, Introduction to Linear Algebra, 5th ed. Wellesley-Cambridge


Press, MA, 2016.

[25] H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone


Operator Theory in Hilbert Spaces. New York: Springer, 2011.

[26] F. Pedregosa, “Scikit-learn: Machine learning in python,” Journal


of Machine Learning Research, vol. 12, no. 85, pp. 2825–2830, 2011.
[Online]. Available: http://jmlr.org/papers/v12/pedregosa11a.html

[27] J. Hirvonen and J. Suomela. (2023) Distributed algorithms 2020.

[28] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, UK:


Cambridge Univ. Press, 2004.

[29] E. L. Lehmann and G. Casella, Theory of Point Estimation, 2nd ed.


New York: Springer, 1998.

[30] A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau,
and S. Thrun, “Dermatologist-level classification of skin cancer with deep
neural networks,” Nature, vol. 542, 2017.

[31] H. Lütkepohl, New Introduction to Multiple Time Series Analysis. New


York: Springer, 2005.

[32] L. Condat, “A primal–dual splitting method for convex optimization


involving lipschitzian, proximable and linear composite terms,” Journal
of Opt. Th. and App., vol. 158, no. 2, pp. 460–479, Aug. 2013.

[33] N. Parikh and S. Boyd, “Proximal algorithms,” Foundations and Trends


in Optimization, vol. 1, no. 3, pp. 123–231, 2013.

[34] S. Chepuri, S. Liu, G. Leus, and A. Hero, “Learning sparse graphs under
smoothness prior,” in Proc. of the IEEE Int. Conf. on Acoustics, Speech
and Signal Processing, 2017, pp. 6508–6512.

[35] E. Bagdasaryan, A. Veit, Y. Hua, D. Estrin, and V. Shmatikov, “How


to backdoor federated learning,” in Proceedings of the Twenty Third
International Conference on Artificial Intelligence and Statistics, ser.
Proceedings of Machine Learning Research, S. Chiappa and R. Calandra,
Eds., vol. 108. PMLR, 26–28 Aug 2020, pp. 2938–2948. [Online].
Available: https://proceedings.mlr.press/v108/bagdasaryan20a.html

[36] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed.


New Jersey: Wiley, 2006.

[37] X. Liu, H. Li, G. Xu, Z. Chen, X. Huang, and R. Lu, “Privacy-enhanced
federated learning against poisoning adversaries,” IEEE Transactions on
Information Forensics and Security, vol. 16, pp. 4574–4588, 2021.

[38] J. Zhang, B. Chen, X. Cheng, H. T. T. Binh, and S. Yu, “Poisongan:


Generative poisoning attacks against federated learning in edge com-
puting systems,” IEEE Internet of Things Journal, vol. 8, no. 5, pp.
3310–3322, 2021.

[39] P. Halmos, Naive set theory. Springer-Verlag, 1974.

[40] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press,


2016.

[41] O. Chapelle, B. Schölkopf, and A. Zien, Eds., Semi-Supervised Learning.


Cambridge, Massachusetts: The MIT Press, 2006.

[42] D. Gujarati and D. Porter, Basic Econometrics. Mc-Graw Hill, 2009.

[43] Y. Dodge, The Oxford Dictionary of Statistical Terms. Oxford University


Press, 2003.

[44] B. Everitt, Cambridge Dictionary of Statistics. Cambridge University


Press, 2002.

[45] R. T. Rockafellar, Network Flows and Monotropic Optimization. Athena


Scientific, Jul. 1998.

[46] C. Lampert, “Kernel methods in computer vision,” Foundations and


Trends in Computer Graphics and Vision, 2009.

[47] A. Jung and P. Nardelli, “An information-theoretic approach to person-
alized explainable machine learning,” IEEE Sig. Proc. Lett., vol. 27, pp.
825–829, 2020.

[48] U. von Luxburg, “A tutorial on spectral clustering,” Statistics and Com-


puting, vol. 17, no. 4, pp. 395–416, Dec. 2007.

[49] A. Y. Ng, M. I. Jordan, and Y. Weiss, “On spectral clustering: Analysis


and an algorithm,” in Adv. Neur. Inf. Proc. Syst., 2001.

[50] R. Gray, Probability, Random Processes, and Ergodic Properties, 2nd ed.
New York: Springer, 2009.

[51] A. Lapidoth, A Foundation in Digital Communication. New York:


Cambridge University Press, 2009.

[52] P. Billingsley, Probability and Measure, 3rd ed. New York: Wiley, 1995.

[53] P. R. Halmos, Measure Theory. New York: Springer, 1974.

[54] Y. Nesterov, Introductory lectures on convex optimization, ser. Applied


Optimization. Kluwer Academic Publishers, Boston, MA, 2004,
vol. 87, a basic course. [Online]. Available: http://dx.doi.org/10.1007/
978-1-4419-8853-9

[55] S. Bubeck, “Convex optimization. algorithms and complexity.” in Foun-


dations and Trends in Machine Learning. Now Publishers, 2015, vol. 8.

[56] D. P. Bertsekas, Convex Optimization Algorithms. Athena Scientific,


2015.

