0% found this document useful (0 votes)

13 views17 pages

Machine Learning and Geodesy A Survey

Uploaded by

oleksandr.m.hrabovyi

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

13 views17 pages

Machine Learning and Geodesy A Survey

Uploaded by

oleksandr.m.hrabovyi

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 17

J. Appl.

Geodesy 2021; 15(2): 117–133

Jemil Butt*, Andreas Wieser, Zan Gojcic, and Caifa Zhou

Machine learning and geodesy: A survey

https://doi.org/10.1515/jag-2020-0043 E[X] of the random variable X with independent samples
Received October 7, 2020; accepted January 29, 2021 Xk and realizations xk . This is due to the fact that x̄ is a
Abstract: The goal of classical geodetic data analysis is solution to the optimization problem
often to estimate distributional parameters like expected n
2
values and variances based on measurements that are sub- x̄ = argmin ∑ (x̂ − xk ) (1)
x∈ℝ k=1
ject to uncertainty due to unpredictable environmental ef-
̂

fects and instrument specific noise. Its traditional roots and its performance as measured by the variance of the
and focus on analytical solutions at times require strong residual random variable X̄ − X = n−1 ∑nk=1 Xk − X for any
prior assumptions regarding problem specification and new independent observation X
underlying probability distributions that preclude suc-
2 2
cessful application in practical cases for which the goal is 2 1 n 1 n
not regression in presence of Gaussian noise. σX−X = E [( ∑ Xk − X) ] − E [ ∑ Xk − X]
n k=1 n k=1
̄

Machine learning methods are more flexible with re- 1

spect to assumed regularity of the input and the form of = E[X 2 ] + E[X 2 ] (2)
n
the desired outputs and allow for nonparametric stochas-
tic models at the cost of substituting easily analyzable gets better (lower standard deviation) with increasing
closed form solutions by numerical schemes. This article sample size n. Although typically one expects from ma-
aims at examining common grounds of geodetic data anal- chine learning algorithms a more complex interaction with
ysis and machine learning and showcases applications the data, the above example is instructive in the sense that
of algorithms for supervised and unsupervised learning the task is to be expressed in the language of a probabilis-
to tasks concerned with optimal estimation, signal sep- tically motivated optimization problem upon which it is
aration, danger assessment and design of measurement solved employing numerical routines. The exact way in
strategies that occur frequently and naturally in geodesy. which this optimization is carried out will be of no con-
cern in this article; instead priority is given to an intuitive
Keywords: Geodesy, Adjustment theory, Machine learn- explanation of the correspondence between learning task
ing, Hilbert spaces, Kernel methods and equivalent optimization problem as the authors have
found lacking clarity in this area to be the main impedi-
ment to understanding machine learning algorithms act-
1 Introduction ing in high dimensional or even infinite dimensional set-
tings.
One widely adopted definition of machine learning de- Presumably due to missing or unclear links failing to
scribes it as the study of algorithms whose performance explain the relation between classical adjustment theory
on a specific task increases with experience [22]. Here ex- and machine learning algorithms, the latter are rarely used
perience is usually quantified by the amount of data and in geodesy. Not considering photogrammetry and remote
performance is measured by a function that includes intu- sensing, whose ties particularly to computer vision are un-
ition on when a result of the algorithm could be considered deniably strong and result in extensive usage of learning
desirable. algorithms, references are rather limited.
In this generality, the definition encompasses actions In the field of physical geodesy, numerical reasons
as simple as taking the arithmetic mean x̄ = n−1 ∑nk=1 xk of a arising during manipulation of ill-conditioned normal ma-
dataset {xk }nk=1 as an estimator for the expected value μX = trices for example in gravity field estimation [20] have
lead to widespread use of regularization which — as we
*Corresponding author: Jemil Butt, ETH Zürich, Institute of Geodesy will show in section 2 — is mathematically equivalent to
and Photogrammetry, Stefano-Franscini-Platz 5, CH-8093 Zürich, statistical inference with a prior on some function space.
Switzerland, e-mail: [email protected]
Apart from this slightly hidden link, individual publica-
Andreas Wieser, Zan Gojcic, Caifa Zhou, ETH Zürich, Institute of
Geodesy and Photogrammetry, Stefano-Franscini-Platz 5, CH-8093
tions have addressed directly concrete applications rang-
Zürich, Switzerland, e-mails: [email protected], ing from system identification employing neural-networks
[email protected], [email protected] [24] to the use of support vector machines for velocity
118 | J. Butt et al., ML and geodesy

field interpolation in the context of landslide monitoring at first glance might seem obscure in this setting but will be
[30] and thereby hinted at some of the potential of ma- demonstrated to work reasonably well and arise naturally
chine learning methods for typical geodetic core-tasks. when the viewpoint developed in section 2 is taken.
[29] stake out the role artificial intelligence might have to In those toy problems a dataset containing total sta-
play in geodesy and list several algorithms; however, they tion observations is subjected to a kernel based time series
focus more on possible future developments whereas we analysis to separate signal from noise, classified by a sup-
want to make explicit mathematical equivalences and dif- port vector machine (SVM) as stable or instable and split
ferences in perspective between the data analytical ap- into maximally independent parts by kernel independent
proaches taken in geodesy and machine learning. component analysis (K-ICA). We hasten to note that the ex-
One distinguishes machine learning tasks regarding amples presented in this paper are of an illustrative nature
the given inputs and the desired outputs. When a set of in- before closing with a discussion of the results and an out-
dependent variables xk and corresponding response vari- look on potentially interesting and worthwhile future ap-
ables yk is given in the form of a sequence {(xk , yk )}nk=1 and plications.
the algorithm is supposed to closely emulate the mapping
f : xk 󳨃→ yk the task is said to be supervised [18, pp. 26–28].
When only a sequence {(xk )}nk=1 is given and structure is to
be found without further guidance the task is called un-
2 Adjustment and machine learning
supervised. Many intermediate shades exist between the
We proceed by applying adjustment theory to a simple 1 di-
two extremes; e. g. reinforcement learning in which an al-
mensional regression / interpolation problem. By tackling
gorithm — designed to find optimal strategies in a stochas-
the same task with geostatistical and functional analytic
tically changing environment — receives positive or nega-
methods, the connections to statistical inference and de-
tive feedback but no ground truth or optimal strategy is
terministic function approximation are highlighted. This
known that could serve to construct reference values yk
allows to coach adjustment theory in a learning frame-
[34]. This scheme of clustering machine learning tasks can
work. Both adjustment and machine learning procedures
be contrasted with a more output oriented one, in which a
make use of the same words but their meanings often differ
task is called regression if the output is numerical or clas-
considerably. To alleviate the confusion we will always de-
sification if it is categorical to name only the two most com-
fine the quantities appearing in this chapter strictly math-
mon formats [18, pp. 26–28].
ematically and we try to keep with the usual notational
The premises of geodetic data analysis as embodied
customs of the respective fields as far as no contradictions
by what is known as adjustment theory are typically nar-
arise. Furthermore, we hope that Table 1 provides a guide-
rower [5]: Measurements are sequences of real numbers
line to translate terminology between machine learning
{yk }nk=1 and there exists a set of parameters {λk }m k=1 such
m
and adjustment based approaches to estimation and urge
that its transform A({λk }k=1 ) by some function A resem-
the reader to briefly skim over it before entering the next
bles {yk }nk=n apart from a residual term that is assumed
section. However, it is by no means complete and the
to be entirely stochastic in nature [26, p. 137]. Several ex-
reader will have to fill in some of the missing pieces him
tensions exist most notably among them collocation; see
or herself as he or she advances through the text.
e. g. [23, 6]. The above problem is a supervised regression
The mode of presentation is geared towards parallel-
problem one could equally well tackle with different meth-
ing that of earlier survey articles establishing links be-
ods. In the next section classical least squares solutions
tween processing schemes in geodesy and various other
for a very basic estimation task are rederived from different
disciplines of science; we specifically recommend [16].
starting points. This will reveal differences in philosophies
between geodesy and machine learning regarding how to
pose a problem even though the calculations ultimately 2.1 Regression / interpolation problem
yield the same equation. The equivalence of adjustment to
an algorithm that may be considered as belonging to ma- Suppose n observations {yk }nk=1 are given together with the
chine learning (Gaussian process regression / Kriging) and locations {xk }nk=1 ⊂ X at which they were performed. The
one that surely does so (optimization in reproducing ker- goal is to estimate the values y(x) even for unobserved lo-
nel Hilbert spaces / splines) is shown and augmented with cations x ∈ X , see Fig. 1 for an illustration.
a Bayesian interpretation. Section 3 is devoted to toy ex- A typical set of assumptions and procedures to de-
amples from geodesy that defy being solved by adjustment rive a solution within an adjustment theoretic framework
theory and require algorithms from machine learning that would consist in the items listed below.
J. Butt et al., ML and geodesy | 119

λ∗ = (AT Σ−1 −1 T −1
v A) A Σv y. Then λ is an m-dimensional vec-
∗

tor, Aλ is an n-dimensional vector representing estima-

∗

tions of the noiseless ytrue at the observed locations and

ŷtrue (⋅) = ∑m j=1 λj gj (⋅) is a function of x ∈ X . The family
∗

{gj (⋅)}mj=1 of functions used to approximate y only enters the

problem formulation via the matrix A where each row of A
is a row vector aTk of the possibly nonlinear functions gj (⋅)
acting on x; (aTk )j = gj (xk ). The estimators ŷtrue (⋅) for dif-
ferent choices of function classes {gj }m j=1 (linear, cubic) are
shown in Fig. 1.
The probabilistic interpretation is quite straightfor-
Figure 1: A one dimensional illustration of the regression / interpo-
ward. Under the assumption that the stochastic model of v
lation problem posed above. Without prior knowledge it is not clear
which estimator — represented here by the various broken lines — is
being multivariate Gaussian with E[v] = 0 and E[v ⊗ v∗ ] =
the most appropriate. Σv is correct, one may write [27, p. 68]

L(λ, v) = fv (v|λ)
1
i) Assume there is an underlying deterministic function = (2π)−n/2 √det Σv exp [− v(λ)T Σ−1 v v(λ)]
2
of x depending linearly on a set of m parameters λ, i. e.
log fv (v|λ) = c1 − c2 ([Σ−1/2
v v(λ)]T [Σ−1/2
v v(λ)])
(ytrue )i = ∑m
j=1 λj gj (xi ) = Aλ with the n × m Matrix A
having entries (A)ij = gj (xi ). = c1 − c2 ‖Σ−1/2
v (Aλ − y)‖2ℓ2 (4)
ii) The deviations between ytrue and y are due to measure-
where c1 and c2 are constants and fv (v|λ) is the conditional
ment noise which is assumed to be multivariate Gaus-
probability density function of the random variable v rep-
sian with expected value zero and covariance matrix
resenting the residuals due to measurement error given pa-
Σv , preferably diagonal.
rameters λ and the distributional information about their
iii) Minimize the weighted sum of squares vT Σ−1 v v of resid-
means and covariances. Since log(⋅) is a monotonous func-
uals v(λ) = Aλ − y by choosing the optimal set λ∗ of
tion, the maximizer of log fv is also the maximizer of fv and
parameters λ.
the likelihood L(λ, v) implying that the least squares solu-
tion is a maximum likelihood estimator. Note at this point
We arrive at the following-Gauss Markov model [26, p. 137]:
that the likelihood L(λ, v) = fv (v|λ) is proportional to fλ (λ|v)
Aλ − y = v E[v] = 0 E[vi vj ] = (Σv )ij via Bayes rule [27, p. 60]
n T
y∈ℝ y = [y1 , . . . , yn ] yk = k-th observation ∞ −1

λ ∈ ℝm λ = [λ1 , . . . , λm ]T λk = k-th parameter fλ (λ|v) = fv (v|λ)fλ (λ) [ ∫ fv (v|λ)fλ (λ)dλ]

[−∞ ]
A ∈ ℝn ⊗ ℝm A = [a1 , . . . , an ]T
under an assumed uniform distribution for λ. Then the
with ai = [ai1 , . . . , aim ]T and (Aλ)k = aTk λ.
maximum likelihood estimate is actually the Bayesian
Bar some technicalities regarding invertibility of AT A maximum a posteriori estimate. Equation (4) conse-
the solution can be written as the estimator ŷtrue (x) = quently establishes a link between maximum likelihood
∑m estimation, norm minimization and, in special cases,
j=1 λj gj (x) where the parameters are optimal in the
∗

sense of being a minimizer for the discrepancy measure Bayesian inference.

‖Σ−1/2
v (ŷtrue − y) ‖2 , i. e.

λ∗ = argmin‖Σ−1/2
v (Aλ − y)‖2ℓ2 2.2 Adjustment as a learning task
λ∈ℝm
̃ − ỹ‖22
= argmin‖Aλ The adjustment approach to interpolation can be identi-
fied as a supervised regression problem. The set of tuples
ℓ
λ∈ℝm

= A ỹ
̃+ (3) {(xk , yk )}nk=1 are the training data, the goal is to approx-
imate the input-output relation between the {xk }nk=1 and
where A ̃ = Σ−1/2 A, ỹ = Σ−1/2 y and A
v v
̃ + is the pseudoinverse {yk }nk=1 where the decision variable is the target vector λ.
of A
̃ [33, p. 218]. Therefore the well known formula for λ∗ is Essentially nothing changes, if the pretext of an artificial
120 | J. Butt et al., ML and geodesy

interpolation problem is dropped; the solution of a linear 2.3 Adjustment, geostatistics and splines
adjustment problem in Gauss-Markov form can always be
written as [13, p. 93] In geostatistics, to solve the interpolation problem, one
would assume the observations yk to be realizations of a
m
stochastic process {Y(xk )}nk=1 with Y(x) ∈ L2 (Ω) a square
y(⋅)
̂ = ∑ λk∗ gk (⋅) (5)
k=1 integrable random variable for all x ∈ X and an estimator
λ∗ = argmin‖A(x)λ − y‖2H (6) Y(x)
̂ for Y(x) in general is sought. Assemble this estima-
λ∈ℝm tor as a function of the given random variables {Y(xk )}nk=1
in such a way as to minimize the expected square loss
where ⟨f , g⟩H = ⟨Σ−1 f , g⟩ℓ2 is the inner product in some 2
E[(Y(x)
̂ − Y(x)) ] which is the error variance of the estima-
Hilbert space. Here we wrote A(x) to explicitly document
tion.
that the design matrix contains nonlinear features in x —
It can be proven [25] that for a zero-mean Gaus-
a notion that is quite straightforward to interpret in the in-
terpolation case as sian process the best predictor Ŷ functionally dependent
on some set Yk = Y(xk ), k = 1, . . . , n is the condi-
g1 (x1 ) ... gm (x1 ) tional expectation, which is furthermore linear in its argu-
[
A(x) = [ .. .. ] ] ments.
[ . . ]
g 1 (x n) ... gm (xn )]
[ Y(x)
̂ = E[Y(x)|Y1 , . . . , Yn ] (9)
in this case contains e. g. polynomials in x. However in ar- n
Y(x)
̂ = ∑ αk Yk (10)
bitrary abstract adjustment problems, it might not always
k=1
be easy to identify what the independent variable {xk }nk=1
corresponds to, if just A(x) as a matrix of features is pro- Presupposing knowledge of the mean-zero joint Gaus-
vided. When for example the levelling problem (7) sian distribution, denote by σ(Y(x1 ), Y(x2 )) the covariance
E[Y(x1 )Y(x2 )] of the two random variables Y(x1 ), Y(x2 ) ∈
1 0 0 H1∗
[ 1 −1 0 ] H1 L2 (Ω); x1 , x2 ∈ X . To find these α for which Y(x)
̂ = ∑nk=1 αk Yk
[ ] [Δh1 ]
[ ]
[
[
]
] [ H2 ] ≈ [ ] (7) is the conditional expectation, minimize
[0 1 −1] [Δh2 ]
[H3 ]
[−1 0 1]
⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟ ⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟ [ Δh3 ]
⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟ E[(Y(x)
̂ 2
− Y(x)) ] = σ(Ŷ − Y, Ŷ − Y) =: σα2 (v(x)).
A(x) λ y

H1∗ : Approximately known height Since the covariance σ(⋅, ⋅) is bilinear in its arguments, this
Hk : Heights to be determined amounts to solving 𝜕/𝜕αk σα2 (v(x)) = 0, k = 1, . . . , n with
Δhk : Measured height difference (8) n n
σα2 (v(x)) = σ (∑ αi Yi − Y, ∑ αj Yj − Y) (11)
is given, it is quite hard to interpret the rows of A(x) as i=1 j=1
nonlinear features of some scalar x. However, we might n n n
always resort to the mental trick of considering the rows = ∑ ∑ αi αj σ(Yi , Yj ) − 2 ∑ αi σ(Yi , Y) + σ(Y, Y)
i=1 j=1 i=1
of A(x) as linear features of a vector valued independent
variable x ∈ ℝm . Concretely this means having as train- This immediately implies
ing data {(xk , yk )}nk=1 = {([1 0 0]T , H1∗ ), ([1 − 1 0]T , Δh1 ), . . .}
and approximating a function f : ℝ3 → ℝ that maps the 𝜕 2 n
σα (v(x)) = 2 [∑ αi σ(Yi , Y) − σ(Yi , Y)] = 0
!
xk to the yk linearly, i. e. f (xk ) = f ([xk1 , xk2 , xk3 ]T ) = H1 xk1 + 𝜕αk i=1
H2 xk2 + H3 xk3 . But this equation just defines a hyperplane
in ℝ4 indicating that the adjustment problem has been and α consequently satisfies
reduced to a simple regression in a higher dimensional
space. σ(Y1 , Y1 ) . . . σ(Y1 , Yn ) α1 σ(Y1 , Y(x))
We present for comparison a geostatistical and a func-
[
[ .. .. .. ][ . ] [
][ . ] = [ .. ]
] (12)
[ . . . ][ . ] [ . ]
tional analytic approach, that both enjoy some popular- σ(Y , Y ) . . . σ(Y , Y ) α σ(Y , Y(x))
[ n 1 n n ] ⏟⏟⏟⏟⏟⏟⏟⏟⏟
⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟ [ n ] ⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟
[ n ]
ity in the machine learning community under the names Σ α Σx
of Gaussian process regression and splines in reproducing
kernel Hilbert space. The equations will largely be identi- The above formulae are known as the simple Kriging equa-
cal but the spirit is noticeably different. tions [10, p. 152]. Solving this system leads to the optimal
J. Butt et al., ML and geodesy | 121

choice of coefficients α for assembling the simple Krig- Summarizingly, from the geostatistical perspective,
ing predictor Ŷ SK = ∑nk=1 αk Yk out of measurements Yk , the inclusion of randomness results in a more flexible
k = 1, . . . , n and finally model for the predictions and residuals. This contrasts
with the randomnesses role as a cover term to sub-
Ŷ SK (x) = αT {Yk }nk=1 = ΣTx Σ−1 {Yk }nk=1 . (13) sume unwanted and unmodelled effects in terms of de-
viations from the parametric model in classical adjust-
In the case where also the mean function is unknown, ment.
needs to be estimated and has form h(x) = ∑m l=1 βl gl (x), the In the approach described above, we minimized
universal Kriging system [10, p. 168] arises instead:
E[(Y(x)
̂ − Y(x))2 ] = ‖Y(x)
̂ − Y(x)‖2L2 (Ω)
Σ A α Σ
[ T ][ ] = [ x] (14)
A 0 μ Ax pointwise for each x ∈ X separately to derive a predic-
tor Y(x)
̂ because we took as fundamental the notion of
where α, Σ, Σx are defined as in equation (12), μ is some
a random variable and its variance. It is possible to ab-
m-dimensional Lagrange multiplier, (A)ij = gj (xi ) and
stract from this situation by introducing spaces H(X ) of
(Ax )j = gj (x) defines a column vector. For a fixed x ∈ X ,
functions f : X → ℝ with Gaussian measures on these
the optimal estimator is the universal Kriging predictor
spaces [21] which allow writing the probability of having
Ŷ UK = ∑nk=1 αk Yk with the α chosen to satisfy the system
a randomly drawn f ∈ H(X ) in the subset Q ⊂ H(X )
of linear equations specified above.
as
By solving system (14) via substitution, the coefficient
vector α = [α1 , . . . , αn ]T is found explicitly and the estima- P(f ∈ Q) = ∫ dν(f )
tor can be decomposed into three components. Q

T −1 T −1
α = Σ [Σx − A(A Σ A) (A Σ Σx − Ax )]
−1 −1
where the right hand side is an integral through func-
tion space against some measure ν. Under certain as-
ΣTx Σ−1 {Yk }nk=1
Ŷ UK (x) = ⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟ (15) sumptions [14, p. 9] H(X ) turns out to be completely
Ŷ1 (x) determined by its covariance operator — an infinite di-
ATx (AT Σ−1 A)−1 AT Σ−1 {Yk }nk=1
+ ⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟ mensional analogue of the covariance matrix satisfying
Cf : H(X ) ∋ g 󳨃→ Cf g = E[⟨f , g⟩H f ] ∈ H(X ) which in turn
Ŷ2 (x)
T −1 T −1 −1 T −1 n
is completely specified once the second moment function
− Σ x Σ A(A Σ A) A Σ {Yk }k=1
⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟ K(x1 , x2 ) = E[f (x1 )f (x2 )] is known [3, p. 29].
Ŷ3 (x)
The space H(X ) can be shown to be the reproducing
kernel Hilbert space HK with reproducing kernel K(⋅, ⋅) :
Comparing the above terms to equations (13) and (3), we
X × X → ℝ. In this function space the norm ‖f ‖HK is in-
find that
versely related to its probability of occurrence [14, p. 19].
Ŷ 1 (x) = Ŷ SK (x) Ŷ 2 (x) = Ŷ Adjustment (x) This leads one to formulate the estimation problem glob-
ally for all x ∈ X simultaneously as
and Ŷ 3 (x) is a cross term accounting for the fact that the es-
timated mean Ŷ Adjustment (x) needs to be subtracted for nor- σf = argmin ‖f ‖2HK (17)
f ∈HK :Lf ={yk }nk=1
malization. An alternative way of writing (15) would there-
fore be where σf (⋅) is then called an interpolating spline. The op-
erator L : HK → ℝn goes by the name of measurement
Ŷ UK (x) = Ŷ Adjustment (x) + V̂ SK (x) (16) operator and relates the function f (⋅) to the observed val-
ues yk , k = 1, . . . , n. It is simple evaluation in this case,
where V(x) = Y(x) − Ŷ Adjustment (x), the residual af- i. e. (Lf )j = yj = Lj f . Minimization of ‖f ‖HK is reasonable
ter subtraction of the estimated mean function h(x) = as the whole problem (17) then translates to finding that
ATx (AT Σ−1 A)−1 AT Σ−1 {Yk }nk=1 . We find the main difference to function f which is yk at the positions xk and is most likely
adjustment to be the existence of an estimation term for a as described by some Gaussian measure on the Hilbert
stochastic component owing to the fact that what we want space HK .
to estimate is only somewhat correlated to the measure-
ments.
122 | J. Butt et al., ML and geodesy

If we decide to drop the interpolating conditions and tasks the quantity to be optimized is often the likelihood
replace them with the constraint that Lf be “close” to of residuals whose assumed Gaussian distribution yields
the observations {yk }nk=1 as measured for example in the the classical least squares formulations. Estimation tasks
ℓ2 -norm, the smoothing spline equation (18) ensues. arising in machine learning seem to less often make a
strict distinction between deterministic signal and ran-
σf = argmin‖Lf − {yk }nk=1 ‖2ℓ2 + ‖f ‖2HK (18) dom noise and at times avoid making use of distribu-
f ∈HK
tional assumptions altogether. Instead, they communicate
It balances fidelity to the data and likelihood of the chosen an estimator’s desirability via an objective function that
function. The explicit solution is given by [4, p. 161] is not in all cases stochastically motivated. This leads to a
wider variety of estimators whose properties are less well-
n
known but interesting nonetheless. As shown in the pre-
σf (⋅) = ∑ λj Lj K(⋅, ⋅) (19)
j=1
vious equations (14) and (18), norm minimization tasks of
the type
λ = (Σ + I)−1 {yk }nk=1 (20)
σf = argmin‖Af − y‖2ℓ2 + ‖f ‖2HK (21)
where I is the n × n unit matrix and (Σ)ij = K(xi , xj ). For a f ∈HK
specific σf (x) one gets
correspond to optimal estimation in presence of white
n noise on the measurements Af of a stochastic process f
σf (x) = ∑ λj K(xj , x) = ΣTx Σ−1 {yk }nk=1 with covariance function K(⋅, ⋅). This correspondence ex-
j=1
tends uniquely to an adjustment problem Aλ − y = v with
under interpolating conditions — this is just the simple white noise v on the measurements and a prior that fa-
Kriging estimator if K(x1 , x2 ) = E[Y(x1 )Y(x2 )]. Extensions vors small lengths of the coefficient vector λ. Even though a
to account for unknown means are standard and ulti- prior on coefficient vectors λ with (Aλ)k = (∑m j=1 λj gj (xk ))k ≈
mately yield the same predictions Ŷ Spline (x) = σf (x) as uni- yk seems — at least from this perspective — puzzling at
versal Kriging [3, pp. 88–91]. first, it enters naturally if one assumes that the linear com-
Finally notice that at the {xk }nk=1 for which observa- bination ∑m j=1 λj gj (⋅) is itself chosen randomly with the λj ’s
tions are available distributed as multivariate Gaussian.
This opens up interpretations of further machine
Ŷ Spline (xk ) = Ŷ UK (xk ) = Ŷ Adjustment (xk ) + V̂ k = yk learning methods that are similar in flavour to the abstract
spline problem (21). Consider for example
where V̂ is the residual Aλ∗ − {yk }nk=1 from the adjustment
procedure. This allows the conclusion that splines and Ridge regression: σf = argmin‖Af − y‖2ℓ2 + α‖Bf ‖2ℓ2
f ∈ℝm
Kriging as representers of machine learning approaches
on the one hand and adjustment as a representer of clas- LASSO: σf = argmin‖Af − y‖2ℓ2 + α‖f ‖ℓ1
f ∈ℝm
sical geodetic techniques on the other hand are basically
Elastic net: σf = argmin‖Af − y‖2ℓ2 + α1 ‖f ‖2ℓ2 + α2 ‖f ‖ℓ1
equivalent bar the philosophical difference of what is con- f ∈ℝm
sidered uninteresting noise to be discarded and what is
not. [15, pp. 61, 68, 118] where the α’s are some positive con-
Another difference is that in the adjustment formula- stants that determine if faithfulness to the data or regular-
tion, the decision variable is a parameter vector λ which ity of the estimator are prioritized and B is some linear op-
determines a function f whereas in the machine learning erator. In the above, ‖ ⋅ ‖ℓp denotes the classical ℓp norms,
formulation the function f is itself the decision variable to i. e.
be determined via optimization.
m
‖f ‖ℓp = √p ∑ |fk |p .
k=1
2.4 Connection to other norm-based
algorithms Note that the ℓp norms are nonnegative functions of f ; con-
sequently minimizing them is equivalent to maximizing
Defining estimators as solutions to norm minimization a likelihood. This holds since for any nonnegative func-
problems is a common method of formalization in both tion q(f ) ≥ 0 ∀f ∈ ℝm satisfying additional constraints
geodesy and machine learning. In geodetic estimation exp(−q(f )) is normalizable with c−1 = ∫ℝm e−q(f ) df < ∞
J. Butt et al., ML and geodesy | 123

implying that c exp(−q(f )) is a valid probability density

function. Therefore to each norm type there corresponds
a unique probability density function: to the ℓ2 norm one
may associate the multivariate Gaussian and to the ℓ1 norm
a multivariate version of the Laplacian distribution. See
Fig. 2 below for some sketches of the respective norms and
densities in the instructive 1 dimensional case.

Figure 2: The ‖ ⋅ ‖ℓ1 and ‖ ⋅ ‖ℓ2 norms and their corresponding proba-
bility densities associated with the Gaussian and Laplacian distribu- Figure 3: The ℓ1 -norm based estimation is much more robust than
tion. Note the Laplacian’s heavier tails. the ℓ2 -norm based estimation. The lower panels show the resid-
uals between coordinates in system 2 and the coordinates trans-
formed into system 2 from system 1 via transformations derived
from Eq. (22). The scale in the lower panels is identical.
The Gaussian pdfs derivative at its mean is zero; the
pdfs value converges to zero extraordinarily fast. The
Combining I and II explains why ℓ1 -norm minimization
Laplacian pdf in contrast has heavy tails but its derivative
leads to sparse and robust estimators that can systemat-
at the mean is undefined. We extract the following from
ically outperform ℓ2 -norm based least squares solutions.
our discussion and the images in Fig. 2:
Therefore ridge regression might be seen as adjustment
I When minimizing the ℓ2 -norm or equivalently max-
with a prior on the length of Bλ, LASSO has a sparsity prior
imizing the likelihood under a Gaussian pdf, small
on the parameter vector λ and elastic net regularization
residuals are considered almost irrelevant since the
balances both. To obtain the usual interpretations, swap
gradient of ‖ ⋅ ‖2ℓ2 around 0 is zero. Large deviations
f for λ in the above and assume the stochastic process f to
are punished disproportionately strong: During mini-
be determined by a multivariate Gaussian on Bf , sparse or
mization decreasing a big residual is considered more
a combination of both.
favourable than decreasing several small ones by the
We demonstrate performance difference of ℓ2 and
same amount. 1
ℓ -norm based estimation in the presence of outliers for
II When minimizing the ℓ1 -norm or equivalently max-
the typical geodetic task of inferring a Helmert transfor-
imizing the likelihood under a Laplacian pdf, small
mation with fixed scale from coordinate measurements in
residuals are punished less than big ones but still
Fig. 3. We briefly sketch the algorithm used to find the op-
severely as the gradient of ‖ ⋅ ‖ℓ1 around 0 is constant
timal transformation A(λ∗ ) with
and positive driving either f to sparsity (if ‖f ‖ℓ1 → min)
or leading to sparse residuals (if ‖Af − y‖ℓ1 → min). λ∗ = argmin ‖A(λ)x − y‖ℓp p = 1, 2 (22)
λ=[xA ,yA ,φA ]∈ℝ3
Big residuals are penalized proportionally: decreasing
a big residual is as good as decreasing an already small that maps the coordinates x in system 1 onto the coordi-
residual by the same amount. nates y in system 2:
124 | J. Butt et al., ML and geodesy

Table 1: Correspondence of terminology in machine learning and adjustment.

Terminology or quantity Role in ML Role in adjustment

Target vector y in ‖Aq − y‖ → min. y is a vector of observations yk , k = 1, . . . , n. y is a vector of observations yk , k = 1, . . . , n.
Decision variable q in ‖Aq − y‖ → min. q = f is the vector of function values to be q = λ is a vector of parameters used to
estimated; no parametric form is assumed. construct a function f (⋅) = ∑m
j=1 λj gj (⋅).

Operator A in ‖Aq − y‖ → min. A is an operator that maps f onto A is a matrix whose rows are vectors of
measurements of f and emulates the way that (nonlinear) transformations of the points xk ,
observations y are generated. Af then typically k = 1, . . . , n. Aλ then typically is
is f evaluated at points xk , k = 1, . . . , n. A is f (⋅) = ∑mj=1 λj gj (⋅) evaluated at the points xk .
called the measurement operator. A is called the design matrix.
p
Term ‖Bq‖rr in ‖Aq − y‖p + ‖Bq‖rr → min. With q = f , ‖Bf ‖rr is a regularization term that Since q = λ is a vector of parameters, there
includes a prior on the function f into the seems hardly any justification for penalizing
estimation of f . The exact nature of the prior terms ‖Bλ‖rr . With B = αI, α > 0 and r = 2 they
depends on norm r and energy operator B. may be introduced for numerical reasons
under the name of Tikhonov regularization.
Randomness and residuals The quantity f to be estimated is assumed to The quantities λ to be estimated are assumed
come from a stochastic process. Unmodelled to have fixed deterministic values.
effects can be pushed onto f during estimation Randomness is a property of the residuals
but f is very flexible. The residuals v = Af − y v = Aλ − y that act as a flexible catch all term
are random too; f and v are distinguishable subsuming all effects unaccounted for by the
only via their correlation structure. parametric model.
Features A feature is a potentially infinite dimensional During the construction of the design matrix
vector that contains (nonlinear) A, the concept is used implicitly. Each of its
transformations of the input variable x, i. e. rows can be interpreted as a feature in some
g(x) = {gj (x)}∞
j=1 for some set of functions gj . input variable x.
Representations A representation of a dataset {yk }nk=1 is a choice The choice of a good representation is left to
of basis functions {gj }∞
j=1 such that each yk is the practitioner, whose responsibility it is to
representable as a combination of gj ’s. A either determine a set of function {gj }m j=1 such
representation can be determined that ∑mj=1 λj gj (xk ) approximates yk or derive
automatically by solving an optimization them from the geometrical of physical
problem. configuration of the task.
Note that many special procedures exist, which is why our explanations are geared to a proper description of only a simple subset of tasks that
might be formulated as the minimization of discrepancy and irregularity measures. We consider these to be a good first order approximation
to many commonly encountered problems in both fields.

1. Get initial solution: λ0 ∈ ℝ3 . 3 Learning algorithms and toy

Set up problem: yk = A(λk )x, Δyk = yk − y.
2.
3. Estimation step: Δλ∗ = argmin‖DA[λk ] Δλ − Δyk ‖ℓp .
applications
Δλ∈ℝ3
k+1 k After having related adjustment theory to learning, we pro-
4. Update step: λ = λ + Δλ . Repeat steps 2–4 until
∗

convergence. ceed to explain three algorithms which do not have an ex-

act analogue within the bounds of the adjustment frame-
In the above, D denotes the differential with respect to the work. Since therefore necessarily the arguments and cal-
parameters. The initial solution can be guessed via an ini- culations deviate from classical material, intuition is pro-
tial least squares step or by solving a subproblem which vided, as to why the methods work and how they are to be
is neither over- nor underdetermined. The minimization applied in practice. To underline the latter, brief and sim-
problem in step 3. is either solved analytically (ℓ2 -norm) ple — but from a least squares perspective nontrivial — toy
or via linear programming (ℓ1 -norm) [7, p. 294]. examples from geodesy are tackled in a fashion emphasiz-
We close this section by stating in the following Table 1 ing approximate interrelations between concrete task and
an approximate correspondence between terminology and methodological approach rather than rigour.
some quantities roles in machine learning and geodetic es- The preceding section made use of a function K : X ×
timation. X → ℝ to represent an estimator f ̂ for a function f in terms
J. Butt et al., ML and geodesy | 125

of basis functions Kx (⋅) := K(x, ⋅) evaluated at the sample have at the same time ϕx1 (⋅) as an element of a (repro-
points {xk }nk=1 . The stochastically optimal choice for this ducing kernel) Hilbert space HK [32, p. 39] and as an
so called kernel function turned out to be given by the infinite set of X -parametrized powers of x1 . As ϕx1 (⋅)
covariance function K(x1 , x2 ) = E[Fx1 Fx2 ] where ∀c ∈ X , contains nonlinear information about x1 it is called a
Fx : Ω ∋ ω 󳨃→ Fxω ∈ ℝ was a square integrable random (nonlinear) feature of x1 .
variable indexed by the space variable x ∈ X .
This view immediately suggests to generalize the finite It should now be clear that ϕx (⋅) ∈ HK is an infinite di-
dimensional covariance matrix ΣF of a random vector F mensional representation of x ∈ X that for specific choices
taking values in ℝn , n < ∞ and satisfying of K(⋅, ⋅) can even encode all the information possibly to
be known about x ∈ X [12]. We will call this the feature-
⟨ΣF g, h⟩ℝn = ⟨E[F ⊗ F ∗ ]g, h⟩ℝn interpretation of a kernel in what follows.
= E[⟨F, g⟩ℝn ⟨F, h⟩ℝn ] ∀g, h ∈ ℝn , (23) The reader is advised to not mix up both interpreta-
tions as the implied objects of investigation are different.
towards the typically infinite dimensional covariance op- The covariance-interpretation assumes the measured ob-
erator CF exhibiting an exactly analogue relationship [3, jects to be (nonlinear) functions f : X → ℝ with f ∈
p. 29]. This can be done by defining it as the selfadjoint HK subjectable to linear operations only. In contrast to
positive definite kernel operator CF : HK ∋ g 󳨃→ (CF g)(⋅) := this, the feature-interpretation assumes the measured ob-
∫X K(x, ⋅)g(x)dx ∈ HK . jects to be x ∈ X and embeds them nonlinearly in HK
As such K(⋅, ⋅) takes the role of a function determining for some reproducing kernel Hilbert space HK . Both per-
the entries in an infinite dimensional covariance matrix spectives rely on the algebraic and geometric properties
that will intuitively be recognized by the practical geode- of the involved reproducing kernel Hilbert spaces (RKHS),
sist as a natural extension of the already known frame- whose internal structure as determined by the kernel K(⋅, ⋅)
works to function space valued estimation problems. For impacts the form of estimators and feature embeddings
the remainder of the paper we term this way of think- alike.
ing about a kernel the covariance-interpretation. There is, We proceed to apply both high dimensional embed-
however, a second radically different perspective onto ker- ding philosophies to supervised and unsupervised prob-
nels that is used concurrently in machine learning [32, lems and start with the more familiar one of interpreting
p. 39] and emphasizes the meaning of K(x, ⋅) = ϕx (⋅) ∈ HK kernels as generators of covariance matrices.
as a Hilbert space valued nonlinear feature of x ∈ X .
An instructive way to illustrate this consists in two
separate steps that are roughly sketched for the special 3.1 Application 1: signal separation for total
case of a Gaussian kernel K(x1 , x2 ) = exp(−‖x1 − x2 ‖2 ) for
station data when the covariances are
x1 , x2 ∈ X ⊂ ℝ.
1. Rewrite −‖x1 − x2 ‖2 as −‖x1 ‖2 + 2⟨x1 , x2 ⟩ − ‖x2 ‖2 and
known
substitute this term in the exponential expression for Suppose a total station was set up as depicted in Fig. 4 to
K(x1 , x2 ) to derive monitor the movement of a prism. To not unnecessarily
2 complicate this example, it will be assumed to hold that
K(x1 , x2 ) = c(x2 )e2x1 x2 −x1 (24)
no tilt occurs and movement of the prism is constrained
∞
Hn (x2 ) n ∞
to purely lie in the x-direction implying a one dimensional
= c(x2 ) ∑ x1 = ∑ αn (x2 )x1n
n=0
n! n=0 formulation to be sufficient. The measurements are sup-
plied in the form of a time series of x-coordinates.
where Hn (⋅) is the n-th Hermite polynomial [31, p. 456]
and αn (x2 ) := c(x2 )Hn (x2 )/n! is a function solely de- Given: A sequence of times tj ∈ T and corresponding
pending on x2 . measurements mj ∈ ℝ of the x-coordinate in the format
2. Notice that K(x1 , x2 ) is effectively a linear superposi- {(tj , mj )}nj=1 where n is the number of measurements.
tion of monomials x1n , n ∈ ℕ0 where the coefficient
Goal: Split the signal into separate parts that are in a
vector {αn }n∈ℕ0 depends on the exact value of x2 . As
stochastically reasonable way optimally identifiable with
x2 is varied to x2󸀠 the coefficient vector changes as well
noise, atmospheric influences and true x-coordinate.
resulting in K(x1 , x2󸀠 ) being a different linear combina-
tion of powers of x1 . When x2 ∈ X is not fixed at all, Assumption: The measurements mj are realizations of
then K(x1 , ⋅) = ϕx1 is a function from X to ℝ and we square integrable random variables Mt⋅j : Ω ∋ ω 󳨃→ Mtωj ∈ ℝ
126 | J. Butt et al., ML and geodesy

for an example see Fig. 5. The algorithm is quite robust to

misspecification of the kernels as long as the correlations
structures of the individual mixture components to be sep-
arated are appropriately encoded in the kernels.
The extension to vector valued and tensor valued
splines is straightforward and much effort has been put
forward to guarantee practical computability and stabil-
ity of the numerical schemes that nowadays incorporate
Figure 4: The setup for the signal separation problem in section 3.1.
many ideas from finite element analysis and spectral the-
The measurements of the prisms changing x-coordinates contain
atmospheric influences and noise. ory, see [4].

for all tj ∈ T; i. e. {Mt : t ∈ T} is a stochastic process and 3.2 Application 2: signal separation for total
separable in the following way: station data when the covariances are
unknown
Mt = Nt + At + Xt (25)
In what follows, the requirements on prior knowledge are
where the stochastic processes Nt , At , Xt correspond to
relaxed and the covariance structure is no longer assumed
noise, atmosphere and x-coordinate respectively and are
to be known. In only presupposing the measurements as
independent from each other. Their covariance functions
being made up of statistically independent parts, we pass
(=kernels KN , KA , KX ) are assumed to be either known ap-
from a supervised to an unsupervised learning problem
proximately or inferrable from the different time scales of
that has no analytical solution anymore. Before describ-
Nt , At , Xt that find their expression in the decay character-
ing and applying K-ICA in this setting, it is instructive to
istics of the kernels.
explain the most common measures for characterizing in-
Main idea: The measurements M⋅⋅ : Ω ∋ ω 󳨃→ M⋅ω ∈ ℝT dependence of random variables.
are assumed to lie in some infinite dimensional Hilbert Two random variables X, Y : Ω → ℝ are called in-
space HM with kernel KM = KN + KA + KX which implies dependent if — bar some technicalities concerning mea-
that HM is the direct sum of the Hilbert spaces contain- surability and continuity — their joint probability density
ing pure noise, atmospheric influences and x-coordinates function fXY (x, y) factors into the product of its marginals;
in the sense that HM = HN ⊕ HA ⊕ HX . For a more rigor- i. e. fXY (x, y) = fX (x)fY (y) [1, p. 91]. One writes X ∐ Y in this
ous account, see [21]. An optimal interpolating spline σm case. It is well known that two jointly multivariate Gaus-
is found such that σm (tj ) perfectly coincides with the mea- sian distributed random variables X and Y are uncorre-
surements at times tj . lated if and only if they are independent [27, p. 71]. Gen-
erally, however, two random variables X, Y may have zero
σm (⋅) = argmin ‖m‖2HM (26) correlation without necessarily being independent. In the
m∈HM :m(tj )=mj
n
non-Gaussian case the covariance function
ij
σm (⋅) = ∑ λj KM (tj , ⋅) with λ = (KM )−1 m (27)
j=1 cov(X, Y) = ∫ (x − E[X])(y − E[Y])fXY (x, y)dxdy
ℝ2
ij
whereby KM is the matrix with entries KM (ti , tj ) ∈ ℝn ⊗ ℝn
and m ∈ ℝn is the n-dimensional vector containing the is therefore a necessary but insufficient indicator of inde-
measurements. Notice that σm (⋅) is not a number but a pendence and needs to be replaced by the entropy-based
function of t ∈ T. Subsequently σm (⋅) ∈ HM will be or- mutual information I(X, Y).
thogonally projected onto the subspaces HN , HA and HX The mutual information between two random vari-
to yield the optimal estimators σn = ΠN σm , σA = ΠA σm , ables is defined as [19]
σx = ΠX σm . fXY (x, y)
I(X, Y) = ∫ fXY (x, y) log [ ] dxdy (28)
Results: When reliable covariance information is avail- fX (x)fY (y)
ℝ2
able, the results are stochastically optimal under a Gaus-
sian process assumption [28, p. 27]. Also on a purely vi- where the expression is evaluated as an appropriate limit
sual level the outcome of applying the estimation proce- in any pathological cases. It can be shown that I(X, Y) =
dure above to an exemplary dataset seems reasonable — 0 ⇔ X ∐ Y and otherwise I(X, Y) > 0.
J. Butt et al., ML and geodesy | 127

Figure 5: An example of signal separation. The left panels show the true underlying ground truth (synthetic data), that is superimposed
to generate the signal plotted in the center. This time series is the input for the RKHS based estimation framework outlined in section 3.1
whose output are the estimations visible on the right side. The scale is the same for the six outer plots.

Therefore from a theoretical perspective if one wanted cient finite dimensional implementation of this infinite di-
to split a timeseries {mj }nj=1 linearly into independent com- mensional problem.
ponents {aj }nj=1 and {xj }nj=1 one could try to minimize the Suppose, a total station S was set up as depicted
mutual information between the {aj }nj=1 and the {xj }nj=1 in Fig. 6 to monitor the movement of two prisms P1 , P2
which are assumed to be realizations of random variables mounted on a planar structure subject to a translational
A and X with significantly different probability distribu- rigid, but time dependent change of coordinates. To keep
tions. Typically estimating the mutual information empir- the example simple, only the x-coordinates will be inves-
ically is hard, however, and it is more common to instead tigated to arrive again at a one dimensional formulation
maximize a contrast function ρ(A, X) that convincingly that parallels the one presented in section 3.1 but with in-
measures how different the distribution of A is from that creased difficulty due to the absence of any knowledge of
of X. Such contrast functions ρ are typically derived from a the correlation structure of the signals to be separated.
Taylor expansion of −I(⋅, ⋅) in terms of features of probabil-
ity distributions (e. g. third moment or higher order cumu-
lants) that partially emulate the property of −I(A, X) being
biggest for fA being very different from fX [2].
It is entirely possible to apply the aforementioned in-
finite dimensional embedding of probability distributions
into an RKHS HK in this setting. An efficiently computable
measure of dependence to be minimized is then given by
the HK -correlation

E[g1 (A)g2 (X)]

ρHK (A, X) = sup (29)
‖gj ‖HK =1,j=1,2 √E[g (A)2 ]E[g (X)2 ] Figure 6: The x coordinate of the prisms P1 , P2 are measured employ-
1 2
ing a total station positioned at S. The distances d1 , d2 and angles
φ1 , φ2 are not assumed to be known.
for functions g1 , g2 that are already centered in feature
space HK [2]. ρHK is to be interpreted as the maximally
achievable correlation between nonlinear transformations
of the random variables A and X where optimization is car- Given: A sequence of times tj ∈ T and corresponding se-
ried out over the class of nonlinear transformations. Mak- quences of measurements mkj ∈ ℝ of the x-coordinates
ing use of the reproducing property of the prism Pk , k = 1, 2. The totality of measurements is
summarized in the sequence {(tj , m1j , m2j )}nj=1 where n is the
⟨g, ϕP ⟩HK = EP [g(X)] number of measurements.
for ϕP = EP [K(X, ⋅)] expectation w.r.t to the probability Goal: Split the signal {(m1j , m2j )}nj=1 into two separate parts
measure P the kernel trick allows computationally effi- that are in a stochastically reasonable way identifiable
128 | J. Butt et al., ML and geodesy

with atmospheric influences and the true x-coordinates of At with Xt ∐ At , a 2 × 2 matrix W with WMt = Ŷ t max-
P1 , P2 . imally independent in the sense of mutual information
would solve the problem apart from the usual ambiguities
Assumption: The measurements mkj are realizations of
encountered during ICA [17].
square integrable random variables Mtkj for all tj ∈ T, i. e.
To approximately achieve this, minimize the
{Mtk : t ∈ T} are two stochastic processes which we assume Hk -correlation
to be linear mixtures of deformations and atmospheric in-
fluences: E[f (X̂ t )g(Â t )]
ρHk (X̂ t , Â t ) = sup
f ,g∈Hk √E[f 2 (X̂ )]E[g 2 (Â )]
Mt1 = q11 Xt + q12 At t t

Mt2 = q21 Xt + q22 At

for some kernel k(s, t) determining the flexibility of permis-
where the stochastic processes Xt and At correspond to sible transformations f , g ∈ Hk . As explained in Eq. (29),
x-coordinates and atmospheric influences respectively. In this is a measure of mutual dependency. To minimize it,
short vector notation and with obvious identifications, one employ the code provided by [2] consisting of iteratively
may write executing the following three steps starting from an initial
guess of W:
Mt = QYt (30) I Whiten the data and construct the kernel matrices Kl ,
l = 1, 2 with elements (K1 )ij = k(X̂ ti , X̂ tj ) and (K2 )ij =
instead. For this model to be reasonable, it is necessary
k(Â t , Â t ) for i, j = 1, . . . , n where X̂ t and Â t are derived
that the whole planar structures motion is sufficiently well i j

described by a translation to guarantee that the behaviour from Ŷ t = WMt . Then center the kernel matrices.
of the two prisms x-coordinates is identical. Furthermore II Solve the regularized kernel canonical correlation
the atmospheric conditions need to be constant over the generalized eigenvalue problem
whole spatial domain to ensure that their influence on the
0 K1 K2 λ
measurement series {mkj }nj=1 is representable as terms q12 At [ ][ ]
K2 K1 0 μ
and q22 At linearly related to some underlying scalar At .
(K1 + αI)2 0 λ
Further explanation: The atmospheric conditions are al- = ρW [ 2] [ ]
0 (K2 + αI) μ
lowed to vary in time. We may calculate the entries of Q
based on knowledge of the geometrical configurations and
for ρW to determine the Hk correlation dependent on
usual formulas for distance reduction of electrooptic mea-
the matrix W.
surements by noting that the atmospheric correction Δxk
III Minimize −1/2 log λW where λW is the smallest of the
satisfies
generalized eigenvalues ρW by gradient descent on the
Δxk = Δdk cos φk manifold of orthogonal matrices.
= α(Temperature,Pressure)
⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟ d k cos φ2
⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟
At qk2 Results: When the two underlying stochastic processes Xt
and At generate non-Gaussian data and are governed by
where α is a meteorological correction factor, see for exam- probability distributions which are reasonably well dis-
ple [35, p. 310]. This would allow us to solve the problem tinguishable via linear combinations of higher order sta-
immediately by inverting Q and applying Q−1 =: W to the tistical moments, the splitting achieved via kernel ICA is
sequences of measurements — however we do not want to convincing. For an exemplary application to a simulated
do this but demand that the algorithm finds the most prob- dataset, see Fig. 7.
able decomposition based not on physically or geometri- The situation exhibited in Fig. 6 is not entirely realis-
cally motivated knowledge but solely on the probabilistic and would need to be modified for any actual appli-
tic assumption that the x-coordinates and atmosphere are cation in practice — however the idea of splitting several
stochastically independent of each other. Neither Xt nor At
sequences of measurements into maximally independent
are allowed to be Gaussian since approximate stochastical
components is a promising one.
independence will be achieved by maximizing some mea-
The framework is applicable whenever measurements
sure of non-Gaussianity [8].
generate for each point in time a whole vector of values
Main idea: Since the measurements Mtk : Ω ∋ ω 󳨃→ and there are reasons to suspect that each entry in that vec-
Mtk (ω) ∈ ℝ are supposedly both linear mixtures of Xt and tor is a linear mixture of quantities of actual interest. This
J. Butt et al., ML and geodesy | 129

yi = +1}, χ− := {xi ∈ χ : yi = −1} where χ is the set of all

x-values in the training set. If it is possible to find a hyper-
plane P(ℝm ) in ℝm which separates χ+ from χ− then P(ℝm )
is called a separating hyperplane and χ+ and χ− are termed
linearly separable in ℝm . An illustration for the case m = 2
is found in Fig. 8.

Figure 7: Simulated atmospheric effects and deformations in forms

of time series (row 1) are mixed together with the matrix [1 0.5; 1 0.8]
(row 2). The unimixing was done with the K-ICA procedure outlined Figure 8: The left panel shows two sets χ+ and χ− which are linearly
in this section (row 3). After normalization, the estimated mixing separable in ℝ2 whereas the same cannot be said for the situation
matrix is [1 − 0.4; 1 − 0.7]. The scale is arbitrary but identical for all on the right side. Elements of χ+ are marked as disks, elements of χ−
subplots. as circles.

The figure also exhibits an example, in which χ+ and

may open up not only new purely statistical signal sepa- χ− are not linearly separable in ℝ2 . However, after center-
ration procedures but also suggest different mensuration ing and a nonlinear transformation of type ϕ : (x1 , x2 ) 󳨃→
strategies that are explicitly meant to measure only indi- x12 + x22 associating to each x its distance to the origin, χ+
rectly the quantities of interest in the form of easily acces- and χ− are linearly separable in the feature space ℝ1 via a
sible linear mixtures and infer them later on via optimiza- separating hyperplane — simply thresholding in this low
tion in a separate post-processing phase. This promotes a dimensional case.
rather opportunistic viewpoint similar to the one taken in This suggests again a kernelization approach: Instead
compressive sensing that contrasts starkly with the clas- of trying to find a separating hyperplane P(ℝm ) in the in-
sical geodetic perspective, in which the quantities of input space ℝm containing the xi , map xi into some infi-
terest are supposed to be the direct outcomes of measure- nite dimensional Hilbert space Hk of features by setting
ments. ϕ : ℝm ∋ x 󳨃→ ϕ(x) = k(x, ⋅) ∈ Hk for some reproducing
kernel k(⋅, ⋅) and search for a separating hyperplane in the
RKHS Hk instead.
3.3 Application 3: classification of total A hyperplane P(Hk ) in Hk is completely specified by
station data a normal vector f ∈ Hk and a (positive or negatively
weighted) distance d to the origin.
Suppose now that for a time series of noisily gathered co-
ordinate measurements a decision is sought as to judge if P(Hk ) := {g ∈ Hk : ⟨f , g⟩Hk + d = 0}
it is indicative of harmful deformations or not. Assume fur-
Depending on what side of P(Hk ) a point h = ϕ(x) ∈ Hk
ther that only exemplary time series are provided to which
comes to lie, it is predicted to either belong to class χ+ or
a label is assigned that classifies them as belonging to a
χ− via
harmless (−1) or dangerous (+1) situation. One commonly
used way to solve supervised classification tasks like this ŷ = sign (⟨f , h⟩Hk + d) = sign (⟨f , k(x, ⋅)⟩Hk + d)
is to use support vector machines, whose basic principle
we briefly outline in what follows. which also directly gives the decision rule for classifying
Suppose the set of pairs {(xi , yi )}ni=1 , xi ∈ ℝm , yi ∈ previously unencountered inputs x ∈ ℝm [32, p. 190]. Find-
{−1, +1} are the training examples and let χ+ := {xi ∈ χ : ing an approximately separating hyperplane P(Hk ) in the
130 | J. Butt et al., ML and geodesy

RKHS Hk that balances the number of misclassifications

and the regularity of P(Hk )’s backprojection into ℝm is ap-
proximable by the optimization problem [15, p. 428]
n
̂ j )]+ + α‖f ‖2H
(f ̂, d)̂ = argmin ∑ [1 − yj y(x (31)
k
f ∈Hk ,d∈ℝ j=1

where [ ⋅ ]+ denotes the positive part and y(x) ̂ =

sign (f (x) + d). Consequently Lj (f ) = [1 − yj ŷj (xj )]+ is a pos-
itive functional of f for each j quantifying the classifica-
tion error. The norm ‖f ‖Hk has the same interpretation as
in section 2.4. The positive parameter α balances fidelity to
the data and regularity [15, p. 424]. Notice that it is entirely
possible to swap Lj (f ) in equation (31) for a quadratic error
term and recover the smoothing spline equation (18).
Equivalently one may solve the quadratic program
n
1 n n
maximize ∑ λj − ∑ ∑ λ λ y y K(xi , xj ) (32)
n
λ∈ℝ
j=1
2 i=1 j=1 i j i j
n
subject to 0 ≤ λj ≤ C , ∑ λj yj = 0
j=1

for some positive C dependent on the parameter α from

equation (31) [32, p. 205] [15, p. 420]. Efficient algorithms
Figure 9: The uppermost panel exhibits the sequence of deforma-
are available to solve this problem for the parameters tion measurements which are to be evaluated as either dangerous
{λj }nj=1 which then are used to assemble the class estima- or not. Some of the training data is presented in the lower two pan-
tor els. The scale is the same for all plots of training data.

n
y(x)
̂ = sign (∑ λj yj K(x, xj ) + d) (33)
j=1 Assumption: The set of time series χ+ associated to dan-
gerous behaviour is approximately linearly separable from
with d = 1/yi − f (xi ) for any i = 1, . . . , n. This classifier is
the set χ− after embedding it into the infinite dimensional
termed support vector machine and we will apply it imme-
Hilbert space Hk of features via the map
diately to the problem outlined before.
Let the time series in Fig. 9 be the input x for our clas- ϕ : ℝm ∋ c 󳨃→ ϕ(x) = k(x, ⋅) ∈ Hk .
sification problem; the sets χ+ and χ− providing exemplary
time series associated to harmful and harmless situations Furthermore assume that the euclidean distance is a
are sampled there as well by listing some representatives. meaningful measure of closeness between time series. Us-
Given: A sequence {xj }nj=1 of deformation measurements age of the linear kernel k(x1 , x2 ) = ⟨x1 , x2 ⟩ℝm derived from
the inner product in ℝn is then justified.
xj = {xji }m m i n
i=1 ∈ ℝ at times ti ∈ T in the format {(ti , xj )}i=1
m
where the sequence xj ∈ ℝ is the interesting part and the Main idea: Solve the optimization problems specified in
time information will regularly be discarded. There is fur- equations (31) or (32) to find a parameter vector λ and a
thermore a training set of examples {(xj , yj )}nj=1 where again constant d such that the classifier ŷ assembled from λ and
each xj is a time series and yj is the corresponding label. d according to equation (33) has both acceptable regularity
and misclassification rate on the training set. Afterwards,
Goal: Emulate the input-output behaviour mapping time
apply ŷ : ℝm → {−1, +1} to unseen time series to classify
series onto danger assessments via the class prediction
them.
function y(x)
̂ defined in equation (33). It makes use of the
RKHS Hk of functions on ℝm with reproducing kernel k(⋅, ⋅) Results: Support vector machines usually perform reason-
that maps pairs of time series onto a real number quanti- ably well although more sophisticated methods exist for
fying their similarity. function approximation problems [9]. Table 2 summarizes
J. Butt et al., ML and geodesy | 131

the SVM’s behaviour in terms of errors of the first and sec- ∀x+ ∈ χ+ and x− ∈ χ− because ⟨f ̃, Ax⟩ℝ = ⟨AT f ̃, x⟩ℝm for
ond kind. For the estimation of empirical error probabil- any A : ℝm → ℝ. For a simple example like this, embed-
ities the cycle of simulating ground truth, fitting an svm dings into infinite dimensional Hk are unnecessary. When
and classifying 100 randomly chosen time series was rerun the underlying classification rule (=failure mechanism in
100 times while the amount of training examples was sub- our example) is complicated or unknown and danger as-
jected to systematic change. Classification was done using sessment is demanded based only on a sequence of mea-
the Matlab built-in “fitclinear”. surements somewhat correlated with the reasons for crit-
ical behaviour, they may however prove helpful. [32] pro-
Table 2: Performance of SVM’s for the specific task outlined above. vide some examplary applications that go into this direc-
tion and demonstrate the usefulness of including kernel-
error samples based nonlinearities into estimation.
10 102 103 104 It is possible to establish that the inner-product-based
type I in % 6.6 1.8 0.6 0.2 decision rule for linear SVM’s is the same as the Bayes rule
type II in % 6.7 2.0 0.6 0.2
Empirically estimated probabilities of type I error (incorrect rejection log(fY|X (y = +1|x)fY|X
−1
(y = −1|x)) ≷ 0 ⇒ y(x)
̂ = ±1
of null hypothesis H0 ) and type II error (failure to reject incorrect null
hypothesis H0 ). H0 is the hypothesis that y(x) = −1. for some semiparametric probability density function fY
whose parameters haven been inferred via Maximum Like-
lihood estimation [11]. This is obviously a form of likeli-
We want to close this section with a few clarifying hood ratio test as employed for comparing two statistical
remarks regarding simulation methodology and a link to models in classical hypothesis testing.
classical hypothesis testing.
This example is again purely synthetic. We randomly
sampled from a stochastic process that corresponds to
Brownian motion, each realization was considered to be
4 Conclusion and outlook
a time series xj ∈ ℝm of deformation measurements. If the
In this paper, we investigated the interface between geode-
best fitting line through {(ti , xji )}m
i=1 had positive slope, the
tic data analysis and machine learning algorithms. It
situation was classified as dangerous and harmless other-
turned out that adjustment as used in the geodetic com-
wise. This generation rule for our synthetic ground truth
munity can be interpreted as a learning algorithm via
was not communicated to the SVM however, which only
proper relabeling of the terms occuring in the optimiza-
received the labeled training examples and had to infer the
tion task arising during maximum likelihood estimation
rule by itself. Notice that even for the trivial finite dimen-
under assumption of Gaussianity. This was exemplified in
sional kernel k(xi , xj ) = ⟨xi , xj ⟩ℝm the limit performance
a simple application, in which adjustment, geostatistics
should be almost perfect separation since the underlying
and splines were employed for regression and interpola-
true classification rule is
tion purposes. They were shown to essentially agree when
Ax ≥ 0 ⇒ y(x) = +1 applicable. A table was provided that served as a guide-
Ax < 0 ⇒ y(x) = −1 line to translate between adjustment theoretic and ma-
chine learning motivated treatments of estimation prob-
where A : ℝm → ℝ is a linear operator consisting of a
lems.
concatenation of line fitting and calculation of the deriva-
Apart from the different role of stochasticity in both
tive of that line — both operations being linear in the data.
fields, one of the main differences is the focus on high di-
Therefore Aχ+ is linearly separable from Aχ− in ℝ1 , and the
mensional embeddings of data. It was outlined, how infi-
underlying decision rule can be written as
nite dimensional problems can be efficiently solved using
⟨f ̃, Ax+ ⟩ℝ ≥ 0 kernels and some intuition was gathered by tackling a se-
⟨f ̃, Ax− ⟩ℝ < 0 quence of instructive albeit simple geodetic toy problems
— not all of which were known to be easily solvable. The al-
∀x+ ∈ χ+ and x− ∈ χ− where f ̃ is any nonzero number. This gorithms are shown to be demonstrably easy to implement
implies for f = AT f ̃ ∈ ℝm the equivalent decision rule with further examples freely available on GitHub.1
⟨f , x+ ⟩ℝm ≥ 0
⟨f , x− ⟩ℝm < 0 1 https://github.com/jemil-butt/ML_tutorials_geodesy
132 | J. Butt et al., ML and geodesy

We speculate that an influx of ideas and procedures References

developed in the machine learning community into the set
of methods finding widespread usage in geodesy is bound
[1] R. B. Ash, Basic Probability Theory, Courier Corporation, New
to be beneficial particularly in the following subfields:
York, 2008.
Mensuration design: The existence of numerical algo- [2] F. R. Bach and M. I. Jordan, Kernel independent component
rithms for approximate optimization in connection to analysis, J. Mach. Learn. Res., 3 (2002), pp. 1–48.
difficult nonlinear tasks with many decision variables [3] A. Berlinet and C. Thomas-Agnan, Reproducing Kernel
implies the possibility to adapt measurement strate- Hilbert Spaces in Probability and Statistics, Springer, Berlin
Heidelberg, 2011.
gies dynamically as data comes in, as for example
[4] A. Y. Bezhaev and V. A. Vasilenko, Variational Theory of
might be the case in monitoring scenarios. Further-
Splines, 1st ed., Springer, Berlin Heidelberg, 2013, softcover
more solutions to previously untackled problems in reprint of hardcover.
estimation and inference might relax constraints usu- [5] K. Borre, ed., The Adjustment Procedure in Tensor Form,
ally imposed on instrument and campaign setups. Springer, Berlin Heidelberg, Berlin, Heidelberg, 2006,
Data analysis: Regression and classification, supervised, pp. 17–21.
[6] K. Borre, ed., Some Remarks About Collocation, Springer
unsupervised and reinforcement learning are tools of
Berlin Heidelberg, Berlin, Heidelberg, 2006, pp. 261–272.
which only the first one is commonly exploited in [7] S. P. Boyd and L. Vandenberghe, Convex Optimization,
the geodetic community. Whereas the impact of better Cambridge University Press, Cambridge, 2004.
classification methods as generalizations of rigorous [8] J. Cardoso, Blind signal separation: statistical principles,
hypothesis testing is to a certain degree predictable, Proceedings of the IEEE, 86 (1998), pp. 2009–2025.
[9] R. Caruana and A. Niculescu-Mizil, An empirical comparison
unsupervised and reinforcement learning as frame-
of supervised learning algorithms, in Proc. 23 rd Intl. Conf.
works for optimal decision making and pattern recog-
Machine learning (ICML’06, 2006, pp. 161–168.
nition in situations involving uncertainty provide ex- [10] J.-P. Chiles and P. Delfiner, Geostatistics – Modeling Spatial
citing opportunities to solve new and seemingly ill Uncertainty, John Wiley & Sons, New York, 2012.
posed estimation problems. [11] V. Franc, A. Zien and B. Schölkopf, Support vector
machines as probabilistic models, in Proceedings of the
28th International Conference on International Conference
However, the at times less rigorously stochastic approach
on Machine Learning, ICML’11, Omnipress, USA, 2011,
of machine learning algorithms implies weaknesses in di- pp. 665–672.
agnosing distributional characteristics and derived error [12] K. Fukumizu, A. Gretton, B. Schölkopf and B. K.
bounds for the outputs. This is a limitation in need of recti- Sriperumbudur, Characteristic kernels on groups and
fication before widespread use becomes feasible in a disci- semigroups, in in Advances in Neural Information Processing
Systems 21, D. Koller, D. Schuurmans, Y. Bengio and L. Bottou,
pline as dependent on reliability as geodesy. To a lesser de-
eds., Curran Associates, Inc., 2009, pp. 473–480.
gree, it is also expected that increasingly instruments may
[13] E. W. Grafarend and B. Schaffrin, Ausgleichungsrechnung in
arise whose measurements only yield the target quantities linearen Modellen, BI-Wissenschaftsverlag, Mannheim, Wien,
after a costly optimization — a trade off between post pro- Zürich, 1993.
cessing and instrument complexity. As the performance of [14] M. Hairer, An Introduction to Stochastic PDEs, ArXiv e-prints,
estimation and inference grows, physically motivated for- 2009.
[15] T. Hastie, R. Tibshirani and J. Friedman, The Elements of
ward models for instrument errors or the behaviour of ob-
Statistical Learning – Data Mining, Inference, and Prediction,
served objects in general might to a certain degree gradu- Springer, Berlin Heidelberg, 2013.
ally be replaced by data driven stochastic approximations. [16] U. C. Herzfeld, Least-squares collocation, geophysical inverse
Visualization may be aided by classification and cluster- theory and geostatistics: a bird’s eye view, Geophysical Journal
ing algorithms which also regularly prove useful for data International, 111 (1992), pp. 237–249.
[17] A. Hyvärinen and E. Oja, Independent component analysis:
exploration and knowledge discovery. We see less poten-
algorithms and applications, Neural Networks, 13 (2000),
tial in the less processing dominated domains of geodesy
pp. 411–430.
— those dealing particularly with the development of the- [18] G. James, D. Witten, T. Hastie and R. Tibshirani, An
oretical models or infrastructural and legislative aspects. Introduction to Statistical Learning – with Applications in R,
Springer, Berlin Heidelberg, 2013.
Acknowledgment: The authors acknowledge the work of [19] A. Kraskov, H. Stögbauer and P. Grassberger, Estimating
mutual information, Phys. Rev. E, 69 (2004) 066138,
the two anonymous reviewers who contributed to this pa-
arXiv:cond-mat/0305641.
per by providing suggestions on content and formatting [20] J. Kusche and R. Klees, Regularization of gravity field
of the paper thereby improving its readability and correct- estimation from satellite gravity gradients, Journal of Geodesy,
ness. 76 (2002), pp. 359–368.
J. Butt et al., ML and geodesy | 133

[21] F. Larkin, Gaussian measure in Hilbert space and applications [30] B. Riedel and M. Heinert, An adapted support vector machine
in numerical analysis, Rocky Mountain J. Math., 2 (1972), for velocity field interpolation at the Baota landslide, in Proc.
pp. 379–422. Application of Artificial Intelligence in Engineering Geodesy –
[22] T. M. Mitchell, Machine Learning, McGraw-Hill, New York, 1st International Workshop, Vienna, 2008, pp. 101–116.
1997. [31] S. Roman, Advanced Linear Algebra, Springer, Berlin
[23] H. Moritz, Advanced least-squares methods, Reports of the Heidelberg, 2007.
Department of Geodetic Science, 175, (1972). [32] B. Schoelkopf and A. J. Smola, Learning with Kernels –
[24] H. Neuner, Model selection for system identification by means Support Vector Machines, Regularization, Optimization, and
of artificial neural networks, Journal of Applied Geodesy, 6 Beyond, MIT Press, Cambridge, 2002.
(2012), pp. 117–124. [33] G. Strang, Introduction to Linear Algebra,
[25] J. Neveu, Processus aléatoires gaussiens, 1968. Wellesley-Cambridge Press, Wellesley, 2016.
[26] W. Niemeier, Ausgleichungsrechnung – Statistische [34] M. Wiering and M. v. Otterlo, Reinforcement Learning –
Auswertemethoden, Walter de Gruyter, Berlin, 2008. State-of-the-Art, Springer, Berlin Heidelberg, 2012.
[27] S. J. Press, Applied Multivariate Analysis – Using Bayesian [35] B. Witte and P. Sparla, Vermessungskunde und Grundlagen
and Frequentist Methods of Inference, 2nd Edition, Courier der Statistik für das Bauwesen, Vde Verlag GmbH, Berlin,
Corporation, New York, 2012. Offenbach, 2015.
[28] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for
Machine Learning, MIT Press, Cambridge, 2006.
[29] A. Reiterer, U. Egly, T. Vicovac, E. Mai, S. Moafipoor, D.
Grejner-Brzezinska and C. Toth, Application of artificial
intelligence in geodesy – a review of theoretical foundations
and practical examples, Journal of Applied Geodesy, 4 (2010),
pp. 201–217.

Mathematical Optimization: Fundamentals and Applications
From Everand
Mathematical Optimization: Fundamentals and Applications
Fouad Sabry
No ratings yet
Random Optimization: Fundamentals and Applications
From Everand
Random Optimization: Fundamentals and Applications
Fouad Sabry
No ratings yet
K Nearest Neighbor Algorithm: Fundamentals and Applications
From Everand
K Nearest Neighbor Algorithm: Fundamentals and Applications
Fouad Sabry
No ratings yet
Ex TrendSurface
No ratings yet
Ex TrendSurface
79 pages
Applied Spatial Statistics and Econometrics 1st Edition Katarzyna Kopczewska PDF Download
No ratings yet
Applied Spatial Statistics and Econometrics 1st Edition Katarzyna Kopczewska PDF Download
62 pages
Petroleum Reservoir Modeling and Simulation. Geology, Geostatistics, and Performance Reduction 1st Edition - Ebook PDF PDF Download
100% (1)
Petroleum Reservoir Modeling and Simulation. Geology, Geostatistics, and Performance Reduction 1st Edition - Ebook PDF PDF Download
50 pages
Behrens Et Al. 2018 - Spatial Modelling With Euclidean Distance Fields and Machine Learning
No ratings yet
Behrens Et Al. 2018 - Spatial Modelling With Euclidean Distance Fields and Machine Learning
14 pages
A Regression-Kriging Model For Estimation of Rainf
No ratings yet
A Regression-Kriging Model For Estimation of Rainf
10 pages
Multivariate Geostatistics - An Introduction With Applications-Springer Berlin Heidelberg (1995)
No ratings yet
Multivariate Geostatistics - An Introduction With Applications-Springer Berlin Heidelberg (1995)
263 pages
A Geostatistical Approach
No ratings yet
A Geostatistical Approach
22 pages
10 1016@j Ecolmodel 2020 109354
No ratings yet
10 1016@j Ecolmodel 2020 109354
10 pages
Probabilistic Machine Learning An Introduction Book 1 (Kevin P Murphy)
100% (1)
Probabilistic Machine Learning An Introduction Book 1 (Kevin P Murphy)
949 pages
Algorithm For Spatial Data Analysis
No ratings yet
Algorithm For Spatial Data Analysis
9 pages
A Review of Comparative Studies of Spatial Interpolation Methods in Environmental Sciences - Performance and Impact Factors PDF
0% (1)
A Review of Comparative Studies of Spatial Interpolation Methods in Environmental Sciences - Performance and Impact Factors PDF
14 pages
Kriging Method Presentation
No ratings yet
Kriging Method Presentation
26 pages
9.uncertainty and Validation
No ratings yet
9.uncertainty and Validation
30 pages
Geostatistical Linkage of National Demographic and Health Survey Data: A Case Study of Tanzania
No ratings yet
Geostatistical Linkage of National Demographic and Health Survey Data: A Case Study of Tanzania
14 pages
Application of Machine Learning Methods To Spatial Interpolation of Environmental Variables
No ratings yet
Application of Machine Learning Methods To Spatial Interpolation of Environmental Variables
13 pages
CS229
No ratings yet
CS229
216 pages
Machine Learning
No ratings yet
Machine Learning
29 pages
Combining Laboratory Measurements and Proximal Soil Sensing Data in Digital Soil Mapping Approaches
No ratings yet
Combining Laboratory Measurements and Proximal Soil Sensing Data in Digital Soil Mapping Approaches
14 pages
SoilGrids1km Global Soil Information Bas
No ratings yet
SoilGrids1km Global Soil Information Bas
17 pages
Kriging External Drift - The Most Powerful Guide
No ratings yet
Kriging External Drift - The Most Powerful Guide
24 pages
Introduction To Machine Learning - Ethem Alpaydin
100% (4)
Introduction To Machine Learning - Ethem Alpaydin
432 pages
Identification of Heat Risk Patterns in The U.S. National Capital Region Byintegrating Heat Stress and Related Vulnerability
No ratings yet
Identification of Heat Risk Patterns in The U.S. National Capital Region Byintegrating Heat Stress and Related Vulnerability
13 pages
SCH Smo 03 C
No ratings yet
SCH Smo 03 C
24 pages
Mirchooli, F. Et Al. 2020
No ratings yet
Mirchooli, F. Et Al. 2020
10 pages
Lec56 Basic Overview of Digital Soil Mapping (DSM)
No ratings yet
Lec56 Basic Overview of Digital Soil Mapping (DSM)
16 pages
Machine Learning and The Physical Sciences5-8
No ratings yet
Machine Learning and The Physical Sciences5-8
4 pages
Lecture3 KRIGING
No ratings yet
Lecture3 KRIGING
43 pages
Sigma-Point Kalman Filters For Probabilistic Inference in Dynamic State-Space Models - Van Der Merve
No ratings yet
Sigma-Point Kalman Filters For Probabilistic Inference in Dynamic State-Space Models - Van Der Merve
398 pages
Gebhardt+Gebhardt-Bayesian Methods in Geostatistics
No ratings yet
Gebhardt+Gebhardt-Bayesian Methods in Geostatistics
2 pages
Machine Learning and The Physical Sciences5-15
No ratings yet
Machine Learning and The Physical Sciences5-15
11 pages
Interfacing Geostatistics and GIS
100% (1)
Interfacing Geostatistics and GIS
282 pages
Soja GFZ SchuhColloquium2024
No ratings yet
Soja GFZ SchuhColloquium2024
20 pages
GMD 15 3161 2022
No ratings yet
GMD 15 3161 2022
22 pages
Kernel Methods in Machine Learning
No ratings yet
Kernel Methods in Machine Learning
53 pages
70 Yrs of ML
No ratings yet
70 Yrs of ML
55 pages
POSTER Juan Francisco Agreda Vega TITLE Sobre El Aprendizaje Semi Supervisado Copy Copy Copy Copy
No ratings yet
POSTER Juan Francisco Agreda Vega TITLE Sobre El Aprendizaje Semi Supervisado Copy Copy Copy Copy
1 page
(2021) Accelerating Geostatistical Modeling Using Geostatistics-Informed Machine Learning
No ratings yet
(2021) Accelerating Geostatistical Modeling Using Geostatistics-Informed Machine Learning
30 pages
Fit Without Fear - Remarkable Mathematical Phenomena of Deep Learning Through The Prism of Interpolation
No ratings yet
Fit Without Fear - Remarkable Mathematical Phenomena of Deep Learning Through The Prism of Interpolation
51 pages
Snoussi 2007
No ratings yet
Snoussi 2007
45 pages
Optimization Theory with Applications
From Everand
Optimization Theory with Applications
Donald A. Pierre
4/5 (4)
Between Classification-Error Approximation and Weighted Least-Squares Learning
No ratings yet
Between Classification-Error Approximation and Weighted Least-Squares Learning
12 pages
Application and Comparison of Several Machine Learning Algorithms and Their Integration Models in Regression Problems
No ratings yet
Application and Comparison of Several Machine Learning Algorithms and Their Integration Models in Regression Problems
9 pages
Dear The Weight
From Everand
Dear The Weight
Masud Rana
No ratings yet
ML Main Printing Material
No ratings yet
ML Main Printing Material
241 pages
MET Driving Rain Index IRELAND
No ratings yet
MET Driving Rain Index IRELAND
5 pages
Learning With Kernels Support Vector Machines, Regularization, Optimization, and Beyond by Bernhard Schlkopf, Alexander J. Smola
No ratings yet
Learning With Kernels Support Vector Machines, Regularization, Optimization, and Beyond by Bernhard Schlkopf, Alexander J. Smola
644 pages
Machine Learning and Landslide Studies: Recent Advances and Applications
No ratings yet
Machine Learning and Landslide Studies: Recent Advances and Applications
49 pages
Cs229-Main Notes Andrew NG and Tengyu Ma
No ratings yet
Cs229-Main Notes Andrew NG and Tengyu Ma
227 pages
Main Notes
No ratings yet
Main Notes
227 pages
Hengl 2009 GEOSTAT-Excelente PDF
No ratings yet
Hengl 2009 GEOSTAT-Excelente PDF
293 pages
Introduction ML
No ratings yet
Introduction ML
65 pages
2210.15548 Compressed
No ratings yet
2210.15548 Compressed
40 pages
A Comparative Study of Mathematical Modeling Methods
No ratings yet
A Comparative Study of Mathematical Modeling Methods
7 pages
LN - Optimization For ML
No ratings yet
LN - Optimization For ML
129 pages
Machine Learning
No ratings yet
Machine Learning
33 pages
Andrew NG Main - Notes PDF
No ratings yet
Andrew NG Main - Notes PDF
226 pages
Stanford ML
No ratings yet
Stanford ML
168 pages
Supervised Learning
No ratings yet
Supervised Learning
6 pages
Kriging and Thin Plate Splines For Mapping Climate Variables PDF
No ratings yet
Kriging and Thin Plate Splines For Mapping Climate Variables PDF
9 pages
Main Notes
No ratings yet
Main Notes
227 pages
Vahid
No ratings yet
Vahid
18 pages
A Practical Guide To Geostatistical - Hengl
No ratings yet
A Practical Guide To Geostatistical - Hengl
165 pages
CS229 Lecture Notes: Andrew NG and Tengyu Ma April 25, 2023
No ratings yet
CS229 Lecture Notes: Andrew NG and Tengyu Ma April 25, 2023
223 pages
LN ML Rug
No ratings yet
LN ML Rug
283 pages
MML Book PDF
No ratings yet
MML Book PDF
416 pages
MML Book
No ratings yet
MML Book
381 pages
(Bernhard Schölkopf, Alexander J. Smola) Learning With Kernels PDF
No ratings yet
(Bernhard Schölkopf, Alexander J. Smola) Learning With Kernels PDF
645 pages
CS229 Andrew NG Lecture Notes
No ratings yet
CS229 Andrew NG Lecture Notes
216 pages
MML Book 2 PDF
100% (2)
MML Book 2 PDF
421 pages
Drilling Tool Failure Diagnosis Based On GA-SVM
No ratings yet
Drilling Tool Failure Diagnosis Based On GA-SVM
4 pages
MITPress - SemiSupervised Learning PDF
No ratings yet
MITPress - SemiSupervised Learning PDF
524 pages
Bott Curt Noce 18
No ratings yet
Bott Curt Noce 18
89 pages
Prediction of Vehicles' Trajecto-Ries Based On Driver Behaviour Model
No ratings yet
Prediction of Vehicles' Trajecto-Ries Based On Driver Behaviour Model
76 pages
Kernal Methods Machine Learning
No ratings yet
Kernal Methods Machine Learning
53 pages
Learning More Accurate Metrics For Self-Organizing Maps
No ratings yet
Learning More Accurate Metrics For Self-Organizing Maps
6 pages
Toc PDF
No ratings yet
Toc PDF
5 pages
Statistical Machine Learning-The Basic Approach and Current Research Challenges
No ratings yet
Statistical Machine Learning-The Basic Approach and Current Research Challenges
35 pages
Support Vector Machine (SVM)
No ratings yet
Support Vector Machine (SVM)
45 pages
Anuranan Das Summer of Sciences, 2019. Understanding and Implementing Machine Learning
No ratings yet
Anuranan Das Summer of Sciences, 2019. Understanding and Implementing Machine Learning
17 pages
Image Estimation by Example - Geophysical Sounding - J. Claerbout
No ratings yet
Image Estimation by Example - Geophysical Sounding - J. Claerbout
326 pages
C. Cifarelli Et Al - Incremental Classification With Generalized Eigenvalues
No ratings yet
C. Cifarelli Et Al - Incremental Classification With Generalized Eigenvalues
25 pages
A Tutorial on ν-Support Vector Machines: 1 An Introductory Example
No ratings yet
A Tutorial on ν-Support Vector Machines: 1 An Introductory Example
29 pages
Statistical Learning Theory
No ratings yet
Statistical Learning Theory
4 pages
Nonlinear Transformations of Random Processes
From Everand
Nonlinear Transformations of Random Processes
Ralph Deutsch
No ratings yet
Statistical Machine Learning-The Basic Approach and Current Research Challenges
No ratings yet
Statistical Machine Learning-The Basic Approach and Current Research Challenges
35 pages

Machine Learning and Geodesy A Survey

Uploaded by

Machine Learning and Geodesy A Survey

Uploaded by

J. Appl.

Geodesy 2021; 15(2): 117–133

Jemil Butt*, Andreas Wieser, Zan Gojcic, and Caifa Zhou

Machine learning and geodesy: A survey

Machine learning methods are more flexible with re- 1

tor, Aλ is an n-dimensional vector representing estima-

tions of the noiseless ytrue at the observed locations and

{gj (⋅)}mj=1 of functions used to approximate y only enters the

λ ∈ ℝm λ = [λ1 , . . . , λm ]T λk = k-th parameter fλ (λ|v) = fv (v|λ)fλ (λ) [ ∫ fv (v|λ)fλ (λ)dλ]

sense of being a minimizer for the discrepancy measure Bayesian inference.

implying that c exp(−q(f )) is a valid probability density

Table 1: Correspondence of terminology in machine learning and adjustment.

Terminology or quantity Role in ML Role in adjustment

1. Get initial solution: λ0 ∈ ℝ3 . 3 Learning algorithms and toy

convergence. ceed to explain three algorithms which do not have an ex-

for an example see Fig. 5. The algorithm is quite robust to

E[g1 (A)g2 (X)]

Mt2 = q21 Xt + q22 At

yi = +1}, χ− := {xi ∈ χ : yi = −1} where χ is the set of all

Figure 7: Simulated atmospheric effects and deformations in forms

The figure also exhibits an example, in which χ+ and

RKHS Hk that balances the number of misclassifications

where [ ⋅ ]+ denotes the positive part and y(x) ̂ =

for some positive C dependent on the parameter α from

We speculate that an influx of ideas and procedures References

You might also like