effects and instrument specific noise. Its traditional roots and focus on analytical solutions at times require strong prior assumptions regarding problem specification and underlying probability distributions that preclude successful application in practical cases for which the goal is not regression in presence of Gaussian noise. (Think, for instance, of the arithmetic mean X̄ = n^{-1} ∑_{k=1}^n X_k of repeated observations and its performance as measured by the variance of the residual random variable X̄ − X = n^{-1} ∑_{k=1}^n X_k − X for any new independent observation X,

σ²_{X̄−X} = E[(n^{-1} ∑_{k=1}^n X_k − X)²] − (E[n^{-1} ∑_{k=1}^n X_k − X])². )

An adapted support vector machine has, for example, been employed for velocity field interpolation in the context of landslide monitoring [30] and thereby hinted at some of the potential of machine learning methods for typical geodetic core-tasks. [29] stake out the role artificial intelligence might have to play in geodesy and list several algorithms; however, they focus more on possible future developments whereas we want to make explicit mathematical equivalences and differences in perspective between the data analytical approaches taken in geodesy and machine learning.

One distinguishes machine learning tasks regarding the given inputs and the desired outputs. When a set of independent variables x_k and corresponding response variables y_k is given in the form of a sequence {(x_k, y_k)}_{k=1}^n and the algorithm is supposed to closely emulate the mapping f : x_k → y_k, the task is said to be supervised [18, pp. 26–28]. When only a sequence {x_k}_{k=1}^n is given and structure is to be found without further guidance, the task is called unsupervised. Many intermediate shades exist between the two extremes; e.g. reinforcement learning, in which an algorithm designed to find optimal strategies in a stochastically changing environment receives positive or negative feedback, but no ground truth or optimal strategy is known that could serve to construct reference values y_k [34]. This scheme of clustering machine learning tasks can be contrasted with a more output oriented one, in which a task is called regression if the output is numerical or classification if it is categorical, to name only the two most common formats [18, pp. 26–28].

The premises of geodetic data analysis as embodied by what is known as adjustment theory are typically narrower [5]: measurements are sequences of real numbers {y_k}_{k=1}^n and there exists a set of parameters {λ_k}_{k=1}^m such that its transform A({λ_k}_{k=1}^m) by some function A resembles {y_k}_{k=1}^n apart from a residual term that is assumed to be entirely stochastic in nature [26, p. 137]. Several extensions exist, most notably among them collocation; see e.g. [23, 6]. The above problem is a supervised regression problem one could equally well tackle with different methods. In the next section classical least squares solutions for a very basic estimation task are rederived from different starting points. This will reveal differences in philosophies between geodesy and machine learning regarding how to pose a problem even though the calculations ultimately yield the same equation. The equivalence of adjustment to an algorithm that may be considered as belonging to machine learning (Gaussian process regression / Kriging) and one that surely does so (optimization in reproducing kernel Hilbert spaces / splines) is shown and augmented with a Bayesian interpretation. Section 3 is devoted to toy examples from geodesy that defy being solved by adjustment theory and require algorithms from machine learning that at first glance might seem obscure in this setting but will be demonstrated to work reasonably well and arise naturally when the viewpoint developed in section 2 is taken. In those toy problems a dataset containing total station observations is subjected to a kernel based time series analysis to separate signal from noise, classified by a support vector machine (SVM) as stable or unstable and split into maximally independent parts by kernel independent component analysis (K-ICA). We hasten to note that the examples presented in this paper are of an illustrative nature, before we close with a discussion of the results and an outlook on potentially interesting and worthwhile future applications.

2 Adjustment and machine learning

We proceed by applying adjustment theory to a simple 1-dimensional regression / interpolation problem. By tackling the same task with geostatistical and functional analytic methods, the connections to statistical inference and deterministic function approximation are highlighted. This allows us to couch adjustment theory in a learning framework. Both adjustment and machine learning procedures make use of the same words but their meanings often differ considerably. To alleviate the confusion we will always define the quantities appearing in this chapter strictly mathematically and we try to keep with the usual notational customs of the respective fields as far as no contradictions arise. Furthermore, we hope that Table 1 provides a guideline to translate terminology between machine learning and adjustment based approaches to estimation and urge the reader to briefly skim over it before entering the next section. However, it is by no means complete and the reader will have to fill in some of the missing pieces him or herself as he or she advances through the text.

The mode of presentation is geared towards paralleling that of earlier survey articles establishing links between processing schemes in geodesy and various other disciplines of science; we specifically recommend [16].

2.1 Regression / interpolation problem

Suppose n observations {y_k}_{k=1}^n are given together with the locations {x_k}_{k=1}^n ⊂ X at which they were performed. The goal is to estimate the values y(x) even for unobserved locations x ∈ X; see Fig. 1 for an illustration.

A typical set of assumptions and procedures to derive a solution within an adjustment theoretic framework would consist in the items listed below.
i) Assume there is an underlying deterministic function of x depending linearly on a set of m parameters λ, i.e. (y_true)_i = ∑_{j=1}^m λ_j g_j(x_i), or y_true = Aλ with the n × m matrix A having entries (A)_{ij} = g_j(x_i).
ii) The deviations between y_true and y are due to measurement noise which is assumed to be multivariate Gaussian with expected value zero and covariance matrix Σ_v, preferably diagonal.
iii) Minimize the weighted sum of squares vᵀ Σ_v^{-1} v of residuals v(λ) = Aλ − y by choosing the optimal set λ* of parameters λ.

We arrive at the following Gauss-Markov model [26, p. 137]:

Aλ − y = v,   E[v] = 0,   E[v_i v_j] = (Σ_v)_{ij}
y ∈ ℝⁿ,   y = [y_1, ..., y_n]ᵀ,   y_k = k-th observation

λ* = argmin_{λ∈ℝᵐ} ‖Σ_v^{-1/2} (Aλ − y)‖²_{ℓ²}
   = argmin_{λ∈ℝᵐ} ‖Ãλ − ỹ‖²_{ℓ²}
   = Ã⁺ ỹ                                                                    (3)

where à = Σ_v^{-1/2} A, ỹ = Σ_v^{-1/2} y and Ã⁺ is the pseudoinverse of à [33, p. 218]. Therefore the well known formula for λ* is λ* = (Aᵀ Σ_v^{-1} A)^{-1} Aᵀ Σ_v^{-1} y. Then λ* is an m-dimensional vector containing the estimated parameters.

The same solution follows from maximizing the likelihood of the residuals,

L(λ, v) = f_v(v|λ) = (2π)^{-n/2} (det Σ_v)^{-1/2} exp[−½ v(λ)ᵀ Σ_v^{-1} v(λ)]

log f_v(v|λ) = c_1 − c_2 ([Σ_v^{-1/2} v(λ)]ᵀ [Σ_v^{-1/2} v(λ)]) = c_1 − c_2 ‖Σ_v^{-1/2} (Aλ − y)‖²_{ℓ²}            (4)

where c_1 and c_2 are constants and f_v(v|λ) is the conditional probability density function of the random variable v representing the residuals due to measurement error given parameters λ and the distributional information about their means and covariances. Since log(⋅) is a monotonous function, the maximizer of log f_v is also the maximizer of f_v and the likelihood L(λ, v), implying that the least squares solution is a maximum likelihood estimator. Note at this point that the likelihood L(λ, v) = f_v(v|λ) is proportional to f_λ(λ|v) via Bayes rule [27, p. 60].
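As an aside not contained in the original text, the closed form solution just derived can be checked numerically. The following is a minimal NumPy sketch on entirely synthetic data; the basis functions g_j(x) = x^j and the diagonal noise covariance are arbitrary choices made only for the illustration.

```python
import numpy as np

# synthetic observations y at locations x with heteroscedastic Gaussian noise
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
sigma = 0.05 + 0.1 * x                      # assumed noise standard deviations
y_true = 1.0 - 2.0 * x + 0.5 * x ** 2
y = y_true + sigma * rng.standard_normal(x.size)

# design matrix A with entries (A)_ij = g_j(x_i); here g_j(x) = x**j
A = np.vander(x, 3, increasing=True)
Sigma_v = np.diag(sigma ** 2)               # covariance matrix of the residuals

# weighted least squares:  lambda* = (A^T Sigma^-1 A)^-1 A^T Sigma^-1 y
W = np.linalg.inv(Sigma_v)
lam = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)

# equivalent route via the whitened system and the pseudoinverse, cf. eq. (3)
A_tilde = np.diag(1.0 / sigma) @ A
y_tilde = y / sigma
lam_pinv = np.linalg.pinv(A_tilde) @ y_tilde

print(lam, lam_pinv)                        # both agree up to numerical error
```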
2.2 Adjustment as a learning task

The adjustment approach to interpolation can be identified as a supervised regression problem. The set of tuples {(x_k, y_k)}_{k=1}^n are the training data, the goal is to approximate the input-output relation between the {x_k}_{k=1}^n and {y_k}_{k=1}^n, where the decision variable is the target vector λ. Essentially nothing changes if the pretext of an artificial interpolation problem is dropped; the solution of a linear adjustment problem in Gauss-Markov form can always be written as [13, p. 93]

ŷ(⋅) = ∑_{k=1}^m λ_k* g_k(⋅)                                                 (5)

λ* = argmin_{λ∈ℝᵐ} ‖A(x)λ − y‖²_H                                            (6)

where ⟨f, g⟩_H = ⟨Σ^{-1} f, g⟩_{ℓ²} is the inner product in some Hilbert space. Here we wrote A(x) to explicitly document that the design matrix contains nonlinear features in x, a notion that is quite straightforward to interpret in the interpolation case, where

         [ g_1(x_1)  ...  g_m(x_1) ]
A(x) =   [    ⋮       ⋱      ⋮     ]
         [ g_1(x_n)  ...  g_m(x_n) ]

in this case contains e.g. polynomials in x. However, in arbitrary abstract adjustment problems it might not always be easy to identify what the independent variable {x_k}_{k=1}^n corresponds to if just A(x) as a matrix of features is provided. When for example the levelling problem (7)

[  1   0   0 ]            [ H_1* ]
[  1  −1   0 ]  [ H_1 ]   [ Δh_1 ]
[  0   1  −1 ]  [ H_2 ] ≈ [ Δh_2 ]                                           (7)
[ −1   0   1 ]  [ H_3 ]   [ Δh_3 ]
     A(x)          λ          y

H_1*: approximately known height
H_k:  heights to be determined
Δh_k: measured height difference                                             (8)

is given, it is quite hard to interpret the rows of A(x) as nonlinear features of some scalar x. However, we might always resort to the mental trick of considering the rows of A(x) as linear features of a vector valued independent variable x ∈ ℝᵐ. Concretely this means having as training data {(x_k, y_k)}_{k=1}^n = {([1 0 0]ᵀ, H_1*), ([1 −1 0]ᵀ, Δh_1), ...} and approximating a function f : ℝ³ → ℝ that maps the x_k to the y_k linearly, i.e. f(x_k) = f([x_{k1}, x_{k2}, x_{k3}]ᵀ) = H_1 x_{k1} + H_2 x_{k2} + H_3 x_{k3}. But this equation just defines a hyperplane in ℝ⁴, indicating that the adjustment problem has been reduced to a simple regression in a higher dimensional space.

We present for comparison a geostatistical and a functional analytic approach that both enjoy some popularity in the machine learning community under the names of Gaussian process regression and splines in reproducing kernel Hilbert space. The equations will largely be identical but the spirit is noticeably different.

2.3 Adjustment, geostatistics and splines

In geostatistics, to solve the interpolation problem, one would assume the observations y_k to be realizations of a stochastic process {Y(x_k)}_{k=1}^n with Y(x) ∈ L²(Ω) a square integrable random variable for all x ∈ X, and an estimator Ŷ(x) for Y(x) in general is sought. Assemble this estimator as a function of the given random variables {Y(x_k)}_{k=1}^n in such a way as to minimize the expected square loss E[(Ŷ(x) − Y(x))²], which is the error variance of the estimation.

It can be proven [25] that for a zero-mean Gaussian process the best predictor Ŷ functionally dependent on some set Y_k = Y(x_k), k = 1, ..., n is the conditional expectation, which is furthermore linear in its arguments.

Ŷ(x) = E[Y(x)|Y_1, ..., Y_n]                                                 (9)

Ŷ(x) = ∑_{k=1}^n α_k Y_k                                                     (10)

Presupposing knowledge of the mean-zero joint Gaussian distribution, denote by σ(Y(x_1), Y(x_2)) the covariance E[Y(x_1)Y(x_2)] of the two random variables Y(x_1), Y(x_2) ∈ L²(Ω); x_1, x_2 ∈ X. To find those α for which Ŷ(x) = ∑_{k=1}^n α_k Y_k is the conditional expectation, minimize

E[(Ŷ(x) − Y(x))²] = σ(Ŷ − Y, Ŷ − Y) =: σ_α²(v(x)).

Since the covariance σ(⋅,⋅) is bilinear in its arguments, this amounts to solving ∂σ_α²(v(x))/∂α_k = 0, k = 1, ..., n with

σ_α²(v(x)) = σ(∑_{i=1}^n α_i Y_i − Y, ∑_{j=1}^n α_j Y_j − Y)                 (11)
           = ∑_{i=1}^n ∑_{j=1}^n α_i α_j σ(Y_i, Y_j) − 2 ∑_{i=1}^n α_i σ(Y_i, Y) + σ(Y, Y).

This immediately implies

∂σ_α²(v(x))/∂α_k = 2 [∑_{i=1}^n α_i σ(Y_i, Y_k) − σ(Y_k, Y)] = 0

and α consequently satisfies

[ σ(Y_1, Y_1)  ...  σ(Y_1, Y_n) ] [ α_1 ]   [ σ(Y_1, Y(x)) ]
[     ⋮         ⋱       ⋮       ] [  ⋮  ] = [      ⋮       ]                 (12)
[ σ(Y_n, Y_1)  ...  σ(Y_n, Y_n) ] [ α_n ]   [ σ(Y_n, Y(x)) ]
              Σ                      α            Σ_x

The above formulae are known as the simple Kriging equations [10, p. 152].
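A minimal numerical sketch of the simple Kriging equations (12) and of the predictor built from their solution follows; this is an illustrative addition, with a squared-exponential covariance function and a noise-free simulation assumed purely for the example.

```python
import numpy as np

def k(x1, x2, ell=0.2):
    # assumed covariance function K(x1, x2) = sigma(Y(x1), Y(x2))
    return np.exp(-0.5 * ((x1[:, None] - x2[None, :]) / ell) ** 2)

rng = np.random.default_rng(1)
x_obs = np.sort(rng.uniform(0.0, 1.0, 15))          # observation locations
Sigma = k(x_obs, x_obs)                             # (Sigma)_ij = sigma(Y_i, Y_j)
y_obs = np.linalg.cholesky(Sigma + 1e-10 * np.eye(15)) @ rng.standard_normal(15)

x_new = np.linspace(0.0, 1.0, 200)                  # prediction locations
Sigma_x = k(x_obs, x_new)                           # sigma(Y_i, Y(x))

# solve Sigma alpha = Sigma_x, then Y_SK(x) = alpha^T {Y_k} = Sigma_x^T Sigma^-1 {Y_k}
alpha = np.linalg.solve(Sigma, Sigma_x)
y_sk = alpha.T @ y_obs
```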
Solving this system leads to the optimal choice of coefficients α for assembling the simple Kriging predictor Ŷ_SK = ∑_{k=1}^n α_k Y_k out of measurements Y_k, k = 1, ..., n and finally

Ŷ_SK(x) = αᵀ {Y_k}_{k=1}^n = Σ_xᵀ Σ^{-1} {Y_k}_{k=1}^n.                      (13)

In the case where the mean function is also unknown, needs to be estimated and has the form h(x) = ∑_{l=1}^m β_l g_l(x), the universal Kriging system [10, p. 168] arises instead:

[ Σ   A ] [ α ]   [ Σ_x ]
[ Aᵀ  0 ] [ μ ] = [ A_x ]                                                    (14)

where α, Σ, Σ_x are defined as in equation (12), μ is some m-dimensional Lagrange multiplier, (A)_{ij} = g_j(x_i) and (A_x)_j = g_j(x) defines a column vector. For a fixed x ∈ X, the optimal estimator is the universal Kriging predictor Ŷ_UK = ∑_{k=1}^n α_k Y_k with the α chosen to satisfy the system of linear equations specified above.

By solving system (14) via substitution, the coefficient vector α = [α_1, ..., α_n]ᵀ is found explicitly and the estimator can be decomposed into three components.

α = Σ^{-1} [Σ_x − A (Aᵀ Σ^{-1} A)^{-1} (Aᵀ Σ^{-1} Σ_x − A_x)]

Ŷ_UK(x) =   Σ_xᵀ Σ^{-1} {Y_k}_{k=1}^n                                 (= Ŷ_1(x))
          + A_xᵀ (Aᵀ Σ^{-1} A)^{-1} Aᵀ Σ^{-1} {Y_k}_{k=1}^n           (= Ŷ_2(x))          (15)
          − Σ_xᵀ Σ^{-1} A (Aᵀ Σ^{-1} A)^{-1} Aᵀ Σ^{-1} {Y_k}_{k=1}^n  (= Ŷ_3(x))

Comparing the above terms to equations (13) and (3), we find that

Ŷ_1(x) = Ŷ_SK(x),   Ŷ_2(x) = Ŷ_Adjustment(x)

and Ŷ_3(x) is a cross term accounting for the fact that the estimated mean Ŷ_Adjustment(x) needs to be subtracted for normalization. An alternative way of writing (15) would therefore be

Ŷ_UK(x) = Ŷ_Adjustment(x) + V̂_SK(x)                                         (16)

where V(x) = Y(x) − Ŷ_Adjustment(x) is the residual after subtraction of the estimated mean function h(x) = A_xᵀ (Aᵀ Σ^{-1} A)^{-1} Aᵀ Σ^{-1} {Y_k}_{k=1}^n. We find the main difference to adjustment to be the existence of an estimation term for a stochastic component, owing to the fact that what we want to estimate is only somewhat correlated to the measurements.

In summary, from the geostatistical perspective the inclusion of randomness results in a more flexible model for the predictions and residuals. This contrasts with the role of randomness as a cover term to subsume unwanted and unmodelled effects in terms of deviations from the parametric model in classical adjustment.

In the approach described above, we minimized

E[(Ŷ(x) − Y(x))²] = ‖Ŷ(x) − Y(x)‖²_{L²(Ω)}

pointwise for each x ∈ X separately to derive a predictor Ŷ(x) because we took as fundamental the notion of a random variable and its variance. It is possible to abstract from this situation by introducing spaces H(X) of functions f : X → ℝ with Gaussian measures on these spaces [21] which allow writing the probability of having a randomly drawn f ∈ H(X) in the subset Q ⊂ H(X) as

P(f ∈ Q) = ∫_Q dν(f)

where the right hand side is an integral through function space against some measure ν. Under certain assumptions [14, p. 9] H(X) turns out to be completely determined by its covariance operator, an infinite dimensional analogue of the covariance matrix satisfying C_f : H(X) ∋ g → C_f g = E[⟨f, g⟩_H f] ∈ H(X), which in turn is completely specified once the second moment function K(x_1, x_2) = E[f(x_1)f(x_2)] is known [3, p. 29].

The space H(X) can be shown to be the reproducing kernel Hilbert space H_K with reproducing kernel K(⋅,⋅) : X × X → ℝ. In this function space the norm ‖f‖_{H_K} is inversely related to its probability of occurrence [14, p. 19]. This leads one to formulate the estimation problem globally for all x ∈ X simultaneously as

σ_f = argmin_{f ∈ H_K : Lf = {y_k}_{k=1}^n} ‖f‖²_{H_K}                       (17)

where σ_f(⋅) is then called an interpolating spline. The operator L : H_K → ℝⁿ goes by the name of measurement operator and relates the function f(⋅) to the observed values y_k, k = 1, ..., n. It is simple evaluation in this case, i.e. (Lf)_j = y_j = L_j f. Minimization of ‖f‖_{H_K} is reasonable as the whole problem (17) then translates to finding that function f which is y_k at the positions x_k and is most likely as described by some Gaussian measure on the Hilbert space H_K.
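Before moving on, a correspondingly minimal sketch of assembling and solving the universal Kriging block system (14) may be helpful; it is an illustrative addition, and the covariance function and the monomial trend functions g_l are assumptions chosen only for the example.

```python
import numpy as np

def k(x1, x2, ell=0.2):
    # assumed covariance function sigma(Y(x1), Y(x2))
    return np.exp(-0.5 * ((x1[:, None] - x2[None, :]) / ell) ** 2)

def universal_kriging(x_obs, y_obs, x_new,
                      g=(lambda x: np.ones_like(x), lambda x: x)):
    n, m = x_obs.size, len(g)
    Sigma = k(x_obs, x_obs) + 1e-10 * np.eye(n)      # sigma(Y_i, Y_j)
    A = np.column_stack([gl(x_obs) for gl in g])     # (A)_il = g_l(x_i)
    # block system (14): [[Sigma, A], [A^T, 0]] [alpha; mu] = [Sigma_x; A_x]
    lhs = np.block([[Sigma, A], [A.T, np.zeros((m, m))]])
    Sigma_x = k(x_obs, x_new)                        # sigma(Y_i, Y(x)) for all x_new
    A_x = np.column_stack([gl(x_new) for gl in g])   # (A_x)_l = g_l(x)
    rhs = np.vstack([Sigma_x, A_x.T])
    alpha = np.linalg.solve(lhs, rhs)[:n]            # discard the multipliers mu
    return alpha.T @ y_obs                           # Y_UK(x) = sum_k alpha_k Y_k

x_obs = np.linspace(0.0, 1.0, 12)
y_obs = 2.0 + 3.0 * x_obs + 0.3 * np.sin(12 * x_obs)
x_new = np.linspace(0.0, 1.0, 200)
y_uk = universal_kriging(x_obs, y_obs, x_new)
```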
If we decide to drop the interpolating conditions and replace them with the constraint that Lf be "close" to the observations {y_k}_{k=1}^n as measured for example in the ℓ²-norm, the smoothing spline equation (18) ensues.

σ_f = argmin_{f ∈ H_K} ‖Lf − {y_k}_{k=1}^n‖²_{ℓ²} + ‖f‖²_{H_K}               (18)

It balances fidelity to the data and likelihood of the chosen function. The explicit solution is given by [4, p. 161]

σ_f(⋅) = ∑_{j=1}^n λ_j L_j K(⋅,⋅)                                            (19)

λ = (Σ + I)^{-1} {y_k}_{k=1}^n                                               (20)

where I is the n × n unit matrix and (Σ)_{ij} = K(x_i, x_j). For a specific σ_f(x) one gets

σ_f(x) = ∑_{j=1}^n λ_j K(x_j, x) = Σ_xᵀ Σ^{-1} {y_k}_{k=1}^n

under interpolating conditions; this is just the simple Kriging estimator if K(x_1, x_2) = E[Y(x_1)Y(x_2)]. Extensions to account for unknown means are standard and ultimately yield the same predictions Ŷ_Spline(x) = σ_f(x) as universal Kriging [3, pp. 88–91].

Finally notice that at the {x_k}_{k=1}^n for which observations are available

Ŷ_Spline(x_k) = Ŷ_UK(x_k) = Ŷ_Adjustment(x_k) + V̂_k = y_k

where V̂ is the residual Aλ* − {y_k}_{k=1}^n from the adjustment procedure. This allows the conclusion that splines and Kriging as representers of machine learning approaches on the one hand and adjustment as a representer of classical geodetic techniques on the other hand are basically equivalent, bar the philosophical difference of what is considered uninteresting noise to be discarded and what is not.

Another difference is that in the adjustment formulation the decision variable is a parameter vector λ which determines a function f, whereas in the machine learning formulation the function f is itself the decision variable to be determined via optimization.

2.4 Connection to other norm-based algorithms

Defining estimators as solutions to norm minimization problems is a common method of formalization in both geodesy and machine learning. In geodetic estimation tasks the quantity to be optimized is often the likelihood of residuals whose assumed Gaussian distribution yields the classical least squares formulations. Estimation tasks arising in machine learning seem to make a strict distinction between deterministic signal and random noise less often and at times avoid making use of distributional assumptions altogether. Instead, they communicate an estimator's desirability via an objective function that is not in all cases stochastically motivated. This leads to a wider variety of estimators whose properties are less well-known but interesting nonetheless. As shown in the previous equations (14) and (18), norm minimization tasks of the type

σ_f = argmin_{f ∈ H_K} ‖Af − y‖²_{ℓ²} + ‖f‖²_{H_K}                           (21)

correspond to optimal estimation in presence of white noise on the measurements Af of a stochastic process f with covariance function K(⋅,⋅). This correspondence extends uniquely to an adjustment problem Aλ − y = v with white noise v on the measurements and a prior that favors small lengths of the coefficient vector λ. Even though a prior on coefficient vectors λ with (Aλ)_k = (∑_{j=1}^m λ_j g_j(x_k))_k ≈ y_k seems, at least from this perspective, puzzling at first, it enters naturally if one assumes that the linear combination ∑_{j=1}^m λ_j g_j(⋅) is itself chosen randomly with the λ_j's distributed as multivariate Gaussian.

This opens up interpretations of further machine learning methods that are similar in flavour to the abstract spline problem (21). Consider for example

Ridge regression:  σ_f = argmin_{f∈ℝᵐ} ‖Af − y‖²_{ℓ²} + α‖Bf‖²_{ℓ²}
LASSO:             σ_f = argmin_{f∈ℝᵐ} ‖Af − y‖²_{ℓ²} + α‖f‖_{ℓ¹}
Elastic net:       σ_f = argmin_{f∈ℝᵐ} ‖Af − y‖²_{ℓ²} + α_1‖f‖²_{ℓ²} + α_2‖f‖_{ℓ¹}

[15, pp. 61, 68, 118] where the α's are some positive constants that determine whether faithfulness to the data or regularity of the estimator is prioritized and B is some linear operator. In the above, ‖⋅‖_{ℓp} denotes the classical ℓp norms, i.e.

‖f‖_{ℓp} = (∑_{k=1}^m |f_k|^p)^{1/p}.

Note that the ℓp norms are nonnegative functions of f; consequently minimizing them is equivalent to maximizing a likelihood. This holds since for any nonnegative function q(f) ≥ 0 ∀f ∈ ℝᵐ satisfying additional constraints, exp(−q(f)) is normalizable with c^{-1} = ∫_{ℝᵐ} e^{−q(f)} df < ∞ and c exp(−q(f)) can hence be interpreted as a probability density function whose maximization is equivalent to the minimization of q(f).
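The three estimators above admit equally compact implementations. The sketch below is an illustrative addition: it uses the closed form for ridge regression with B = I and a plain iterative soft-thresholding loop for the LASSO; step size, iteration count and the synthetic data are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((100, 20))
f_true = np.zeros(20)
f_true[[2, 7, 11]] = [1.5, -2.0, 0.8]          # sparse ground truth
y = A @ f_true + 0.05 * rng.standard_normal(100)

# ridge regression with B = I: closed form (A^T A + alpha I)^-1 A^T y
alpha = 1.0
f_ridge = np.linalg.solve(A.T @ A + alpha * np.eye(20), A.T @ y)

# LASSO via iterative soft-thresholding on ||Af - y||^2 + alpha * ||f||_1
def soft(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

step = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2)  # <= 1 / Lipschitz constant of the gradient
f_lasso = np.zeros(20)
for _ in range(2000):
    grad = 2.0 * A.T @ (A @ f_lasso - y)
    f_lasso = soft(f_lasso - step * grad, step * alpha)
```

In line with the discussion that follows, most entries of f_lasso end up exactly zero while f_ridge is merely shrunk.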
Figure 2: The ‖⋅‖_{ℓ¹} and ‖⋅‖_{ℓ²} norms and their corresponding probability densities associated with the Gaussian and Laplacian distribution. Note the Laplacian's heavier tails.

Figure 3: The ℓ¹-norm based estimation is much more robust than the ℓ²-norm based estimation. The lower panels show the residuals between coordinates in system 2 and the coordinates transformed into system 2 from system 1 via transformations derived from Eq. (22). The scale in the lower panels is identical.

The Gaussian pdf's derivative at its mean is zero; the pdf's value converges to zero extraordinarily fast. The Laplacian pdf in contrast has heavy tails but its derivative at the mean is undefined. We extract the following from our discussion and the images in Fig. 2:

I  When minimizing the ℓ²-norm or equivalently maximizing the likelihood under a Gaussian pdf, small residuals are considered almost irrelevant since the gradient of ‖⋅‖²_{ℓ²} around 0 is zero. Large deviations are punished disproportionately strongly: during minimization, decreasing a big residual is considered more favourable than decreasing several small ones by the same amount.
II When minimizing the ℓ¹-norm or equivalently maximizing the likelihood under a Laplacian pdf, small residuals are punished less than big ones but still severely, as the gradient of ‖⋅‖_{ℓ¹} around 0 is constant and positive, driving either f to sparsity (if ‖f‖_{ℓ¹} → min) or leading to sparse residuals (if ‖Af − y‖_{ℓ¹} → min). Big residuals are penalized proportionally: decreasing a big residual is as good as decreasing an already small residual by the same amount.

Combining I and II explains why ℓ¹-norm minimization leads to sparse and robust estimators that can systematically outperform ℓ²-norm based least squares solutions. Therefore ridge regression might be seen as adjustment with a prior on the length of Bλ, LASSO has a sparsity prior on the parameter vector λ and elastic net regularization balances both. To obtain the usual interpretations, swap f for λ in the above and assume the stochastic process f to be determined by a multivariate Gaussian on Bf, to be sparse, or a combination of both.

We demonstrate the performance difference of ℓ²- and ℓ¹-norm based estimation in the presence of outliers for the typical geodetic task of inferring a Helmert transformation with fixed scale from coordinate measurements in Fig. 3. We briefly sketch the algorithm used to find the optimal transformation A(λ*) with

λ* = argmin_{λ=[x_A, y_A, φ_A] ∈ ℝ³} ‖A(λ)x − y‖_{ℓp},   p = 1, 2            (22)

that maps the coordinates x in system 1 onto the coordinates y in system 2.
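One possible implementation of the estimator (22) is sketched below purely as an illustration; it is not necessarily identical to the procedure used for Fig. 3. The transformation is parametrized by λ = [x_A, y_A, φ_A] and the ℓ¹ or ℓ² cost is handed to a general purpose optimizer, here scipy.optimize.minimize with the Nelder-Mead method; the synthetic coordinates and outliers are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize

def transform(lam, x):
    # 2-d Helmert transformation with fixed scale: rotation plus translation
    xa, ya, phi = lam
    R = np.array([[np.cos(phi), -np.sin(phi)],
                  [np.sin(phi),  np.cos(phi)]])
    return (R @ x.T).T + np.array([xa, ya])

def fit(x, y, p):
    # lambda* = argmin || A(lambda) x - y ||_lp  for p = 1, 2, cf. eq. (22)
    cost = lambda lam: np.sum(np.abs(transform(lam, x) - y) ** p) ** (1.0 / p)
    return minimize(cost, x0=np.zeros(3), method="Nelder-Mead").x

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 100.0, (30, 2))                     # coordinates in system 1
y = transform([5.0, -3.0, 0.02], x) + 0.01 * rng.standard_normal((30, 2))
y[:3] += 5.0                                             # a few gross outliers

lam_l2, lam_l1 = fit(x, y, 2), fit(x, y, 1)              # l1 is far less affected by the outliers
```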
Table 1: A guideline for translating terminology between machine learning and adjustment based approaches to estimation.

Operator A in ‖Aq − y‖ → min.
  Machine learning: A is an operator that maps f onto measurements of f and emulates the way that observations y are generated. Af then typically is f evaluated at points x_k, k = 1, ..., n. A is called the measurement operator.
  Adjustment: A is a matrix whose rows are vectors of (nonlinear) transformations of the points x_k, k = 1, ..., n. Aλ then typically is f(⋅) = ∑_{j=1}^m λ_j g_j(⋅) evaluated at the points x_k. A is called the design matrix.

Term ‖Bq‖_r^r in ‖Aq − y‖_p^p + ‖Bq‖_r^r → min.
  Machine learning: With q = f, ‖Bf‖_r^r is a regularization term that includes a prior on the function f into the estimation of f. The exact nature of the prior depends on the norm r and the energy operator B.
  Adjustment: Since q = λ is a vector of parameters, there seems hardly any justification for penalizing terms ‖Bλ‖_r^r. With B = αI, α > 0 and r = 2 they may be introduced for numerical reasons under the name of Tikhonov regularization.

Randomness and residuals
  Machine learning: The quantity f to be estimated is assumed to come from a stochastic process. Unmodelled effects can be pushed onto f during estimation but f is very flexible. The residuals v = Af − y are random too; f and v are distinguishable only via their correlation structure.
  Adjustment: The quantities λ to be estimated are assumed to have fixed deterministic values. Randomness is a property of the residuals v = Aλ − y that act as a flexible catch-all term subsuming all effects unaccounted for by the parametric model.

Features
  Machine learning: A feature is a potentially infinite dimensional vector that contains (nonlinear) transformations of the input variable x, i.e. g(x) = {g_j(x)}_{j=1}^∞ for some set of functions g_j.
  Adjustment: During the construction of the design matrix A, the concept is used implicitly. Each of its rows can be interpreted as a feature in some input variable x.

Representations
  Machine learning: A representation of a dataset {y_k}_{k=1}^n is a choice of basis functions {g_j}_{j=1}^∞ such that each y_k is representable as a combination of g_j's. A representation can be determined automatically by solving an optimization problem.
  Adjustment: The choice of a good representation is left to the practitioner, whose responsibility it is to either determine a set of functions {g_j}_{j=1}^m such that ∑_{j=1}^m λ_j g_j(x_k) approximates y_k or derive them from the geometrical or physical configuration of the task.

Note that many special procedures exist, which is why our explanations are geared to a proper description of only a simple subset of tasks that might be formulated as the minimization of discrepancy and irregularity measures. We consider these to be a good first order approximation to many commonly encountered problems in both fields.
of basis functions K_x(⋅) := K(x, ⋅) evaluated at the sample points {x_k}_{k=1}^n. The stochastically optimal choice for this so called kernel function turned out to be given by the covariance function K(x_1, x_2) = E[F_{x_1} F_{x_2}], where for all x ∈ X, F_x : Ω ∋ ω → F_x^ω ∈ ℝ was a square integrable random variable indexed by the space variable x ∈ X.

This view immediately suggests to generalize the finite dimensional covariance matrix Σ_F of a random vector F taking values in ℝⁿ, n < ∞, and satisfying

⟨Σ_F g, h⟩_{ℝⁿ} = ⟨E[F ⊗ F*] g, h⟩_{ℝⁿ} = E[⟨F, g⟩_{ℝⁿ} ⟨F, h⟩_{ℝⁿ}]   ∀g, h ∈ ℝⁿ,          (23)

towards the typically infinite dimensional covariance operator C_F exhibiting an exactly analogue relationship [3, p. 29]. This can be done by defining it as the selfadjoint positive definite kernel operator C_F : H_K ∋ g → (C_F g)(⋅) := ∫_X K(x, ⋅) g(x) dx ∈ H_K.

As such, K(⋅,⋅) takes the role of a function determining the entries in an infinite dimensional covariance matrix that will intuitively be recognized by the practical geodesist as a natural extension of the already known frameworks to function space valued estimation problems. For the remainder of the paper we term this way of thinking about a kernel the covariance-interpretation. There is, however, a second radically different perspective onto kernels that is used concurrently in machine learning [32, p. 39] and emphasizes the meaning of K(x, ⋅) = ϕ_x(⋅) ∈ H_K as a Hilbert space valued nonlinear feature of x ∈ X.

An instructive way to illustrate this consists in two separate steps that are roughly sketched for the special case of a Gaussian kernel K(x_1, x_2) = exp(−‖x_1 − x_2‖²) for x_1, x_2 ∈ X ⊂ ℝ.
1. Rewrite −‖x_1 − x_2‖² as −‖x_1‖² + 2⟨x_1, x_2⟩ − ‖x_2‖² and substitute this term in the exponential expression for K(x_1, x_2) to derive

   K(x_1, x_2) = c(x_2) e^{2 x_1 x_2 − x_1²} = c(x_2) ∑_{n=0}^∞ (H_n(x_2)/n!) x_1ⁿ = ∑_{n=0}^∞ α_n(x_2) x_1ⁿ          (24)

   where H_n(⋅) is the n-th Hermite polynomial [31, p. 456] and α_n(x_2) := c(x_2) H_n(x_2)/n! is a function solely depending on x_2.
2. Notice that K(x_1, x_2) is effectively a linear superposition of monomials x_1ⁿ, n ∈ ℕ_0, where the coefficient vector {α_n}_{n∈ℕ_0} depends on the exact value of x_2. As x_2 is varied to x_2′, the coefficient vector changes as well, resulting in K(x_1, x_2′) being a different linear combination of powers of x_1. When x_2 ∈ X is not fixed at all, then K(x_1, ⋅) = ϕ_{x_1} is a function from X to ℝ and we have at the same time ϕ_{x_1}(⋅) as an element of a (reproducing kernel) Hilbert space H_K [32, p. 39] and as an infinite set of X-parametrized powers of x_1. As ϕ_{x_1}(⋅) contains nonlinear information about x_1 it is called a (nonlinear) feature of x_1.

It should now be clear that ϕ_x(⋅) ∈ H_K is an infinite dimensional representation of x ∈ X that for specific choices of K(⋅,⋅) can even encode all the information possibly to be known about x ∈ X [12]. We will call this the feature-interpretation of a kernel in what follows.

The reader is advised to not mix up both interpretations as the implied objects of investigation are different. The covariance-interpretation assumes the measured objects to be (nonlinear) functions f : X → ℝ with f ∈ H_K subjectable to linear operations only. In contrast to this, the feature-interpretation assumes the measured objects to be x ∈ X and embeds them nonlinearly in H_K for some reproducing kernel Hilbert space H_K. Both perspectives rely on the algebraic and geometric properties of the involved reproducing kernel Hilbert spaces (RKHS), whose internal structure as determined by the kernel K(⋅,⋅) impacts the form of estimators and feature embeddings alike.

We proceed to apply both high dimensional embedding philosophies to supervised and unsupervised problems and start with the more familiar one of interpreting kernels as generators of covariance matrices.
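As a small numerical aside not part of the original text, the feature-interpretation encoded in equation (24) can be verified by truncating the Hermite expansion of the Gaussian kernel; the truncation order and the evaluation points below are arbitrary.

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite import hermval

def gauss_kernel(x1, x2):
    return np.exp(-(x1 - x2) ** 2)

def truncated_kernel(x1, x2, N=20):
    # K(x1, x2) ~= sum_{n=0}^{N} x1**n * alpha_n(x2)
    # with alpha_n(x2) = exp(-x2**2) * H_n(x2) / n!, cf. eq. (24)
    total = 0.0
    for n in range(N + 1):
        coeff = np.zeros(n + 1)
        coeff[n] = 1.0                          # selects the n-th Hermite polynomial
        H_n = hermval(x2, coeff)                # physicists' Hermite polynomial H_n(x2)
        total += x1 ** n * np.exp(-x2 ** 2) * H_n / factorial(n)
    return total

x1, x2 = 0.7, -0.4
print(gauss_kernel(x1, x2), truncated_kernel(x1, x2))   # nearly identical values
```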
3.1 Application 1: signal separation for total station data when the covariances are known

Suppose a total station was set up as depicted in Fig. 4 to monitor the movement of a prism. To not unnecessarily complicate this example, it will be assumed that no tilt occurs and movement of the prism is constrained to purely lie in the x-direction, implying a one dimensional formulation to be sufficient. The measurements are supplied in the form of a time series of x-coordinates.

Given: A sequence of times t_j ∈ T and corresponding measurements m_j ∈ ℝ of the x-coordinate in the format {(t_j, m_j)}_{j=1}^n where n is the number of measurements.

Goal: Split the signal into separate parts that are in a stochastically reasonable way optimally identifiable with noise, atmospheric influences and true x-coordinate.

Assumption: The measurements m_j are realizations of square integrable random variables M_{t_j} : Ω ∋ ω → M_{t_j}^ω ∈ ℝ for all t_j ∈ T; i.e. {M_t : t ∈ T} is a stochastic process and separable in the following way:

M_t = N_t + A_t + X_t                                                        (25)

where the stochastic processes N_t, A_t, X_t correspond to noise, atmosphere and x-coordinate respectively and are independent from each other. Their covariance functions (= kernels K_N, K_A, K_X) are assumed to be either known approximately or inferrable from the different time scales of N_t, A_t, X_t that find their expression in the decay characteristics of the kernels.

Main idea: The measurements M_(⋅) : Ω ∋ ω → M_(⋅)^ω ∈ ℝ^T are assumed to lie in some infinite dimensional Hilbert space H_M with kernel K_M = K_N + K_A + K_X, which implies that H_M is the direct sum of the Hilbert spaces containing pure noise, atmospheric influences and x-coordinates in the sense that H_M = H_N ⊕ H_A ⊕ H_X. For a more rigorous account, see [21]. An optimal interpolating spline σ_m is found such that σ_m(t_j) perfectly coincides with the measurements at times t_j.

σ_m(⋅) = argmin_{m ∈ H_M : m(t_j) = m_j} ‖m‖²_{H_M}                          (26)

σ_m(⋅) = ∑_{j=1}^n λ_j K_M(t_j, ⋅)   with   λ = (K_M^{ij})^{-1} m            (27)

whereby K_M^{ij} is the matrix in ℝⁿ ⊗ ℝⁿ with entries K_M(t_i, t_j) and m ∈ ℝⁿ is the n-dimensional vector containing the measurements. Notice that σ_m(⋅) is not a number but a function of t ∈ T. Subsequently σ_m(⋅) ∈ H_M will be orthogonally projected onto the subspaces H_N, H_A and H_X to yield the optimal estimators σ_N = Π_N σ_m, σ_A = Π_A σ_m, σ_X = Π_X σ_m.

Results: When reliable covariance information is available, the results are stochastically optimal under a Gaussian process assumption [28, p. 27]. Also on a purely visual level the outcome of applying the estimation procedure above to an exemplary dataset seems reasonable; see Fig. 5.

3.2 Application 2: signal separation for total station data when the covariances are unknown

In what follows, the requirements on prior knowledge are relaxed and the covariance structure is no longer assumed to be known. In only presupposing the measurements as being made up of statistically independent parts, we pass from a supervised to an unsupervised learning problem that has no analytical solution anymore. Before describing and applying K-ICA in this setting, it is instructive to explain the most common measures for characterizing independence of random variables.

Two random variables X, Y : Ω → ℝ are called independent if, bar some technicalities concerning measurability and continuity, their joint probability density function f_XY(x, y) factors into the product of its marginals; i.e. f_XY(x, y) = f_X(x) f_Y(y) [1, p. 91]. One writes X ∐ Y in this case. It is well known that two jointly multivariate Gaussian distributed random variables X and Y are uncorrelated if and only if they are independent [27, p. 71]. Generally, however, two random variables X, Y may have zero correlation without necessarily being independent. In the non-Gaussian case the covariance

cov(X, Y) = ∫_{ℝ²} (x − E[X])(y − E[Y]) f_XY(x, y) dx dy

is therefore a necessary but insufficient indicator of independence and needs to be replaced by the entropy-based mutual information I(X, Y).

The mutual information between two random variables is defined as [19]

I(X, Y) = ∫_{ℝ²} f_XY(x, y) log[ f_XY(x, y) / (f_X(x) f_Y(y)) ] dx dy        (28)

where the expression is evaluated as an appropriate limit in any pathological cases. It can be shown that I(X, Y) = 0 ⇔ X ∐ Y and otherwise I(X, Y) > 0.
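A quick numerical illustration of the statement above (an addition; the toy distribution and the crude histogram estimator of (28) are arbitrary choices): zero correlation does not imply vanishing mutual information.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1.0, 1.0, 100_000)
y = x ** 2 + 0.05 * rng.standard_normal(x.size)     # dependent on x, yet uncorrelated

print(np.corrcoef(x, y)[0, 1])                      # close to zero

# crude plug-in estimate of I(X, Y) from a 2-d histogram, cf. eq. (28)
p_xy, _, _ = np.histogram2d(x, y, bins=40)
p_xy = p_xy / p_xy.sum()
p_x = p_xy.sum(axis=1, keepdims=True)
p_y = p_xy.sum(axis=0, keepdims=True)
mask = p_xy > 0
mi = np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask]))
print(mi)                                           # clearly positive
```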
Figure 5: An example of signal separation. The left panels show the true underlying ground truth (synthetic data), that is superimposed
to generate the signal plotted in the center. This time series is the input for the RKHS based estimation framework outlined in section 3.1
whose output are the estimations visible on the right side. The scale is the same for the six outer plots.
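A decomposition of the kind shown in Fig. 5 can be reproduced along the lines of equations (26) and (27). The concrete kernels in the following sketch (white noise plus a short and a long squared-exponential) are assumptions standing in for the covariance information discussed in section 3.1 and are not taken from the original experiment.

```python
import numpy as np

def sqexp(t, sig, ell):
    d = t[:, None] - t[None, :]
    return sig ** 2 * np.exp(-0.5 * (d / ell) ** 2)

rng = np.random.default_rng(5)
t = np.linspace(0.0, 10.0, 300)

# assumed covariance functions for noise, atmosphere and x-coordinate
K_N = 0.02 ** 2 * np.eye(t.size)          # white measurement noise
K_A = sqexp(t, sig=0.05, ell=0.3)         # quickly varying atmospheric part
K_X = sqexp(t, sig=0.20, ell=3.0)         # slowly varying true coordinate
K_M = K_N + K_A + K_X                     # kernel of the sum process, cf. eq. (25)

# simulate one realization m of the measurement process
L = np.linalg.cholesky(K_M + 1e-12 * np.eye(t.size))
m = L @ rng.standard_normal(t.size)

# interpolating spline coefficients: lambda = K_M^-1 m, cf. eq. (27)
lam = np.linalg.solve(K_M, m)

# estimates of the three components at the observation times
x_hat = K_X @ lam
a_hat = K_A @ lam
n_hat = K_N @ lam                         # x_hat + a_hat + n_hat reproduces m exactly
```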
Therefore from a theoretical perspective, if one wanted to split a timeseries {m_j}_{j=1}^n linearly into independent components {a_j}_{j=1}^n and {x_j}_{j=1}^n, one could try to minimize the mutual information between the {a_j}_{j=1}^n and the {x_j}_{j=1}^n, which are assumed to be realizations of random variables A and X with significantly different probability distributions. Typically estimating the mutual information empirically is hard, however, and it is more common to instead maximize a contrast function ρ(A, X) that convincingly measures how different the distribution of A is from that of X. Such contrast functions ρ are typically derived from a Taylor expansion of −I(⋅,⋅) in terms of features of probability distributions (e.g. third moment or higher order cumulants) that partially emulate the property of −I(A, X) being biggest for f_A being very different from f_X [2].

It is entirely possible to apply the aforementioned infinite dimensional embedding of probability distributions into an RKHS H_K in this setting. An efficiently computable measure of dependence to be minimized is then given by the H_k-correlation, which allows an efficient finite dimensional implementation of this infinite dimensional problem.

Suppose a total station S was set up as depicted in Fig. 6 to monitor the movement of two prisms P_1, P_2 mounted on a planar structure subject to a translational rigid, but time dependent change of coordinates. To keep the example simple, only the x-coordinates will be investigated to arrive again at a one dimensional formulation that parallels the one presented in section 3.1 but with increased difficulty due to the absence of any knowledge of the correlation structure of the signals to be separated.

Goal: Split the measured signals into parts optimally identifiable with atmospheric influences and the true x-coordinates of P_1, P_2.

Assumption: The measurements m_j^k are realizations of square integrable random variables M_{t_j}^k for all t_j ∈ T, i.e. {M_t^k : t ∈ T} are two stochastic processes which we assume to be linear mixtures of deformations and atmospheric influences:

M_t^1 = q_11 X_t + q_12 A_t
M_t^2 = q_21 X_t + q_22 A_t

For this model to hold, the deformation needs to be described by a translation to guarantee that the behaviour of the two prisms' x-coordinates is identical. Furthermore the atmospheric conditions need to be constant over the whole spatial domain to ensure that their influence on the measurement series {m_j^k}_{j=1}^n is representable as terms q_12 A_t and q_22 A_t linearly related to some underlying scalar A_t.

Further explanation: The atmospheric conditions are allowed to vary in time. We may calculate the entries of Q based on knowledge of the geometrical configurations and usual formulas for distance reduction of electrooptic measurements by noting that the atmospheric correction Δx_k satisfies

Δx_k = Δd_k cos φ_k = α(temperature, pressure) ⋅ d_k cos φ_k

where α is a meteorological correction factor (see for example [35, p. 310]); the factor α(temperature, pressure) plays the role of A_t and d_k cos φ_k that of q_{k2}. This would allow us to solve the problem immediately by inverting Q and applying Q^{-1} =: W to the sequences of measurements; however, we do not want to do this but demand that the algorithm finds the most probable decomposition based not on physically or geometrically motivated knowledge but solely on the probabilistic assumption that the x-coordinates and atmosphere are stochastically independent of each other. Neither X_t nor A_t are allowed to be Gaussian since approximate stochastical independence will be achieved by maximizing some measure of non-Gaussianity [8].

Main idea: Since the measurements M_t^k : Ω ∋ ω → M_t^k(ω) ∈ ℝ are supposedly both linear mixtures of X_t and A_t with X_t ∐ A_t, a 2 × 2 matrix W with W M_t = Ŷ_t maximally independent in the sense of mutual information would solve the problem apart from the usual ambiguities encountered during ICA [17]. To approximately achieve this, minimize the H_k-correlation

ρ_{H_k}(X̂_t, Â_t) = sup_{f,g ∈ H_k}  E[f(X̂_t) g(Â_t)] / √(E[f²(X̂_t)] E[g²(Â_t)])

of the candidate components Ŷ_t = [X̂_t, Â_t]ᵀ:

I   Compute the kernel matrices K_1, K_2 of the two candidate signals obtained from Ŷ_t = W M_t. Then center the kernel matrices.
II  Solve the regularized kernel canonical correlation generalized eigenvalue problem

    [    0      K_1 K_2 ] [ λ ]         [ (K_1 + αI)²       0      ] [ λ ]
    [ K_2 K_1      0    ] [ μ ]  = ρ_W  [      0       (K_2 + αI)² ] [ μ ]

    for ρ_W to determine the H_k correlation dependent on the matrix W.
III Minimize −1/2 log λ_W, where λ_W is the smallest of the generalized eigenvalues ρ_W, by gradient descent on the manifold of orthogonal matrices.

Results: When the two underlying stochastic processes X_t and A_t generate non-Gaussian data and are governed by probability distributions which are reasonably well distinguishable via linear combinations of higher order statistical moments, the splitting achieved via kernel ICA is convincing. For an exemplary application to a simulated dataset, see Fig. 7.

The situation exhibited in Fig. 6 is not entirely realistic and would need to be modified for any actual application in practice; however, the idea of splitting several sequences of measurements into maximally independent components is a promising one. The framework is applicable whenever measurements generate for each point in time a whole vector of values and there are reasons to suspect that each entry in that vector is a linear mixture of quantities of actual interest.
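A rough sketch of such a procedure is given below as an illustrative addition. It follows the kernel canonical correlation idea of steps I and II but simplifies step III: instead of gradient descent on the orthogonal group, the 2 × 2 rotation angle is found by a grid search, and the contrast −½ log(1 − ρ²) built from the largest regularized generalized eigenvalue is used; kernel, regularization and sample size are arbitrary choices.

```python
import numpy as np
from scipy.linalg import eigh

def rot(theta):
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

def centered_gram(x, ell=1.0):
    d = x[:, None] - x[None, :]
    K = np.exp(-0.5 * (d / ell) ** 2)                 # Gaussian kernel Gram matrix
    H = np.eye(x.size) - np.ones((x.size, x.size)) / x.size
    return H @ K @ H                                  # centering, step I

def contrast(Y, alpha=1.0):
    # regularized first kernel canonical correlation of the two candidate signals
    K1, K2 = centered_gram(Y[0]), centered_gram(Y[1])
    n = K1.shape[0]
    Z = np.zeros((n, n))
    A = np.block([[Z, K1 @ K2], [K2 @ K1, Z]])
    R1, R2 = K1 + alpha * np.eye(n), K2 + alpha * np.eye(n)
    B = np.block([[R1 @ R1, Z], [Z, R2 @ R2]])
    rho = eigh(A, B, eigvals_only=True)[-1]           # largest generalized eigenvalue
    return -0.5 * np.log(1.0 - rho ** 2)              # zero iff the signals look independent

def kernel_ica(M, n_angles=60):
    # whiten the two mixtures, then search the rotation minimizing the contrast
    M = M - M.mean(axis=1, keepdims=True)
    vals, vecs = np.linalg.eigh(np.cov(M))
    Mw = vecs @ np.diag(vals ** -0.5) @ vecs.T @ M
    thetas = np.linspace(0.0, np.pi, n_angles)
    best = min(thetas, key=lambda th: contrast(rot(th) @ Mw))
    return rot(best) @ Mw                             # estimated independent components

rng = np.random.default_rng(6)
s = np.vstack([np.sign(rng.standard_normal(200)),     # two non-Gaussian sources
               rng.uniform(-1.0, 1.0, 200)])
unmixed = kernel_ica(np.array([[1.0, 0.6], [0.4, 1.0]]) @ s)
```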
The resulting class prediction function is

ŷ(x) = sign(∑_{j=1}^n λ_j y_j K(x, x_j) + d)                                 (33)

with d = 1/y_i − f(x_i) for any i = 1, ..., n. This classifier is termed support vector machine and we will apply it immediately to the problem outlined before.

Let the time series in Fig. 9 be the input x for our classification problem; the sets χ_+ and χ_− providing exemplary time series associated to harmful and harmless situations are sampled there as well by listing some representatives.

Given: A sequence {x_j}_{j=1}^n of deformation measurements x_j = {x_j^i}_{i=1}^m ∈ ℝᵐ at times t_i ∈ T in the format {(t_i, x_j^i)}_{i=1}^m, where the sequence x_j ∈ ℝᵐ is the interesting part and the time information will regularly be discarded. There is furthermore a training set of examples {(x_j, y_j)}_{j=1}^n where again each x_j is a time series and y_j is the corresponding label.

Goal: Emulate the input-output behaviour mapping time series onto danger assessments via the class prediction function ŷ(x) defined in equation (33). It makes use of the RKHS H_k of functions on ℝᵐ with reproducing kernel k(⋅,⋅) that maps pairs of time series onto a real number quantifying their similarity.

Assumption: The set of time series χ_+ associated to dangerous behaviour is approximately linearly separable from the set χ_− after embedding it into the infinite dimensional Hilbert space H_k of features via the map

ϕ : ℝᵐ ∋ x → ϕ(x) = k(x, ⋅) ∈ H_k.

Furthermore assume that the euclidean distance is a meaningful measure of closeness between time series. Usage of the linear kernel k(x_1, x_2) = ⟨x_1, x_2⟩_{ℝᵐ} derived from the inner product in ℝᵐ is then justified.

Main idea: Solve the optimization problems specified in equations (31) or (32) to find a parameter vector λ and a constant d such that the classifier ŷ assembled from λ and d according to equation (33) has both acceptable regularity and misclassification rate on the training set. Afterwards, apply ŷ : ℝᵐ → {−1, +1} to unseen time series to classify them.

Results: Support vector machines usually perform reasonably well although more sophisticated methods exist for function approximation problems [9].
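For completeness, a sketch of the training and prediction step follows. It is an illustrative addition: the experiments reported below use the Matlab built-in "fitclinear", whereas scikit-learn is used here as a stand-in, and the synthetic labelled random walks merely mimic the simulation described at the end of this section.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(7)
m, n = 50, 400                                     # length of each series, number of examples

# synthetic labelled time series: random walks, labelled by the sign of their trend
X = np.cumsum(rng.standard_normal((n, m)), axis=1)
t = np.arange(m)
y = np.where(np.array([np.polyfit(t, x, 1)[0] for x in X]) > 0, 1, -1)

clf = SVC(kernel="linear", C=1.0)                  # linear kernel k(x1, x2) = <x1, x2>
clf.fit(X[:300], y[:300])                          # training examples
print((clf.predict(X[300:]) == y[300:]).mean())    # hold-out classification accuracy
```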
Table 2 summarizes the SVM's behaviour in terms of errors of the first and second kind. For the estimation of empirical error probabilities the cycle of simulating ground truth, fitting an SVM and classifying 100 randomly chosen time series was rerun 100 times while the amount of training examples was subjected to systematic change. Classification was done using the Matlab built-in "fitclinear".

Table 2: Performance of SVM's for the specific task outlined above.

                      samples
error            10      10²     10³     10⁴
type I in %      6.6     1.8     0.6     0.2
type II in %     6.7     2.0     0.6     0.2

Empirically estimated probabilities of type I error (incorrect rejection of the null hypothesis H_0) and type II error (failure to reject an incorrect null hypothesis H_0). H_0 is the hypothesis that y(x) = −1.

We want to close this section with a few clarifying remarks regarding simulation methodology and a link to classical hypothesis testing.

This example is again purely synthetic. We randomly sampled from a stochastic process that corresponds to Brownian motion; each realization was considered to be a time series x_j ∈ ℝᵐ of deformation measurements. If the best fitting line through {(t_i, x_j^i)}_{i=1}^m had positive slope, the situation was classified as dangerous and harmless otherwise. This generation rule for our synthetic ground truth was not communicated to the SVM however, which only received the labeled training examples and had to infer the rule by itself. Notice that even for the trivial finite dimensional kernel k(x_i, x_j) = ⟨x_i, x_j⟩_{ℝᵐ} the limit performance should be almost perfect separation since the underlying true classification rule is

Ax ≥ 0 ⇒ y(x) = +1
Ax < 0 ⇒ y(x) = −1

where A : ℝᵐ → ℝ is a linear operator consisting of a concatenation of line fitting and calculation of the derivative of that line, both operations being linear in the data. Therefore Aχ_+ is linearly separable from Aχ_− in ℝ¹, and the underlying decision rule can be written as

⟨f̃, Ax_+⟩_ℝ ≥ 0
⟨f̃, Ax_−⟩_ℝ < 0

∀x_+ ∈ χ_+ and x_− ∈ χ_−, where f̃ is any nonzero number. This implies for f = Aᵀf̃ ∈ ℝᵐ the equivalent decision rule

⟨f, x_+⟩_{ℝᵐ} ≥ 0
⟨f, x_−⟩_{ℝᵐ} < 0

∀x_+ ∈ χ_+ and x_− ∈ χ_−, because ⟨f̃, Ax⟩_ℝ = ⟨Aᵀf̃, x⟩_{ℝᵐ} for any A : ℝᵐ → ℝ. For a simple example like this, embeddings into infinite dimensional H_k are unnecessary. When the underlying classification rule (= failure mechanism in our example) is complicated or unknown and danger assessment is demanded based only on a sequence of measurements somewhat correlated with the reasons for critical behaviour, they may however prove helpful. [32] provide some exemplary applications that go into this direction and demonstrate the usefulness of including kernel-based nonlinearities into estimation.

It is possible to establish that the inner-product-based decision rule for linear SVM's is the same as the Bayes rule

log(f_{Y|X}(y = +1|x) f_{Y|X}^{-1}(y = −1|x)) ≷ 0  ⇒  ŷ(x) = ±1

for some semiparametric probability density function f_Y whose parameters have been inferred via Maximum Likelihood estimation [11]. This is obviously a form of likelihood ratio test as employed for comparing two statistical models in classical hypothesis testing.

4 Conclusion and outlook

In this paper, we investigated the interface between geodetic data analysis and machine learning algorithms. It turned out that adjustment as used in the geodetic community can be interpreted as a learning algorithm via proper relabeling of the terms occurring in the optimization task arising during maximum likelihood estimation under assumption of Gaussianity. This was exemplified in a simple application, in which adjustment, geostatistics and splines were employed for regression and interpolation purposes. They were shown to essentially agree when applicable. A table was provided that served as a guideline to translate between adjustment theoretic and machine learning motivated treatments of estimation problems.

Apart from the different role of stochasticity in both fields, one of the main differences is the focus on high dimensional embeddings of data. It was outlined how infinite dimensional problems can be efficiently solved using kernels and some intuition was gathered by tackling a sequence of instructive albeit simple geodetic toy problems, not all of which were known to be easily solvable. The algorithms are shown to be demonstrably easy to implement with further examples freely available on GitHub.¹

¹ https://github.com/jemil-butt/ML_tutorials_geodesy
[21] F. Larkin, Gaussian measure in Hilbert space and applications in numerical analysis, Rocky Mountain J. Math., 2 (1972), pp. 379–422.
[22] T. M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[23] H. Moritz, Advanced least-squares methods, Reports of the Department of Geodetic Science, 175 (1972).
[24] H. Neuner, Model selection for system identification by means of artificial neural networks, Journal of Applied Geodesy, 6 (2012), pp. 117–124.
[25] J. Neveu, Processus aléatoires gaussiens, 1968.
[26] W. Niemeier, Ausgleichungsrechnung – Statistische Auswertemethoden, Walter de Gruyter, Berlin, 2008.
[27] S. J. Press, Applied Multivariate Analysis – Using Bayesian and Frequentist Methods of Inference, 2nd Edition, Courier Corporation, New York, 2012.
[28] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning, MIT Press, Cambridge, 2006.
[29] A. Reiterer, U. Egly, T. Vicovac, E. Mai, S. Moafipoor, D. Grejner-Brzezinska and C. Toth, Application of artificial intelligence in geodesy – a review of theoretical foundations and practical examples, Journal of Applied Geodesy, 4 (2010), pp. 201–217.
[30] B. Riedel and M. Heinert, An adapted support vector machine for velocity field interpolation at the Baota landslide, in Proc. Application of Artificial Intelligence in Engineering Geodesy – 1st International Workshop, Vienna, 2008, pp. 101–116.
[31] S. Roman, Advanced Linear Algebra, Springer, Berlin Heidelberg, 2007.
[32] B. Schoelkopf and A. J. Smola, Learning with Kernels – Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, Cambridge, 2002.
[33] G. Strang, Introduction to Linear Algebra, Wellesley-Cambridge Press, Wellesley, 2016.
[34] M. Wiering and M. v. Otterlo, Reinforcement Learning – State-of-the-Art, Springer, Berlin Heidelberg, 2012.
[35] B. Witte and P. Sparla, Vermessungskunde und Grundlagen der Statistik für das Bauwesen, Vde Verlag GmbH, Berlin, Offenbach, 2015.