SciKit GStat
Release 0.2.8
Mirko Mälicke
CHAPTER ONE
SciKit-GStat is a scipy-styled analysis module for geostatistics. It includes two base classes, Variogram and DirectionalVariogram. Both have a very similar interface and can compute experimental variograms and model variograms. The module makes use of a rich selection of semi-variance estimators and variogram model functions, while being extensible at the same time.
With version 0.2.4, the class SpaceTimeVariogram has been added. It computes space-time experimental variograms. However, space-time modeling is not implemented yet.
With version 0.2.5, the class OrdinaryKriging has been added. It is working and can be used. However, it is not documented yet, the arguments might still change, multiprocessing is not implemented, and the kriging algorithm is not yet very efficient.
Note: Scikit-gstat was rewritten in major parts. Most of the changes are internal, but the attributes and behaviour of the Variogram class have also changed substantially. A detailed description of the new version's usage will follow. The last version of the old Variogram class, 0.1.8, is kept in the version-0.1.8 branch on GitHub, but is not developed any further. It is not compatible with the current version.
CHAPTER TWO
HOW TO CITE
In case you use SciKit-GStat in other software or scientific publications, please reference this module. It is published and has a DOI. It can be cited as:
Mirko Mälicke, & Helge David Schneider. (2019, November 7). Scikit-GStat 0.2.6: A scipy flavoured geostatistical analysis toolbox written in Python. (Version v0.2.6). Zenodo. http://doi.org/10.5281/zenodo.3531816
2.1 Installation
The package can be installed directly from the Python Package Index or GitHub. The version on GitHub might be
more recent, as only stable versions are uploaded to the Python Package Index.
2.1.1 PyPI
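The install command was lost in extraction. The package is published on PyPI under the name scikit-gstat, so the standard pip call applies:

pip install scikit-gstat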
2.1.2 GitHub
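The corresponding command was lost as well. A sketch of a GitHub install, assuming the project's repository at github.com/mmaelicke/scikit-gstat:

git clone https://github.com/mmaelicke/scikit-gstat.git
cd scikit-gstat
pip install -e .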
2.1.3 Note
Depending on your OS, you might run into problems installing all requirements in a clean Python environment. These problems are usually caused by the scipy and numba packages, which might need to be compiled. From our experience, no problems should occur when an environment manager like anaconda is used. Then, the requirements can be installed like:
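A sketch of that approach; the exact package list is an assumption:

conda install numpy scipy numba
pip install scikit-gstat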
The main class of scikit-gstat is the Variogram. It can be imported directly from the module, which is called skgstat. The main class can easily be demonstrated on random data.
In [4]: plt.style.use('ggplot')
In [5]: np.random.seed(42)
In [7]: np.random.seed(42)
The Variogram needs at least an array of coordinates and an array of values on instantiation.
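The transcript cells shown here are incomplete; a minimal sketch that produces a comparable Variogram (array shapes and distribution parameters are assumptions, not the original values):

import numpy as np
from skgstat import Variogram

np.random.seed(42)
coordinates = np.random.randint(0, 100, (150, 2))
np.random.seed(42)
values = np.random.normal(10, 4, 150)

V = Variogram(coordinates=coordinates, values=values)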
In [10]: print(V)
spherical Variogram
-------------------
Estimator: matheron
Sill: 36.65
Nugget: 0.00
2.2.2 Plot
In [11]: V.plot()
Out[11]: <Figure size 800x500 with 2 Axes>
With version 0.2, the histogram plot can also be disabled. This is most useful when the binning method for the lag classes is changed from 'even' step classes to a 'uniform' distribution in the lag classes.
In [12]: V.set_bin_func('uniform')
In [13]: V.plot(hist=False)
Out[13]: <Figure size 800x400 with 1 Axes>
This user guide shall help you get started with the scikit-gstat package, along with a more general introduction to variogram analysis.
2.3.1 Introduction
General
This user guide part of scikit-gstat's documentation is meant to be a user guide to the functionality offered by the module, along with a more general introduction to geostatistical concepts. The main use case is to hand this description to students learning geostatistics, whenever scikit-gstat is used. But before introducing variograms, the more general question of what geostatistics actually is has to be answered.
Note: This user guide is meant to be an introduction to geostatistics. In case you are already familiar with the topic,
you can skip this section.
What is geostatistics?
The basic idea of geostatistics is to describe and estimate spatial correlations in a set of point data. While the main tool, the variogram, is quite easy to implement and use, a lot of assumptions underlie it. The typical application of geostatistics is interpolation. Therefore, although using point data, a basic concept is to understand these point data as a sample of a (spatially) continuous variable that can be described as a random field $rf$, or to be more precise, a Gaussian random field in many cases. The most fundamental assumption in geostatistics is that any two values $x_i$ and $x_{i+h}$ are more similar, the smaller $h$ is, where $h$ is a separating distance on the random field. In other words: close observation points will show higher covariances than distant points. In case this most fundamental conceptual assumption does not hold for a specific variable, geostatistics will not be the correct tool to analyse and interpolate this variable.
One of the easiest approaches to interpolate point data is IDW (inverse distance weighting). This technique is implemented in almost any GIS software. The fundamental conceptual model can be described as:

$$Z_u = \frac{\sum_{i}^{N} w_i \cdot Z(i)}{N}$$
where $Z_u$ is the value of $rf$ at a non-observed location with $N$ observations around it. These observations get weighted by the weight $w_i$, which can be calculated like:

$$w_i = \frac{1}{\|\overrightarrow{u x_i}\|}$$

where $u$ is the unobserved point and $x_i$ is one of the sample points. Thus, $\|\overrightarrow{u x_i}\|$ is the 2-norm of the vector between the two points: the Euclidean distance in the coordinate space (which by no means has to be limited to the $R^2$ case).
This basically describes a concept, where a value of the random field is estimated by a distance-weighted mean of
the surrounding points. As close points shall have a higher impact, the inverse distance is used and thus the name of
inverse distance weighting.
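The concept translates into very little code. The sketch below is not part of scikit-gstat (the helper name is hypothetical), and it uses the common form that normalizes by the sum of weights:

import numpy as np

def idw_estimate(u, points, values):
    # w_i = 1 / ||u - x_i||: inverse Euclidean distance to each sample
    weights = 1.0 / np.linalg.norm(points - u, axis=1)
    # distance-weighted mean of the surrounding observations
    return np.sum(weights * values) / np.sum(weights)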
In the case of geostatistics this basic model still holds, but is extended. Instead of deriving the weights exclusively from the separating distance, a weight will be derived from the variance over all values that are separated by a similar distance. This has the main advantage of incorporating the actual (co)variance found in the observations and basing the interpolation on this (co)variance, but comes at the cost of some strict assumptions about the statistical properties of the sample. Elaborating and assessing these assumptions is one of the main challenges of geostatistics.
Geostatistical Tools
Geostatistics is a wide field spanning a wide variety of disciplines, like geology, biology, hydrology or geomorphology. Each discipline defines its own set of tools, and apparently its own definitions, and progress is made to this day. It is not the objective of scikit-gstat to be a comprehensive collection of all available tools. That would only be possible if professionals from each discipline contributed to the project. The objective is rather to offer some common tools and thereby simplify the process of geostatistical analysis and tool development. However, one can split geostatistics into three main fields, each with its own tools:
• variography: with the variogram being the main tool, variography focuses on describing, visualizing and modelling covariance structures in space and time.
• kriging: an interpolation method that utilizes a variogram to find suitable estimation weights, as shown in the section above.
• geostatistical simulation: aims at generating random fields that fit a given set of observations or a pre-defined variogram.
Note: I am planning to implement common tools from all three fields. However, up to now, I am only focusing on
variograms and no field generators or kriging procedures are available.
2.3.2 Variography
The variogram
General
We start by constructing a random field and sampling it. Without knowing about random field generators, an easy way to go is to stick two trigonometric functions together and add some noise. There should be a clear spatial correlation apparent.
In [1]: import numpy as np
In [3]: plt.style.use('ggplot')
In [6]: np.random.seed(42)
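The construction cells were lost in extraction. A sketch following the description above, sticking two trigonometric functions together and adding some noise (all constants are assumptions):

xx, yy = np.mgrid[0:100, 0:100]
field = np.sin(xx / 10.0) * np.cos(yy / 10.0)
field = field + np.random.normal(0, 0.15, size=field.shape)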
Using scikit-gstat
It’s now easy and straightforward to calculate a variogram using scikit-gstat. We need to sample the field and pass the coordinates and values to the Variogram class.
# random coordinates
In [12]: np.random.seed(42)
In [16]: V.plot()
Out[16]: <Figure size 800x500 with 2 Axes>
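The sampling cells were likewise lost; a sketch (the sample size is an assumption):

from skgstat import Variogram

np.random.seed(42)
coords = np.random.randint(0, 100, (150, 2))
values = np.array([field[x, y] for x, y in coords])

V = Variogram(coordinates=coords, values=values)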
From my personal point of view, there are three main issues with this approach:
• If one is not a geostatistics expert, one has no idea what one actually did and can see in the presented figure.
• The figure includes a spatial model, and one has no idea whether this model is suitable and fits the observations (wherever they are in the figure) sufficiently.
• Refer to the __init__ method of the Variogram class. There are 10+ arguments that can be set optionally. The default values will most likely not fit your data and requirements.
Therefore one will have to understand how the Variogram class works, along with some basic knowledge about variography, in order to be able to properly use scikit-gstat.
However, what we can discuss from the figure is what a variogram actually is. At its core it relates a dependent variable to an independent variable and, in a second step, tries to describe this relationship with a statistical model. This model on its own describes some of the spatial properties of the random field and can further be utilized in an interpolation to select nearby points and weight them based on their statistical properties.
The variogram relates the separating distance between two observation points to a measure of variability of values at that given distance. Our expectation is that variance increases with distance, which can basically be seen in the presented figure.
Distance
Consider the variogram figure from above, with which an independent and a dependent variable were introduced. In statistics it is common to use dependent variable as an alias for target variable, because its value is dependent on the state of the independent variable. In the case of a variogram, this is the metric of variance on the y-axis. The independent variable is a measure of (usually Euclidean) distance.
Considering observations taken in the environment, it is fairly unlikely to find two pairs of observations where the separating distance between the coordinates matches exactly the same value. Therefore it is useful to group all point pairs at the same distance lag together into one group, or bin. Beside practicability, there is also another reason why one would want to group point pairs at similar separating distances together into one bin. This becomes obvious when one plots the difference in value over the distance for all point pair combinations that can be formed for a given sample. The Variogram class has a function for that: distance_difference_plot:
In [17]: V.distance_difference_plot()
Out[17]: <Figure size 800x600 with 1 Axes>
While it is possible to see the increasing variability with increasing distance here quite nicely, it is not possible to guess meaningful moments for the distributions within the bins. Last but not least, to derive a simple model, as presented in the variogram figure above by the green line, we have to be able to compress all values at a given distance lag into one estimation of variance. This would not be possible from the figure above.
Note: There are also procedures that can fit a model directly based on unbinned data. As none of these methods is
implemented into scikit-gstat, they will not be discussed here. If you need them, you are more than welcome to
Binning the separating distances into distance lags is therefore a crucial and most important task in variogram analysis. The final binning must discretize the distance lag at a meaningful resolution at the scale of interest, while still holding enough members in each bin to make valid estimations. Often this is a trade-off relationship and one has to find a suitable compromise.
Before diving into binning, we have to understand how the Variogram class handles distance data. The distance calculation can be controlled by the dist_func argument, which takes either a string or a function. The default value is 'euclidean'. This value is directly passed down to scipy's pdist function as the metric argument. Consequently, the distance data is stored as a distance matrix for all input locations passed to Variogram on instantiation. To be more precise, only the upper triangle is stored in an array with the distance values sorted row-wise. Consider this very straightforward set of locations:
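The cell defining the locations was lost in extraction. The four corners of a unit square are consistent with the distance array shown below (the value array is an assumption):

from scipy.spatial.distance import squareform
from skgstat import Variogram

coords = [(0, 0), (0, 1), (1, 1), (1, 0)]
values = [1.0, 2.0, 3.0, 4.0]
V = Variogram(coords, values)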
In [20]: V.distance
Out[20]: array([1. , 1.414, 1. , 1. , 1.414, 1. ])
In [22]: print(squareform(V.distance))
[[0. 1. 1.414 1. ]
[1. 0. 1. 1.414]
[1.414 1. 0. 1. ]
[1. 1.414 1. 0. ]]
Binning
As already mentioned, in real world observation data there will hardly be two observation location pairs at exactly the same distance. Thus, we need to group information about point pairs at similar distances together to learn how similar their observed values are. With a Variogram, we will basically try to find and describe some systematic statistical behavior from these similarities. The process of grouping distance data together is called binning.
scikit-gstat has two different methods for binning distance data. They can be set using the bin_func attribute. You have to pass the name of the method, which has to be one of ['even', 'uniform'] to use one of the predefined binning functions. Both methods use two parameters to calculate the bins from the distance matrix: n, the amount of bins, and maxlag, the maximum distance lag to be considered. You can choose both parameters during Variogram instantiation as n_lags and maxlag. The 'even' method will then form n bins from 0 to maxlag of the same width. The 'uniform' method will form n bins from 0 to maxlag with the same value count in each bin. The following example should illustrate this:
Now, look at the different bin edges for the calculated dummy distance matrix:
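A sketch of such a comparison (the dummy data is an assumption):

import numpy as np
from skgstat import Variogram

np.random.seed(42)
coords = np.random.gamma(10, 4, (100, 2))
values = np.random.normal(10, 2, 100)

# 5 evenly spaced bins up to a maxlag of 50
V = Variogram(coords, values, n_lags=5, maxlag=50, bin_func='even')
print(V.bins)

# same amount of bins, but with a uniform point-pair count per bin
V.set_bin_func('uniform')
print(V.bins)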
Observation differences
By the term observation differences, the distance between the observed values are meant. As already layed out, the
main idea of a variogram is to systematially relate similarity of observations to their spatial proximity. The spatial
part was covered in the sections above, finalized with the calculation of a suitable binning of all distances. We want to
relate exactly these bins to a measure of similarity of all observation point pairs that fall into this bin.
That’s basically it. We need to do three more steps to come up with one value per bin, statistically describing the
similarity at that distance.
1. Find all point pairs that fall into a bin
2. Calculate the distance (difference) of the observed values
3. Describe all differences by one number
Finding all pairs within a bin is straightforward. We already have the bin edges and all distances between all possible observation point combinations (stored in the distance matrix). Using the squareform function of scipy, we could turn the distance matrix into a 2D version. Then the row and column indices would align with the value indices. However, the Variogram class implements a method that does this mapping a bit more efficiently.
Note: As of this writing, the actual iterator that yields the group number for each point is written in a plain Python
loop. This is not very fast and in fact the main bottleneck of this class. I am evaluating numba, cython or a numpy
based solution at the moment to gain better performance.
An array of bin groups for each point pair, indexed exactly like the Variogram.distance array, can be obtained by calling Variogram.lag_groups().
This will be illustrated by some sample data (you can find the CSV file in the github repository of SciKit-GStat). You
can easily read the data using pandas.
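A sketch of that step (the file and column names are hypothetical):

import pandas as pd
from skgstat import Variogram

df = pd.read_csv('sample.csv')
V = Variogram(df[['x', 'y']].values, df.z.values)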
In [32]: V.plot()
Out[32]: <Figure size 800x500 with 2 Axes>
Then, you can compare the first 10 point pairs from the distance matrix to the first 10 elements returned by the
lag_groups function.
# first 10 distances
In [33]: V.distance[:10]
Out[33]:
array([20.809, 51.478, 27.203, 44.721, 91.608, 71.784, 87.693, 85.446,
18.788, 23.195])
# first 10 groups
In [34]: V.lag_groups()[:10]
Out[34]: array([ 8, 21, 11, 18, -1, -1, -1, -1, 7, 9])
In [35]: V.bins
Out[35]:
array([ 2.4, 4.8, 7.2, 9.6, 12. , 14.4, 16.8, 19.2, 21.6, 24. , 26.4,
28.8, 31.2, 33.6, 36. , 38.4, 40.8, 43.2, 45.6, 48. , 50.4, 52.8,
55.2, 57.6, 60. ])
The first element is grouped into group 8 and the 9th into group 7; their distances are 20.8 and 18.8. The grouping starts with 0, so the upper bound of a group's bin sits at the same index in the bins array and the lower bound at the index before. For group 8 that gives 19.2 < 20.8 <= 21.6, and for group 7, 16.8 < 18.8 <= 19.2. Consequently, the binning and grouping worked fine.
If you want to access all value pairs of a given group, it would of course be possible to use the mechanism above to find the correct points. However, the Variogram class offers an iterator that already does that for you: lag_classes. This iterator will yield all pair-wise observation value differences for the bin of the current iteration. The first iteration (index = 0, if you wish) will yield all differences of group id 0.
Note: lag_classes will yield the difference in value of observation point pairs, not the pairs themselves.
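A short usage sketch of the iterator, assuming the Variogram V from above:

import numpy as np

# mean absolute pair-wise difference per lag class
for i, diffs in enumerate(V.lag_classes()):
    print(i, np.mean(np.abs(diffs)))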
The only thing that is still missing for a variogram is that we will not use the arithmetic mean to describe the relationship.
Experimental variograms
The last stage before a variogram function can be modeled is to define an empirical variogram, also known as experimental variogram, which will be used to parameterize a variogram model. However, the experimental variogram already contains a lot of information about spatial relationships in the data. Therefore, it's worth looking at it more closely. Last but not least, a poor experimental variogram will also affect the variogram model, which is ultimately used to interpolate the input data.
The previous sections summarized how distance is calculated and handled by the Variogram class. The lag_groups function makes it possible to find corresponding observation value pairs for all distance lags. Finally, the last step will be to use a more suitable estimator for the similarity of observation values at a specific lag. In geostatistics this estimator is called semi-variance, and the most popular estimator is the Matheron estimator. Whenever the estimator used is not further specified, the Matheron estimator was used. It is defined as
$$\gamma(h) = \frac{1}{2N(h)} \sum_{i=1}^{N(h)} x^2$$

with:

$$x = Z(x_i) - Z(x_{i+h})$$
where $Z(x_i)$ is the observation value at the i-th location $x_i$, $h$ is the distance lag and $N(h)$ is the number of point pairs at that lag.
You will find more estimators in skgstat.estimators. There is the Cressie-Hawkins estimator, which is more robust to extreme values. Other so-called robust estimators are Dowd or Genton. The remaining ones are experimental estimators and should only be used with caution.
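Switching the estimator on an existing Variogram is a one-liner; a sketch:

V.estimator = 'cressie'   # or 'dowd', 'genton'
V.plot()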
In [46]: fig.show()
Variogram models
The last step to describe the spatial pattern in a data set using variograms is to model the empirically observed and calculated experimental variogram with a proper mathematical function. Technically, this step is straightforward. We need to define a function that takes a distance value (not a lag) and returns a semi-variance value. One big advantage of these models is that we can assure different things, like positive definiteness. Most models are also monotonically increasing and approach an upper bound. Usually these models need three parameters to fit to the experimental variogram. All three parameters have a meaning and are useful to learn something about the data. The upper bound a model approaches is called the sill. The distance at which 95% of the sill is approached is called the range. That means the range is the distance at which observation values do not become more dissimilar with increasing distance; they are statistically independent. That also means it doesn't make any sense to describe spatial relationships of observations further apart by means of geostatistics. The last parameter is the nugget. It is used to add semi-variance to all values. Graphically that means moving the variogram up on the y-axis. The nugget is the semi-variance modeled at the 0-distance lag. Compared to the sill, it is the share of variance that cannot be described spatially.
The spherical model is the most commonly used variogram model. It is characterized by a very steep increase in semi-variance, which means it approaches the sill quite quickly. It can be used when observations show a strong dependency on short distances. It is defined like:

$$\gamma = b + C_0 \cdot \left(1.5 \cdot \frac{h}{r} - 0.5 \cdot \frac{h^3}{r^3}\right)$$

if $h < r$, and

$$\gamma = b + C_0$$
else. Here, $b$ is the nugget, $C_0$ is the sill, $h$ is the input distance lag and $r$ is the effective range. That is the range described above, which describes the correlation length. Many other variogram implementations might use the range parameter directly instead. This is a bit confusing, as the range parameter is specific to the used model. Therefore I decided to directly use the effective range as a parameter, as that makes more sense in my opinion.
As we already calculated an experimental variogram and find the spherical model in the skgstat.models sub-
module, we can utilize e.g. curve_fit from scipy to fit the model using a least squares approach.
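The fitting cell was lost in extraction; a sketch of the described approach (the parameter wrapping is an assumption, see the technical notes at the end of this documentation):

from scipy.optimize import curve_fit
from skgstat import models

# experimental variogram: lag class centers and semi-variances
xdata = V.bins
ydata = V.experimental

# wrap the model so curve_fit can infer the parameters r (range) and c0 (sill)
cof, cov = curve_fit(lambda h, r, c0: models.spherical(h, r, c0), xdata, ydata)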
Here, cof are now the coefficients found to fit the model to the data.
The Variogram class does in principle the same thing. The only difference is that it tries to find a good initial guess for the parameters and limits the search space for the parameters. That should make the fitting more robust. Technically, we used the Levenberg-Marquardt algorithm above. Variogram can be forced to use the same by setting Variogram.fit_method to 'lm'. The default, however, is 'trf', which is the Trust Region Reflective algorithm: the bounded fit with initial guesses described above. You can use it like:
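The cell setting the method was lost in extraction; a sketch (the cell number is assumed, and In [62] presumably switched back to 'trf' before the second plot):

In [59]: V.fit_method = 'lm'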
In [60]: V.plot();
In [61]: pprint(V.describe())
{'effective_range': 36.78585313528298,
 'estimator': 'matheron',
 'name': 'spherical',
 'nugget': 0,
 'sill': 1.2473949087006804}
In [63]: V.plot();
In [64]: pprint(V.describe())
{'effective_range': 36.78585313528298,
'estimator': 'matheron',
'name': 'spherical',
'nugget': 0,
'sill': 1.2473949087006804}
Note: In this example, the fitting method does not make a difference at all. Generally, you can say that Levenberg-
Marquardt is faster and TRF is more robust.
Exponential model
The exponential model is quite similar to the spherical one. It models semi-variance values to increase exponentially with distance, like the spherical. The main difference is that this increase is not as steep as for the spherical. That means the effective range is larger for an exponential model that was parameterized with the same range parameter.
Note: Remember that SciKit-GStat uses the effective range to overcome this confusing behaviour.
Consequently, the exponential model can be used for data whose spatial correlation extent is way too large for a spherical model to capture.
Applied to the data used so far, you can see the similarity between the two models:
Gaussian model
The last fundamental variogram model is the Gaussian. Unlike the spherical and exponential models, it assumes a very different spatial relationship between semi-variance and distance. Following the Gaussian model, observations are assumed to be similar up to intermediate distances, showing just a gentle increase in semi-variance. Then, the semi-variance increases dramatically within just a few distance units up to the sill, which is again approached asymptotically. The model can be used to simulate very sudden and sharp changes in the variable at a specific distance, while being very similar at smaller distances.
To show a typical Gaussian model, we will load another sample dataset.
In [72]: Vg.plot();
Matérn model
One of the not so commonly used models is the Matérn model. It is nevertheless implemented into scikit-gstat as it is one of the most powerful models, especially in cases where you cannot choose the appropriate model a priori so easily. The Matérn model takes an additional smoothness parameter that can change the shape of the function between an exponential model shape and a Gaussian one.
What is ‘direction’?
Space-time variography
2.3.3 Interpolation
Spatial interpolation
In geostatistics the procedure of spatial interpolation is known as Kriging. That goes back to the inventor of Kriging, a South African mining engineer called Danie Krige. He published the method in 1951. In many textbooks you will also find the term prediction, but be aware that Kriging is still based on the assumption that the variable is a random field. Therefore I prefer the term estimation and would label the Kriging method a BLUE (Best Linear Unbiased Estimator). In general terms, the objective is to estimate a variable at a location that was not observed, using observations from close locations. Kriging is considered to be the best estimator, because we utilize the spatial structure described by the variogram to find suitable weights for averaging the observations at close locations.
Given a set of observation points $s$ and observation values at these locations $Z(s)$, it can already be stated that the estimation at an unobserved location, $Z^*(s_0)$, is a weighted mean:

$$Z^*(s_0) = \sum_{i=0}^{N} \lambda_i Z(s_i)$$
where $N$ is the size of $s$ and $\lambda$ is the array of weights. This is what we want to calculate from a fitted variogram model. Assuming that $\lambda$ has already been calculated, the estimation is pretty straightforward:
In [1]: Z_s = np.array([4.2, 6.1, 0.2, 0.7, 5.2])
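The cells defining the weights were lost in extraction. A weight vector that sums to one and is consistent with the result below would be (an assumed reconstruction, not the original values):

In [2]: lam = np.array([0.1, 0.3, 0.1, 0.1, 0.4])

In [3]: estimate = sum([w * z for w, z in zip(lam, Z_s)])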
or shorter:
In [4]: Z_s.dot(lam)
Out[4]: 4.42
In the example above the weights were just made up. Now we need to understand how this array of weights can be
calculated.
Instead of just making up weights, we will now learn how we can utilize a variogram model to calculate the weights.
At its core a variogram describes how point observations become more dissimilar with distance. Point distances can
easily be calculated, not only for observed locations, but also for unobserved locations. As the variogram is only a
function of distance, we can easily calculate a semi-variance value for any possible combination of point pairs.
Assume we have five close observations for an unobserved location, like in the example above. Instead of making
up weights, we can use the semi-variance value as a weight, as a first shot. What we still need are locations and a
variogram model. For both, we can just make something up.
In [5]: x = np.array([4.0, 2.0, 4.1, 0.3, 2.0])
In [10]: squareform(distance_matrix)
Out[10]:
array([[0. , 4.031, 0.8 , 2.702, 1.7 , 0.5 ],
[4.031, 0. , 4.742, 1.803, 5.093, 3.606],
[0.8 , 4.742, 0. , 3.265, 1.879, 1.3 ],
[2.702, 1.803, 3.265, 0. , 4.163, 2.419],
[1.7 , 5.093, 1.879, 4.163, 0. , 1.772],
[0.5 , 3.606, 1.3 , 2.419, 1.772, 0. ]])
Next, we build a variogram model of spherical shape that uses an effective range larger than the distances in the matrix. Otherwise, we would just calculate the arithmetic mean.
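The model definition cell was lost in extraction. A spherical model with an effective range of 7 and a sill of 2 is consistent with the semi-variances printed further below (both values are inferred from the outputs, not original):

In [11]: from skgstat import models

In [12]: def model(h):
   ....:     return models.spherical(h, 7.0, 2.0)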
The distances to the first point $s_0$ are the first 5 elements in the distance matrix. Therefore the semi-variances are calculated straightforwardly.
In [13]: variances = model(distance_matrix[:5])
Of course we could now use the inverse of these semi-variances to weigh the observations, but that would not be correct. Remember that this array variances is what we want the target weights to incorporate. Whatever the weights are, these variances should be respected. At the same time, the five points among each other also have distances and therefore variances that should be respected. Or to put it differently: take the first observation point $s_1$. The associated variances $\gamma$ to the other four points need to match the one just calculated:

$$a_1 \gamma(s_1, s_1) + a_2 \gamma(s_1, s_2) + a_3 \gamma(s_1, s_3) + a_4 \gamma(s_1, s_4) + a_5 \gamma(s_1, s_5) = \gamma(s_0, s_1)$$
Ok. First: $\gamma(s_1, s_1)$ is zero, because the distance is obviously zero and the model does not have a nugget. All other distances have already been calculated. $a_1 \ldots a_5$ are factors; these are the weights used to satisfy all given semi-variances. This is what we need. Obviously, we cannot calculate 5 unknown variables from just one equation. Luckily we have four more observations, so we can write the above equation for $s_2, s_3, s_4, s_5$ as well. Additionally, we will write the linear equation system in matrix form as a dot product of the $\gamma_i$ and the $a_i$ parts.
$$\begin{pmatrix}
\gamma(s_1,s_1) & \gamma(s_1,s_2) & \gamma(s_1,s_3) & \gamma(s_1,s_4) & \gamma(s_1,s_5) \\
\gamma(s_2,s_1) & \gamma(s_2,s_2) & \gamma(s_2,s_3) & \gamma(s_2,s_4) & \gamma(s_2,s_5) \\
\gamma(s_3,s_1) & \gamma(s_3,s_2) & \gamma(s_3,s_3) & \gamma(s_3,s_4) & \gamma(s_3,s_5) \\
\gamma(s_4,s_1) & \gamma(s_4,s_2) & \gamma(s_4,s_3) & \gamma(s_4,s_4) & \gamma(s_4,s_5) \\
\gamma(s_5,s_1) & \gamma(s_5,s_2) & \gamma(s_5,s_3) & \gamma(s_5,s_4) & \gamma(s_5,s_5)
\end{pmatrix} \cdot \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \\ a_5 \end{bmatrix} = \begin{pmatrix} \gamma(s_0,s_1) \\ \gamma(s_0,s_2) \\ \gamma(s_0,s_3) \\ \gamma(s_0,s_4) \\ \gamma(s_0,s_5) \end{pmatrix}$$
That might look a bit complicated at first, but we have calculated almost everything. The last matrix contains the variances that we calculated in the previous step. The first matrix is of the same shape as the squareform distance matrix calculated at the very beginning. All we need to do is to map the variogram model onto it and solve the system for the vector of factors $a_1 \ldots a_5$. In Python, there are several strategies for solving this problem. Let's first build the matrix. We need a distance matrix without $s_0$ for that.
In [15]: dists = pdist(list(zip(x,y)))
In [16]: M = squareform(model(dists))
In [17]: pprint(M)
array([[0. , 1.721, 0.756, 1.798, 1.409],
[1.721, 0. , 1.298, 0.786, 0.551],
[0.756, 1.298, 0. , 1.574, 0.995],
[1.798, 0.786, 1.574, 0. , 0.743],
[1.409, 0.551, 0.995, 0.743, 0. ]])
In [18]: pprint(variances)
array([1.537, 0.341, 1.1 , 0.714, 0.214])
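The solving cells were lost in extraction; a reconstruction (any linear solver works here):

In [19]: from scipy.linalg import solve

In [20]: a = solve(M, variances)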
In [21]: pprint(a)
array([-0.022, 0.362, 0.018, 0.037, 0.593])
# calculate estimation
In [22]: Z_s.dot(a)
Out[22]: 5.226267185422778
That’s it. Well, not really. We might have used the variogram and the spatial structure inferred from the data to get better results, but in fact our result is not unbiased. That means the solver can choose any combination that satisfies the equation, even setting everything to zero except one weight. That means $a$ could be biased, which would not be helpful.
In [23]: np.sum(a)
Out[23]: 0.9872744357166217
In the last section we came pretty close to the Kriging algorithm. The only thing missing is to assure unbiasedness. The weights sum up to almost one, but they are not exactly one. We want to ensure that they are always one. This is done by adding one more equation to the linear equation system. Also, we will rename the $a$ array to $\lambda$, which is more frequently used for Kriging weights. The missing equation is:

$$\sum_{i=1}^{N} \lambda_i = 1$$
This is the Kriging equation system for Ordinary Kriging that can be found in textbooks: the ones are added to the result array and to the matrix of semi-variances. $\mu$ is a Lagrange multiplier that will be used to estimate the Kriging variance, which will be covered later. Ordinary Kriging still assumes the observations and their residuals to be normally distributed, and requires second-order stationarity.
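The cells building the extended system were lost in extraction; a sketch consistent with the description above:

In [30]: B = np.concatenate((variances, [1.0]))

In [31]: M_ext = np.ones((6, 6))
   ....: M_ext[:5, :5] = M
   ....: M_ext[5, 5] = 0.0

In [32]: weights = solve(M_ext, B)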
In [33]: np.sum(weights[:-1])
Out[33]: 1.0
The estimation did not change a lot, but the weights perfectly sum up to one now.
Kriging error
In the last step, we introduced a factor $\mu$. It was needed to solve the linear equation system while assuring that the weights sum up to one. This factor can in turn be added to the weighted target semi-variances used to build the equation system, to obtain the Kriging error.
This is really useful when a whole map is interpolated. Using Kriging, you can also produce a map showing in which regions the interpolation is more certain.
Example
We can use the data shown in the variography section to finally interpolate the field and check the Kriging error. You could either build a loop around the code shown in the previous section, or just use skgstat.
In [37]: V.plot()
Out[37]: <Figure size 800x500 with 2 Axes>
The OrdinaryKriging class needs at least a fitted Variogram instance. Using min_points, we can demand the Kriging equation system to be built upon at least 5 points, to yield robust results. If not enough close observations are found within the effective range of the variogram, the estimation will not be calculated and a np.NaN value is returned instead.
The max_points parameter will set the upper bound of the equation system by using, in this case, at most the 20 nearest points. Adding more will most likely not change the estimation, as additional points will receive small, if not negligible, weights. But it will increase the processing time, as each added point increases the dimensionality of the Kriging equation system by one.
The mode parameter sets the method that builds up the equation system. There are two implemented: mode='exact' and mode='estimate'. Estimate is much faster, but if not used carefully, it can lead to numerical instability quite quickly. In the technical notes section of this user guide, you will find a whole section on the two modes.
Finally, we need the unobserved locations. The observations in the file were drawn from a 100x100 random field.
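The kriging cells were lost in extraction; a sketch consistent with the reshaping below (cell numbers assumed; parameter values taken from the text above):

In [38]: ok = OrdinaryKriging(V, min_points=5, max_points=20, mode='exact')

In [39]: xx, yy = np.mgrid[0:100, 0:100]

In [40]: field = ok.transform(xx.flatten(), yy.flatten())

In [41]: field = field.reshape(xx.shape)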
In [42]: s2 = ok.sigma.reshape(xx.shape)
2.4 Tutorials
The tutorials are designed to get you started quickly. It is assumed that the tutorials are used together with the user guide. The user guide is way more texty, while the tutorials are focused on code.
The main application for scikit-gstat is variogram analysis and Kriging. This tutorial will guide you through the most basic functionality of scikit-gstat. There are other tutorials that will explain specific methods or attributes in scikit-gstat in more detail.
The Variogram and OrdinaryKriging classes can be imported directly from skgstat, which is the name of the Python module.
In the current version, there are some deprecated attributes and methods in the Variogram class. They do not raise DeprecationWarnings, but rather print a warning message to the screen. You can suppress this warning by setting an SKG_SUPPRESS environment variable.
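For example (assuming only the presence of the variable is checked):

import os
os.environ['SKG_SUPPRESS'] = '1'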
You can find a prepared example data set in the ./data subdirectory. This example is extracted from a generated
Gaussian random field. We can expect the field to be stationary and show a nice spatial dependence, because it was
created that way. We can load one of the examples and have a look at the data:
Get a first overview of your data by plotting the x and y coordinates and visually inspecting how the z values spread out.
As a quick reminder, the variogram takes the pair-wise separating distances of the coordinates and relates them to the semi-variance of the corresponding value pairs. The default estimator used is the Matheron estimator:
$$\gamma(h) = \frac{1}{2N(h)} \sum_{i=1}^{N(h)} (Z(x_i) - Z(x_{i+h}))^2$$
fig = V.plot(show=False)
The upper subplot shows the histogram of the point-pair count in each lag class. You can see various things here:
• As expected, there is a clear spatial dependency, because the semi-variance increases with distance (blue dots)
• The default spherical variogram model is well fitted to the experimental data
• The shape of the dependency is not captured quite well, but fair enough for this example
The sill of the variogram should correspond with the field variance. The field is unknown, but we can compare the sill
to the sample variance:
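A sketch of that comparison (the data frame name is hypothetical):

print('Sill: %.2f' % V.describe()['sill'])
print('Sample variance: %.2f' % df.z.var())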
The describe method will return the most important parameters as a dictionary. And we can simply print the variogram object to the screen to see all parameters.
[8]: pprint(V.describe())
{'effective_range': 39.50027313170537,
'estimator': 'matheron',
'name': 'spherical',
'nugget': 0,
'sill': 1.2553698556802062}
[9]: print(V)
spherical Variogram
-------------------
Estimator: matheron
Effective Range: 39.50
Sill: 1.26
Nugget: 0.00
1.3 Kriging
The Kriging class will now use the Variogram from above to estimate the Kriging weights for each grid cell. This is
done by solving a linear equation system. For an unobserved location 𝑠0 , we can use the distances to 5 observation
points and build the system like:
$$\begin{pmatrix}
\gamma(s_1,s_1) & \gamma(s_1,s_2) & \gamma(s_1,s_3) & \gamma(s_1,s_4) & \gamma(s_1,s_5) & 1 \\
\gamma(s_2,s_1) & \gamma(s_2,s_2) & \gamma(s_2,s_3) & \gamma(s_2,s_4) & \gamma(s_2,s_5) & 1 \\
\gamma(s_3,s_1) & \gamma(s_3,s_2) & \gamma(s_3,s_3) & \gamma(s_3,s_4) & \gamma(s_3,s_5) & 1 \\
\gamma(s_4,s_1) & \gamma(s_4,s_2) & \gamma(s_4,s_3) & \gamma(s_4,s_4) & \gamma(s_4,s_5) & 1 \\
\gamma(s_5,s_1) & \gamma(s_5,s_2) & \gamma(s_5,s_3) & \gamma(s_5,s_4) & \gamma(s_5,s_5) & 1 \\
1 & 1 & 1 & 1 & 1 & 0
\end{pmatrix} \cdot \begin{bmatrix} \lambda_1 \\ \lambda_2 \\ \lambda_3 \\ \lambda_4 \\ \lambda_5 \\ \mu \end{bmatrix} = \begin{pmatrix} \gamma(s_0,s_1) \\ \gamma(s_0,s_2) \\ \gamma(s_0,s_3) \\ \gamma(s_0,s_4) \\ \gamma(s_0,s_5) \\ 1 \end{pmatrix}$$
The transform method will apply the interpolation for passed arrays of coordinates. It requires each dimension as a single 1D array. We can easily build a meshgrid of 100x100 coordinates and pass them to the interpolator. To receive a 2D result, we can simply reshape the result. The Kriging error will be available as the sigma attribute of the interpolator.
From the Kriging error map, you can see how the interpolation is very certain close to the observation points, while the error is rather high in areas with only little coverage (like the upper left corner).
This tutorial will guide you through the theoretical variogram models available for the Variogram class.
In this tutorial you will learn:
• how to choose an appropriate model function
• how to judge fitting quality
• about sample size influence
[1]: from skgstat import Variogram, OrdinaryKriging
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
There are three prepared data sets in the ./data folder. Each of them is a generated random field with different
underlying spatial properties. We will use only the first one, but you can re-run all the examples with any of the other
fields.
One of the features of scikit-gstat is the fact that it is programmed in an object-oriented way. That means we can just instantiate a Variogram object and keep changing arguments until it models the spatial dependency in our observations well.
The data set includes 200 observations, consequently we can increase the number of lag classes. Additionally, the histogram shows that the lags over 100 units are backed up by just a few observations. Thus, we can limit the lag classes to a maximum lag of 100 units.
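A sketch of such an instantiation (the exact parameter values and data names are assumptions):

V1 = Variogram(df[['x', 'y']].values, df.z.values, n_lags=25, maxlag=100)
V1.plot(show=False)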
That’s not too bad. Now we can try different theoretical models. It is always a good idea to judge the fit visually, especially because we want it to fit the close bins more accurately than the distant bins, as those will ultimately determine the Kriging weights. Nevertheless, Variogram has an rmse and an r2 property that can be used as a quality measure for the fit. The Variogram.plot function also accepts one or two matplotlib subplot axes to plot the lag class histogram and the variogram plot into them. The histogram can also be turned off.
fig, _a = plt.subplots(2, 3, figsize=(18, 12), sharex=True, sharey=True)
axes = _a.flatten()
for i, model in enumerate(('spherical', 'exponential', 'gaussian', 'matern', 'stable', 'cubic')):
    V1.model = model
    V1.plot(axes=axes[i], hist=False, show=False)
    axes[i].set_title('Model: %s; RMSE: %.2f' % (model, V1.rmse))
e:\dropbox\python\scikit-gstat\skgstat\models.py:17: RuntimeWarning: invalid value encountered in double_scalars
This is quite important. We find all 6 models to describe the experimental variogram equally well in terms of RMSE. However, the cubic and gaussian models are off the experimental values almost all the time. On short distances the model is underestimating, and on medium distances (up to the effective range) it is overestimating. The exponential model is overestimating all the time. The spherical, matern and stable models seem to be pretty good on short distances.
But what does this difference look like, when it comes to interpolation?
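The interpolate helper used below was defined in a cell lost to extraction. A minimal sketch consistent with its usage: it builds an OrdinaryKriging instance from the passed variogram, interpolates the full 100x100 grid, plots the result into the passed axis and returns the field (all parameter values are assumptions):

def interpolate(V, ax):
    ok = OrdinaryKriging(V, min_points=5, max_points=15)
    xx, yy = np.mgrid[0:100, 0:100]
    # krige the full grid and reshape the 1D result back to 2D
    field = ok.transform(xx.flatten(), yy.flatten()).reshape(xx.shape)
    ax.matshow(field, origin='lower', cmap='plasma')
    return field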
[10]: fields = []
fig, _a = plt.subplots(2, 3, figsize=(18, 12), sharex=True, sharey=True)
axes = _a.flatten()
for i, model in enumerate(('spherical', 'exponential', 'gaussian', 'matern', 'stable', 'cubic')):
    V1.model = model
    fields.append(interpolate(V1, axes[i]))
e:\dropbox\python\scikit-gstat\skgstat\models.py:14: RuntimeWarning: invalid value encountered in double_scalars
cubic
count 10000.000000
mean -0.412559
std 1.165185
While most of the results look fairly similar, there are a few things to notice:
1. The Gaussian model is far off, producing estimations far outside the observed value ranges
2. All other models produce quite comparable numbers
3. The Matérn model fails when recalculating the observations themselves
Keep in mind that we had quite a lot of observations. What does this look like when the number of observations is decreased?
In this section we will run the same code, but on just a quarter and 10% of all available observations. First, we look
into the variograms:
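The subsampling cell was lost in extraction; a sketch (names assumed; V3 for the 10% case further below is built the same way with size len(df) // 10):

np.random.seed(42)
idx = np.random.choice(len(df), size=len(df) // 4, replace=False)
V2 = Variogram(df[['x', 'y']].values[idx], df.z.values[idx], n_lags=25, maxlag=100)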
V2.plot(show=False);
[14]: fields = []
fig, _a = plt.subplots(2,3, figsize=(18, 12), sharex=True, sharey=True)
axes = _a.flatten()
for i, model in enumerate(('spherical', 'exponential', 'gaussian', 'matern', 'stable',
˓→ 'cubic')):
V2.model = model
fields.append(interpolate(V2, axes[i]))
cubic
count 10000.000000
mean -0.436488
std 1.093714
min -3.907538
25% -1.026079
50% -0.307099
75% 0.352761
max 1.973759
In this section we will run the same code again, but on just 10% of all available observations. First, we look into the variograms:
V3.plot(hist=False, show=False);
fig, _a = plt.subplots(2, 3, figsize=(18, 12), sharex=True, sharey=True)
axes = _a.flatten()
for i, model in enumerate(('spherical', 'exponential', 'gaussian', 'matern', 'stable', 'cubic')):
    V3.model = model
    V3.plot(axes=axes[i], hist=False, show=False)
    axes[i].set_title('Model: %s; RMSE: %.2f' % (model, V3.rmse))
In this example, we based the variogram analysis on only 20 observations. That is a number that could be considered the lower bound for geostatistics. The RMSE values are decreasing as the experimental variograms become more scattered. However, all six models seem to fit fairly well to the experimental data. It is hard to tell from just the figure above which one is correct.
[18]: fields = []
fig, _a = plt.subplots(2,3, figsize=(18, 12), sharex=True, sharey=True)
axes = _a.flatten()
for i, model in enumerate(('spherical', 'exponential', 'gaussian', 'matern', 'stable',
˓→ 'cubic')):
V3.model = model
fields.append(interpolate(V3, axes[i]))
Warning: for 144 locations, not enough neighbors were found within the range.
Warning: for 76 locations, not enough neighbors were found within the range.
Warning: for 214 locations, not enough neighbors were found within the range.
cubic
count 10000.000000
mean -1.058428
std 1.124708
min -4.364176
25% -1.698163
50% -1.004205
75% -0.275732
max 2.162505
3. In many runs, NaN values were produced, because not enough neighbors could be found
We decreased the number of observations so far that the max_points attribute came into effect. In the other cases the Kriging interpolator found so many close observations that limiting them to 15 had the effect that estimations were usually derived from observations within a few units. Now, even if enough points are within the range, we use observations from medium distances. Here, the different shapes of the models come into effect, as could be seen from the last example.
This tutorial focuses on experimental variograms. It will guide you through the main semi-variance estimators available in scikit-gstat. Additionally, most of the parameters available for building an experimental variogram will be discussed.
In this tutorial you will learn:
• what estimators are available
• how they differ
There are three prepared data sets in the ./data folder. Each of them is a generated random field with different underlying spatial properties. We will use only the third one, but you can re-run all the examples with any of the other fields.
The default estimator configured in Variogram is the Matheron estimator (Matheron, 1963). It is defined like:
$$\gamma(h) = \frac{1}{2N(h)} \sum_{i=1}^{N(h)} (Z(x_i) - Z(x_{i+h}))^2$$
where:
• $h$ is the distance lag
• $N(h)$ is the number of observation pairs in the $h$-lag class
• $Z(x_i)$ is the observation at the $i$-th location $x$
V1.plot(show=False);
Following the histogram, we should set a maxlag. This property accepts a number $0 < maxlag < 1$ to set the maxlag to this ratio of the maximum separating distance. A number > 1 will be used as an absolute limit. You can also pass 'mean' or 'median'. This will calculate and set the mean or median of all distances in the distance matrix as maxlag.
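A sketch of the three options:

V1.maxlag = 0.8        # 80% of the maximum pair-wise distance
V1.maxlag = 120        # an absolute limit of 120 units
V1.maxlag = 'median'   # median of all pair-wise distances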
scikit-gstat implements more than only the Matheron estimator. Setting estimator='cressie' will set the Cressie-Hawkins estimator. It is implemented as follows (Cressie and Hawkins, 1980):

$$2\gamma(h) = \frac{\left(\frac{1}{N(h)} \sum_{i=1}^{N(h)} |Z(x_i) - Z(x_{i+h})|^{0.5}\right)^4}{0.457 + \frac{0.494}{N(h)} + \frac{0.045}{N^2(h)}}$$
The Genton estimator (Genton, 1998) is based on an order statistic of the pair-wise differences; its order $k$ is defined as:

$$k = \binom{[N_h / 2] + 1}{2}$$

and:

$$q = \binom{N_h}{2}$$
For Kriging, the difference on the first few lag classes is important, as no points that lie outside the range will be used for estimation.
You can see from these results that the Cressie and the Dowd estimator pronounce the extreme values more. The
original field is also in the ./data folder. We can load it for comparison.
[10]: rf = np.loadtxt('data/rf_spherical_noise.txt')
fig, ax = plt.subplots(1, 1, figsize=(8,8))
art = ax.matshow(rf, origin='lower', cmap='plasma')
plt.colorbar(art, ax=ax)
[10]: <matplotlib.colorbar.Colorbar at 0x2a79e168348>
None of the three variograms was able to capture the random variability. But the Matheron estimator was also not able to reconstruct the maxima from the random field.
Cressie, N., and D. Hawkins (1980): Robust estimation of the variogram. Math. Geol., 12, 115-125.
Dowd, P. A. (1984): The variogram and kriging: Robust and resistant estimators, in Geostatistics for Natural Resources Characterization. Edited by G. Verly et al., pp. 91-106, D. Reidel, Dordrecht.
Genton, M. G. (1998): Highly robust variogram estimation. Math. Geol., 30, 213-221.
Matheron, G. (1963): Principles of geostatistics. Economic Geology, 58(8), 1246-1266. https://doi.org/10.2113/gsecongeo.58.8.1246
This chapter collects a number of technical notes on using scikit-gstat. These examples either give details on the implementation or guide correct package usage. These are technical notes, not tutorials; the application of the shown examples might not make sense in every situation.
General
The fit function of Variogram relies, as of this writing, on the scipy.optimize.curve_fit() function. That function can be used by just passing a function and a set of x and y values and hoping for the best. However, this will not always yield the best parameters. Especially not for fitting a theoretical variogram function. There are a few assumptions and simplifications that we can state in order to utilize the function in a more meaningful way.
Default fit
The example below shows the performance of the fully unconstrained fit, performed by the Levenberg-Marquardt algorithm. In scikit-gstat, this can be used by setting the fit_method parameter to 'lm'. However, this is not recommended.
In [3]: plt.style.use('ggplot')
The fit of a spherical model will be illustrated with some made-up data representing an experimental variogram:
In [6]: y = [1,7,9,6,14,10,13,9,11,12,14,12,15,13]
In [7]: x = list(range(len(y)))
As the spherical function is compiled using numba, we wrap the function in order to let curve_fit correctly
infer the parameters. Then, fitting is a straightforward task.
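A sketch of that wrapping (see also the user guide section on variogram models):

from scipy.optimize import curve_fit
from skgstat import models

# wrap the compiled function so curve_fit can infer the two parameters
def f(h, a, b):
    return models.spherical(h, a, b)

cof_u, cov = curve_fit(f, x, y)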
In fact this looks quite good. But Levenberg-Marquardt is an unconstrained fitting algorithm and it can fail to find a parameter set. The fit method can therefore also run a box-constrained fitting algorithm: the Trust Region Reflective algorithm, which will find parameters within a given range (box). It is set by the fit_method='trf' parameter and is also the default setting.
Constrained fit
The constrained fitting case was chosen to be the default method in skgstat as the region can easily be specified. Furthermore, it is possible to make a good guess on initial values. As we fit actual variogram parameters, namely the effective range, sill, nugget and, in case of a stable or Matérn model, an additional shape parameter, we know that these parameters cannot be negative. The semi-variance is defined to be always positive. Thus the lower bound of the region will be zero in any case. The upper limit can easily be inferred from the experimental variogram. There are some simple rules that all theoretical functions follow:
• the sill, nugget and their sum cannot be larger than the maximum empirical semi-variance
• the range cannot be larger than maxlag, or, if maxlag is None, the maximum value in the distances
The Variogram class will set the bounds to exactly these values as default behaviour. As an initial guess, it will use the mean value of the semi-variances for the sill, the mean separating distance as the range, and 0 for the nugget. For the presented empirical variogram, the difference between Levenberg-Marquardt and Trust Region Reflective is illustrated in the example below.
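A sketch of the constrained fit, with bounds and initial guesses chosen by the rules above (variable names are assumptions; f, x and y are taken from the previous example):

import numpy as np

# lower bounds are zero; the range is capped by the maximum distance,
# the sill by the maximum empirical semi-variance
bounds = (0, [np.max(x), np.max(y)])
p0 = [np.mean(x), np.mean(y)]

cof_c, cov = curve_fit(f, x, y, p0=p0, bounds=bounds)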
# default plot
In [14]: plt.plot(x, y, 'rD')
Out[14]: [<matplotlib.lines.Line2D at 0x7fa915982860>]
The constrained fit, represented by the solid blue line, is significantly different from the unconstrained fit (dashed green line). The fit is overall better, as a quick RMSE calculation shows:
One last note about fitting a theoretical function: both methods assume all lag classes to be equally important for the fit. In the specific case of a variogram this is not true.
While the standard Levenberg-Marquardt and Trust Region Reflective algorithms are both based on the idea of least squares, they assume all observations to be equally important. In the specific case of a theoretical variogram function, this is not the case. The variogram describes a dependency of covariance in value on the separating distances of the observations. This model already implies that the dependency is stronger on small distances. Considering a kriging interpolation as the main application of the variogram model, points at close distances will get higher weights for the interpolated value of an unobserved location. The weight on large distances will be negligible anyway. Hence, a good fit on small separating distances is way more important. The curve_fit function does not have an option for weighting the squares of specific observations. At least it does not call it 'weights'. In terms of scipy, you can define a 'sigma', which is the uncertainty of the respective point. The uncertainty $\sigma$ influences the least squares calculation as described by the equation:
$$\chi_{sq} = \sum \left(\frac{r}{\sigma}\right)^2$$
That means the larger $\sigma$ is, the less weight the residual will receive. That also means we can almost ignore points by assigning a ridiculously high $\sigma$ to them. The following example should illustrate the effect. This time, the first 7 points will be weighted by a weight $\sigma = [0.1, 0.2, \ldots, 0.9]$ and the remaining points will receive a $\sigma = 1$. In the case of $\sigma = 0.1$, this would change the least squares cost function to:

$$\chi_{sq;x_{1:7}} = \sum (10r)^2$$
In [24]: cm = plt.get_cmap('autumn_r')
In the figure above, you can see how the last points get more and more ignored by the fitting. A smaller $\sigma$ value on the first 7 points means more weight on them. The more yellow lines have a smaller sill and range.
The Variogram class accepts lists like sigma from the code example above as the Variogram.fit_sigma property. This way, the example from above could be implemented. However, Variogram.fit_sigma can also apply a function of distance to the lag classes to derive the $\sigma$ values. There are several predefined functions:
• sigma='linear': the residuals get weighted by the lag distance normalized to the maximum lag distance, denoted as $w_n$
• sigma='exp': the residuals get weighted by the function $w = e^{1/w_n}$
• sigma='sqrt': the residuals get weighted by the function $w = \sqrt{w_n}$
• sigma='sq': the residuals get weighted by the function $w = w_n^2$
The example below illustrates their effect on the sample experimental variograms used so far.
In [30]: cm = plt.get_cmap('gist_earth')
In [32]: s1 = X / np.max(X)
In [33]: s2 = np.exp(1. / X)
In [34]: s3 = np.sqrt(s1)
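A sketch of using the property on any Variogram instance V:

V.fit_sigma = 'linear'
V.plot()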
That’s it.
General
With version 0.2.2, directional variograms have been introduced. A directional variogram is a variogram where point
pairs are only included into the semivariance calculation if they fulfill a specified spatial relation. This relation is
expressed as a search area that identifies all directional points for a given specific point. SciKit-GStat refers to this
point as poi (point of interest). The implementation is done by the DirectionalVariogram class.
Note: The DirectionalVariogram is in general capable of handling n-dimensional coordinates. The applica-
tion of directional dependency is, however, only applied to the first two dimensions.
Understanding the search area of a directional variogram is vital for using the DirectionalVariogram class. The search area is controlled by the directional_model property, which determines the shape of the search area. The extent and orientation of this area are controlled by the parameters:
• azimuth
• tolerance
• bandwidth
As of this writing, SciKit-GStat supports three different search area shapes:
• triangle (default)
• circle
• compass
Additionally, the shape generation is controlled by the tolerance parameter (triangle, compass) and the bandwidth parameter (triangle, circle). The azimuth is used to rotate the search area into a desired direction. An azimuth of 0° is heading East in the coordinate plane. Positive values for azimuth rotate the search area clockwise, negative values counter-clockwise. The tolerance specifies how far the angle (against the x-axis) between two points can be off the azimuth for the pair to still be considered a directional point pair. Based on this definition, two points at a larger distance would generally be allowed to differ more from the azimuth in terms of coordinate distance. Therefore the bandwidth defines a maximum coordinate distance a point can have from the azimuth line. The difference between the triangle and the compass search area is that the triangle uses the bandwidth and the compass does not.
The DirectionalVariogram has a function to plot the current search area. As the search area is specific to the
current poi, it has to be defined as the index of the coordinate to be used. The method is called search_area. Using
random coordinates, the search area shapes are presented below.
In [4]: plt.style.use('ggplot')
In [5]: np.random.seed(42)
In [7]: np.random.seed(42)
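The instantiation cells were lost in extraction; a sketch (sample sizes and parameter values are assumptions):

from skgstat import DirectionalVariogram

coords = np.random.randint(0, 100, (50, 2))
values = np.random.normal(5, 1, 50)
DV = DirectionalVariogram(coords, values, azimuth=0, tolerance=45)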
In [10]: DV.search_area(poi=3)
Out[10]: <Figure size 800x800 with 1 Axes>
In [12]: DV.set_directional_model('circle')
In [14]: DV.set_directional_model('compass')
In [16]: fig.show()
In order to apply different search area shapes and rotate them according to the given azimuth, a few preprocessing steps are necessary. This can lead to some calculation overhead, as in the case of a compass model. The selection of directional point pairs is implemented by transforming the coordinates into a local reference system, iteratively for each coordinate. For multidimensional coordinates, only the first two dimensions are used. They are shifted to make the current point of interest the origin of the local reference system. Then all other points are rotated until the azimuth overlays the local x-axis. This makes the definition of different shapes way easier. In this local system, the bandwidth can easily be applied to the transformed y-axis. The set_directional_model function can also set a custom function as the search area shape, which accepts the current local reference system and returns the search area for the given poi. The search area has to be returned as a shapely Polygon. Unfortunately, the tolerance and bandwidth parameters are not passed yet.
Note: For the next release, it is planned to pass all necessary parameters to the directional model function. This
should greatly improve the definition of custom shapes. Until the implementation, the parameters have to be injected
directly.
The following example will illustrate the rotation of the local reference system.
In [18]: np.random.seed(42)
In [29]: fig.show()
After shifting and rotating, any type of geometry could be generated and passed as the search area.
In [30]: from shapely.geometry import Polygon
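The cell defining M was lost in extraction; a sketch of a custom search area in the local reference system (the coordinates are arbitrary):

In [31]: M = Polygon([(0, 0), (20, -3), (20, 3)])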
In [32]: DV.set_directional_model(M)
In [33]: DV.search_area(poi=0)
Out[33]: <Figure size 800x800 with 1 Axes>
Directional variograms
In principle, the DirectionalVariogram can be used just like the Variogram base class. In fact, DirectionalVariogram inherits most of the behaviour. All the functionality described in the previous sections was added to the basic Variogram. All other methods and attributes can be used in the same way.
Warning: In order to implement the directional dependency, some methods have been rewritten in
DirectionalVariogram. Thus the following methods do not show the same behaviour:
• DirectionalVariogram.bins
• DirectionalVariogram._calc_groups
General
Generally speaking, the kriging procedure for one unobserved point (poi) can be broken down into three different
steps.
1. Calculate the distance matrix between the poi and all observed locations to determine the in-range points and
apply the constraints on the minimum and maximum number of points to be used.
2. Build the kriging equation system by calculating the semi-variance for all distances left over from step 1. Formulate
the squareform matrix and add the Lagrange multipliers.
3. Solve the kriging equation system, usually by matrix inversion.
Here, we try to optimize step 2 for performance. The basic idea is to estimate the semi-variances instead of
calculating them on each iteration. A conceptual sketch of the three steps is given below.
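The following sketch is a conceptual illustration of the three steps, not skgstat's internal implementation; the semi-variance model gamma and all names are made up:

import numpy as np
from scipy.spatial.distance import cdist

def krige_one(poi, coords, values, gamma, max_points=10):
    # step 1: distance matrix between the poi and all observed locations
    dist = cdist([poi], coords)[0]
    idx = np.argsort(dist)[:max_points]   # apply the maximum points constraint
    n = len(idx)
    # step 2: kriging equation system from the semi-variances,
    # extended by ones and a zero for the Lagrange multiplier
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = gamma(cdist(coords[idx], coords[idx]))
    A[-1, -1] = 0.0
    b = np.append(gamma(dist[idx]), 1.0)
    # step 3: solve the system; the last entry is the Lagrange multiplier
    weights = np.linalg.solve(A, b)[:-1]
    return weights.dot(values[idx])

# usage sketch with a made-up exponential model:
# est = krige_one((50., 50.), coords, values, lambda h: 10. * (1. - np.exp(-h / 15.)))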
Calculating the semi-variance for all elements in the kriging equation system gives us the best solution for the interpolation
problem formulated by the respective variogram. The main point is that, in a kriging modeling application, the distances for each
unobserved location differ at least slightly from those of all other unobserved locations. The variogram
parameters do not change; they are static within one modeling. This is what we want to exploit. The main advantage is
that the effective range is constant in this setting. If we can now specify a precision at which we want to resolve the
range, we can pre-calculate the corresponding semi-variance values. In the time-critical iterative formulation of the
kriging equation system, one would use the pre-calculated value of the closest distance.
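The pre-calculation itself can be sketched as a simple lookup table. Again, this is a conceptual illustration with made-up names and model, not skgstat's code:

import numpy as np

effective_range = 40.0
precision = 100
gamma = lambda h: 10.0 * (1.0 - np.exp(-h / (effective_range / 3.0)))  # made-up model

support = np.linspace(0, effective_range, precision)
lut = gamma(support)   # semi-variances, calculated once per simulation

def gamma_estimate(h):
    # nearest pre-calculated value; lags beyond the range use the last entry
    i = min(int(h / effective_range * (precision - 1)), precision - 1)
    return lut[i]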
The precision is a hyperparameter. That means it is up to the user to decide how precise the estimation of the kriging
itself can get, given an estimated kriging equation system. The main advantage is that the range and precision are
constant values within the scope of a simulation, and therefore the expected uncertainty can be calculated and the
precision adjusted. This will take some effort to fine-tune the kriging instance, but it can yield results that are
only numerically different while speeding up the calculation by an order of magnitude.
In terms of uncertainty, one can think of a variogram function, where the given lag distance is uncertain. This deviation
can be calculated as:
d = \frac{\mathrm{range}}{\mathrm{precision}}
and increasing the precision will obviously decrease the lag deviation.
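For example, an effective range of 40 resolved with a precision of 100 yields a maximum lag deviation of d = 40 / 100 = 0.4 lag units.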
Example
This example should illustrate the idea behind the estimation and show how the precision value can influence the result.
An arbitrary variogram is created and then recalculated by the OrdinaryKriging routine to illustrate the precision.
In [6]: np.random.seed(42)
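# the creation of the data, the Variogram V and the OrdinaryKriging instance ok
# was lost in extraction; a plausible sketch (the data generation is an assumption):
In [7]: coords = np.random.random((300, 2)) * 100
In [8]: vals = np.random.normal(10, 3, 300)
In [9]: V = Variogram(coords, vals); ok = OrdinaryKriging(V)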
# exact calculation
In [10]: x = np.linspace(0, ok.range * 1.3, 120)
# estimation
In [12]: ok.mode = 'estimate'
There is almost no difference between the two lines and the result that can be expected will be very similar, as the
kriging equation system will yield very similar weights to make the prediction.
If the precision is, however, chosen too coarse, there is a difference in the reconstructed variogram. This way, the idea
behind the estimation becomes quite obvious.
# make precision really small
In [17]: ok.precision = 10
property bin_func
Binning function
Returns an instance of the function used for binning the separating distances into the given number of bins.
Both functions use the same signature of func(distances, n, maxlag).
The setter of this property utilizes the Variogram.set_bin_func to set a new function.
Returns binning_function
Return type function
See also:
Variogram.set_bin_func
clone()
Deep copy of self
Return a deep copy of self.
Returns
Return type Variogram
property coordinates
Coordinates property
Array of observation locations the variogram is built for. This property has no setter. If you want to change
the coordinates, use a new Variogram instance.
Returns coordinates
Return type numpy.array
data(n=100, force=False)
Theoretical variogram function
Calculate the experimental variogram and apply the binning. On success, the variogram model will be
fitted and applied to n lag values. Returns the lags and the calculated semi-variance values. If force is
True, a clean preprocessing and fitting run will be executed.
Parameters
• n (integer) – length of the lags array to be used for fitting. Defaults to 100, which will
be fine for most plots
• force (boolean) – If True, the preprocessing and fitting will be executed as a clean
run. This will force all intermediate results to be recalculated. Defaults to False
Returns variogram – the first element is the created lags array, the second element contains the calculated
semi-variance values
Return type tuple
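A short usage sketch, assuming V names a fitted Variogram instance:

lags, gamma = V.data(n=100)   # lags and the modeled semi-variance values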
describe()
Variogram parameters
Return a dictionary of the variogram parameters.
Returns
Return type dict
See also:
scipy.optimize(), scipy.optimize.curve_fit(), scipy.optimize.leastsq(),
scipy.optimize.least_squares()
property fit_sigma
Fitting Uncertainty
Set or calculate an array of observation uncertainties aligned to the Variogram.bins. These will be used to
weight the observations in the cost function, which divides the residuals by their uncertainty.
When setting fit_sigma, the array of uncertainties itself can be given, or one of the strings: [‘linear’, ‘exp’,
‘sqrt’, ‘sq’]. The parameters described below refer to the setter of this property.
Parameters sigma (string, array) – Sigma can either be an array of discrete uncertainty
values, which have to align to the Variogram.bins, or of type string. Then, the weights for
fitting are calculated as a function of (lag) distance.
• sigma=’linear’: The residuals get weighted by the lag distance normalized to the maximum lag distance, denoted as $w_n$
• sigma=’exp’: The residuals get weighted by the function: $w = e^{1/w_n}$
• sigma=’sqrt’: The residuals get weighted by the function: $w = \sqrt{w_n}$
• sigma=’sq’: The residuals get weighted by the function: $w = w_n^2$
Returns
Return type void
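Both ways of setting the property in a short sketch (V names a Variogram instance; the uncertainty values are made up):

import numpy as np

V.fit_sigma = 'linear'                            # weights derived from the lag distance
V.fit_sigma = np.linspace(0.5, 2.0, len(V.bins))  # one uncertainty value per lag class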
Notes
Unlike Variogram.nrmse, nrmse_r is not normalized to the mean of y, but to the difference of the maximum y
to its mean:

NRMSE_r = \frac{RMSE}{\max(y) - \mathrm{mean}(y)}
property parameters
Extract just the variogram parameters range, sill and nugget from the self.describe return
Returns
plot(axes=None, grid=True, show=True, hist=True)
Variogram Plot
Plot the experimental variogram, the fitted theoretical function and a histogram for the lag classes. The
axes attribute can be used to pass a list of AxesSubplots or a single instance to the plot function. Then
these Subplots will be used. If only a single instance is passed, the hist attribute will be ignored as only
the variogram will be plotted anyway.
Parameters
• axes (list, tuple, array, AxesSubplot or None) – If None, the plot
function will create a new matplotlib figure. Otherwise a single instance or a list of Axes-
Subplots can be passed to be used. If a single instance is passed, the hist attribute will be
ignored.
• grid (bool) – Defaults to True. If True a custom grid will be drawn through the lag
class centers
• show (bool) – Defaults to True. If True, the show method of the passed or created
matplotlib Figure will be called before returning the Figure. This should be set to False,
when used in a Notebook, as a returned Figure object will be plotted anyway.
• hist (bool) – Defaults to True. If False, the creation of a histogram for the lag classes
will be suppressed.
Returns
Return type matplotlib.Figure
preprocessing(force=False)
Preprocessing function
Prepares all input data for the fit and transform functions. Namely, the distances and the pairwise value
differences are calculated. Then the binning is set up and bin edges are calculated. If any of the listed subsets are
already prepared, their processing is skipped. This behaviour can be changed by the force parameter. This
will cause a clean preprocessing.
Parameters force (bool) – If set to True, all preprocessing data sets will be deleted. Use it
in case you need a clean preprocessing.
Returns
Return type void
property r
Pearson correlation of the fitted Variogram
Returns
property residuals
Model residuals
Calculate the model residuals defined as the differences between the experimental variogram and the theoretical model values at corresponding lag values
Returns
Return type numpy.ndarray
property rmse
RMSE
Calculate the Root Mean Squared Error between the experimental variogram and the theoretical model
values at corresponding lags. Can be used as a fitting quality measure.
Returns
Return type float
See also:
Variogram.residuals
set_bin_func(bin_func)
Set binning function
Sets a new binning function to be used. The new binning method is set by a string identifying the new
function to be used. Can be one of: [‘even’, ‘uniform’].
Parameters bin_func (str) – Can be one of:
• ’even’: Use skgstat.binning.even_width_lags for using n_lags lags of equal width up to
maxlag.
• ’uniform’: Use skgstat.binning.uniform_count_lags for using n_lags lags up to maxlag in
which the pairwise differences follow a uniform distribution.
Returns
Return type void
See also:
Variogram.bin_func(), skgstat.binning.uniform_count_lags(), skgstat.
binning.even_width_lags()
set_dist_function(func)
Set distance function
Set the function used for distance calculation. func can either be a callable or a string. The ranked distance
function is not implemented yet. Strings will be forwarded to the scipy.spatial.distance.pdist function as
the metric argument. If func is a callable, it has to return the upper triangle of the distance matrix as a flat
array (like the pdist function).
Parameters func (string, callable) –
Returns
Return type numpy.array
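A sketch of both variants (V names a Variogram instance; the callable is assumed to receive the coordinate array):

from scipy.spatial.distance import pdist

V.set_dist_function('cityblock')                                      # any pdist metric name
V.set_dist_function(lambda coords: pdist(coords, metric='canberra'))  # flat upper triangle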
set_model(model_name)
Set model as the new theoretical variogram function.
set_values(values)
Set new values
Will set the passed array as the new value array. This array has to be of the same length as the first axis of the
coordinates array. The Variogram class only accepts one-dimensional arrays. On success, all fitting
parameters are deleted and the pairwise differences are recalculated.
Parameters values (numpy.ndarray) –
Returns
Return type void
See also:
Variogram.values()
to_DataFrame(n=100, force=False)
Variogram DataFrame
Returns the fitted theoretical variogram as a pandas.DataFrame instance. The n and force parameters control
the calculation; refer to the data function for more info.
Parameters
• n (integer) – length of the lags array to be used for fitting. Defaults to 100, which will
be fine for most plots
• force (boolean) – If True, the preprocessing and fitting will be executed as a clean
run. This will force all intermediate results to be recalculated. Defaults to False
Returns
Return type pandas.DataFrame
See also:
Variogram.data()
transform(x)
Transform
Transform a given set of lag values to the theoretical variogram function using the actual fitting and pre-
processing parameters in this Variogram instance
Parameters x (numpy.array) – Array of lag values to be used as model input for the fitted
theoretical variogram model
Returns
Return type numpy.array
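For example, evaluating the fitted model at a few lags (V names a fitted Variogram instance):

import numpy as np

gamma = V.transform(np.array([10.0, 25.0, 50.0]))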
property value_matrix
Value matrix
Returns a matrix of pairwise differences in absolute values. The matrix will have the shape (m, m) with m
= len(Variogram.values). Note that Variogram.values holds the values themselves, while the value_matrix
consists of their pairwise differences.
Returns values – Matrix of pairwise absolute differences of the values.
Return type numpy.matrix
See also:
Variogram._diff
property values
Values property
Array of observations the variogram is built for. The setter of this property utilizes the Variogram.set_values function for setting new arrays.
Returns values
Return type numpy.ndarray
See also:
Variogram.set_values
Visual representations, usage hints and implementation specifics are given in the documentation.
• azimuth (float) – The azimuth of the directional dependence for this Variogram, given
as an angle in degree. The East of the coordinate plane is set to be at 0° and is counted
clockwise to 180° and counter-clockwise to -180°. Only points lying in the azimuth of a
specific point will be used for forming point pairs.
• tolerance (float) – The tolerance is given as an angle in degree. Points being dislocated
from the exact azimuth by half the tolerance will be accepted as well. It’s half the
tolerance, as the point may be dislocated in the positive and negative direction from the
azimuth.
• bandwidth (float) – Maximum deviation acceptable in coordinate units, which is
usually meters. Points at higher distances may be far dislocated from the azimuth in terms
of coordinate distance, as the tolerance is defined as an angle. The bandwidth defines a
maximum width for the search window. It will be perpendicular to and bisected by the
azimuth.
• use_nugget (bool) – Defaults to False. If True, a nugget effect will be added to all
Variogram.models as a third (or fourth) fitting parameter. A nugget is essentially the y-
axis intercept of the theoretical variogram function.
• maxlag (float, str) – Can specify the maximum lag distance directly by giving a
value larger than 1. The binning function will not find any lag class with an edge larger than
maxlag. If 0 < maxlag < 1, then maxlag is relative and maxlag * max(Variogram.distance)
will be used. In case maxlag is a string, it has to be one of ‘median’, ‘mean’. Then the
median or mean of all Variogram.distance will be used. Note that maxlag=0.5 will use half the
maximum separating distance; this is not the same as ‘median’, which is the median of all
separating distances.
• n_lags (int) – Specify the number of lag classes to be defined by the binning function.
• verbose (bool) – Set the verbosity of the class. Not implemented yet.
_triangle(local_ref )
Triangular Search Area
Construct a triangular bounded search area for building directional dependent point pairs. The Search Area
will be located onto the current point of interest and the local x-axis is rotated onto the azimuth angle.
Parameters local_ref (numpy.array) – Array of all coordinates transformed into a local
representation with the current point of interest being the origin and the azimuth angle aligned
onto the x-axis.
Returns search_area – Search Area of triangular shape bounded by the current bandwidth.
Return type Polygon
Notes
C
/|\
a / | \ a
/__|h_\
A c B
The point of interest is C and c is the bandwidth. The angle at C (gamma) is the tolerance. From this, a and
then h can be calculated. When rotated into the local coordinate system, the two points needed to build the
search area A,B are A := (h, 1/2 c) and B:= (h, -1/2 c)
a can be calculated like:

a = \frac{c}{2 \cdot \sin\left(\frac{\gamma}{2}\right)}
See also:
DirectionalVariogram._compass(), DirectionalVariogram._circle()
_circle(local_ref )
Circular Search Area
Construct a half-circled bounded search area for building directional dependent point pairs. The Search
Area will be located onto the current point of interest and the local x-axis is rotated onto the azimuth angle.
The radius of the half-circle is set to half the bandwidth.
Parameters local_ref (numpy.array) – Array of all coordinates transformed into a local
representation with the current point of interest being the origin and the azimuth angle aligned
onto the x-axis.
Returns search_area – Search Area of half-circular shape bounded by the current bandwidth.
Return type Polygon
:raises ValueError: in case the DirectionalVariogram.bandwidth is None or 0.
See also:
DirectionalVariogram._triangle(), DirectionalVariogram._compass()
_compass(local_ref )
Compass direction Search Area
Construct a search area for building directional dependent point pairs. The compass search area will not
be bounded by the bandwidth. It will include all point pairs at the azimuth direction with a given tolerance.
The Search Area will be located onto the current point of interest and the local x-axis is rotated onto the
azimuth angle.
Parameters local_ref (numpy.array) – Array of all coordinates transformed into a local
representation with the current point of interest being the origin and the azimuth angle aligned
onto the x-axis.
Returns search_area – Search Area of the given compass direction.
Return type Polygon
Notes
The necessary figure is built by searching for the intersection of a half-tolerance angled line with a vertical
line at the maximum x-value. Using polar coordinates, these points (positive and negative half-tolerance
angle) are the edges of the search area in the local coordinate system. The radius of a polar coordinate can
be calculated as:

r = \frac{x}{\cos(\alpha / 2)}
The two bounding points P1 and P2 (in local coordinates) are then (xmax, y) and (xmax, -y), with xmax
being the maximum local x-coordinate representation and y:

y = r \cdot \sin\left(\frac{\alpha}{2}\right)
See also:
DirectionalVariogram._triangle(), DirectionalVariogram._circle()
_direction_mask()
Directional Mask
Array aligned to self.distance masking all point pairs which shall be ignored for binning and grouping.
The one dimensional array contains all row-wise point pair combinations from the upper or lower triangle
of the distance matrix in case either of both is directional.
TODO: This array is not cached. It is used twice, for binning and grouping.
Returns mask – Array aligned to self.distance giving for each point pair combination a boolean
value whether the point pair is directional or not.
Return type numpy.array
property azimuth
Direction azimuth
Main direction for the selection of points in the formation of point pairs. East of the coordinate plane is
defined to be 0° and the azimuth is counted clockwise up to 180° and counter-clockwise to -180°.
Parameters angle (float) – New azimuth angle in degree.
:raises ValueError: in case angle < -180° or angle > 180°
property bandwidth
Tolerance bandwidth
New bandwidth parameter. As the tolerance from azimuth is given as an angle, point pairs at high distances
can be far off the azimuth in coordinate distance. The bandwidth limits this distance and has the unit of
the coordinate system.
Parameters width (float) – Positive coordinate distance.
:raises ValueError: in case width is negative
local_reference_system(poi)
Calculate local coordinate system
The coordinates will be transformed into a local reference system that will simplify the directional de-
pendence selection. The point of interest (poi) of the current iteration will be used as origin of the local
reference system and the x-axis will be rotated onto the azimuth.
Parameters poi (tuple) – First two coordinate dimensions of the point of interest. Will be
used as the new origin.
Returns local_ref – Array of dimension (m, 2) where m is the length of the coordinates array.
Transformed coordinates in the same order as the original coordinates.
Return type numpy.array
search_area(poi=0, ax=None)
Plot Search Area
Parameters
• poi (integer) – Point of interest. Index of the coordinate that shall be used to visualize
the search area.
• ax (None, matplotlib.AxesSubplot) – If not None, the Search Area will be
plotted into the given Subplot object. If None, a new matplotlib Figure will be created and
returned
Returns
Return type matplotlib.Figure
set_directional_model(model_name)
Set new directional model
The model used for selecting all points fulfilling the directional constraint of the Variogram. A predefined
model can be selected by passing the model name as string. Optionally a function can be passed that
accepts the current local coordinate system and returns a Polygon representing the search area. In this
case, the tolerance and bandwidth has to be incorporated by hand into the model. The azimuth is handled
by the class. The predefined options are:
• ‘compass’: includes points in the direction of the azimuth at given tolerance. The bandwidth pa-
rameter will be ignored.
• ‘triangle’: constructs a triangle with an angle of tolerance at the point of interest, unioned with a rectangle
parallel to the azimuth once the hypotenuse length reaches bandwidth.
• ‘circle’: constructs a half circle touching the point of interest, dislocating the center at the distance of
bandwidth in the direction of azimuth. The half circle is unioned with a rectangle parallel to the azimuth.
Visual representations, usage hints and implementation specifics are given in the documentation.
Parameters model_name (string, callable) – The name of the predefined model
(string) or a function that accepts the current local coordinate system and returns a Poly-
gon of the search area.
property tolerance
Azimuth tolerance
Tolerance angle of how far a point can be off the azimuth for being still counted as directional. A tolerance
angle will be applied to the azimuth angle symmetrically.
Parameters angle (float) – New tolerance angle in degree. Has to meet 0 <= angle <= 360.
:raises ValueError: in case angle < 0 or angle > 360
contour(ax=None, zoom_factor=100.0, levels=10, colors='k', linewidths=0.3, method='fast', **kwargs)
Variogram 2D contour plot
Plot a 2D contour plot of the experimental variogram. The experimental semi-variance values are spanned
over a space-time lag meshgrid. This grid is (linearly) interpolated onto the given resolution for visual
reasons. Then, contour lines are calculated from the denser grid. Their number can be specified by levels.
Parameters
• ax (matplotlib.AxesSubplot, None) – If None a new matplotlib.Figure will be
created, otherwise the plot will be rendered into the given subplot.
• zoom_factor (float) – The experimental variogram will be interpolated onto a regular
grid for visual reasons. The density of this plot can be set by zoom_factor. A factor of
10 will enlarge each of the axes by 10. Higher zoom_factors result in smoother contours,
but are expensive in calculation time.
• levels (int) – Number of levels to be formed for finding contour lines. More levels
result in more detailed plots, but are expensive in terms of calculation time.
• colors (str, list) – Will be passed down to matplotlib.pyplot.contour as the colors
parameter.
• linewidths (float, list) – Will be passed down to matplotlib.pyplot.contour as
linewidths parameter.
• method (str) – The method used for densifying the meshgrid. Can be one of ‘fast’ or
‘precise’. Fast will use the scipy.ndimage.zoom method to increase the node density. This
is fast, but cannot interpolate behind any NaN occurrence. ‘Precise’ performs an actual
linear interpolation between the nodes using scipy.interpolate.griddata. This takes more
time, but the result is less smoothed out.
• kwargs (dict) – Other arguments that can be specific to contour or contourf type.
Accepts xlabel, ylabel, xlim and ylim as of this writing.
Returns fig – The Figure object used for rendering the contour plot.
Return type matplotlib.Figure
See also:
SpaceTimeVariogram.contourf()
contourf(ax=None, zoom_factor=100.0, levels=10, cmap='RdYlBu_r', method='fast', **kwargs)
Variogram 2D filled contour plot
Plot a 2D filled contour plot of the experimental variogram. The experimental semi-variance values are
spanned over a space-time lag meshgrid. This grid is (linearly) interpolated onto the given resolution for
visual reasons. Then, contour lines are calculated from the denser grid. Their number can be specified by
levels. Finally, each contour region is filled with a color supplied by the specified cmap.
Parameters
• ax (matplotlib.AxesSubplot, None) – If None a new matplotlib.Figure will be
created, otherwise the plot will be rendered into the given subplot.
• zoom_factor (float) – The experimental variogram will be interpolated onto a regular
grid for visual reasons. The density of this plot can be set by zoom_factor. A factor of
10 will enlarge each of the axes by 10. Higher zoom_factors result in smoother contours,
but are expensive in calculation time.
• levels (int) – Number of levels to be formed for finding contour lines. More levels
result in more detailed plots, but are expensive in terms of calculation time.
• cmap (str) – Will be passed down to matplotlib.pyplot.contourf as cmap parameter. Can
be any valid color range supported by matplotlib.
• method (str) – The method used for densifying the meshgrid. Can be one of ‘fast’ or
‘precise’. Fast will use the scipy.ndimage.zoom method to increase the node density. This
is fast, but cannot interpolate behind any NaN occurrence. ‘Precise’ performs an actual
linear interpolation between the nodes using scipy.interpolate.griddata. This takes more
time, but the result is less smoothed out.
• kwargs (dict) – Other arguments that can be specific to contour or contourf type.
Accepts xlabel, ylabel, xlim and ylim as of this writing.
Returns fig – The Figure object used for rendering the contour plot.
Return type matplotlib.Figure
See also:
SpaceTimeVariogram.contour()
create_TMarginal()
Create an instance of skgstat.Variogram for the time marginal variogram by arranging the coordinates and
values and inferring parameters from this SpaceTimeVariogram instance.
create_XMarginal()
Create an instance of skgstat.Variogram for the space marginal variogram by arranging the coordinates and
values and inferring parameters from this SpaceTimeVariogram instance.
property distance
Distance matrices
Returns both the space and time distance matrix. This property is equivalent to two separate calls of
xdistance and tdistance.
Returns distance matrices – Returns a tuple of the two distance matrices in space and time.
Each distance matrix is a flattened upper triangle of the distance matrix squareform in row
orientation.
Return type (numpy.array, numpy.array)
property experimental
Experimental Variogram
Returns an experimental variogram for the given data. The semi-variances are arranged over the spatial
binning as defined in SpaceTimeVariogram.xbins and the temporal binning defined in SpaceTimeVariogram.tbins.
Returns variogram – Returns a two-dimensional array of semi-variances, with space on the first
axis and time on the second axis.
Return type numpy.ndarray
property fitted_model
get_marginal(axis, lag=0)
Marginal Variogram
Returns the marginal experimental variogram of axis for the given lag on the other axis. Axis can either be
‘space’ or ‘time’. The parameter lag specifies the index of the desired lag class on the other axis.
Parameters
• axis (str) – The axis a marginal variogram shall be calculated for. Can either be
‘space’ or ‘time’.
• lag (int) – Index of the lag class group on the other axis to be used. In case this is 0,
this is often considered to be the marginal variogram of the axis.
Returns variogram – Marginal variogram of the given axis
plot(kind='scatter', ax=None, **kwargs)
Plot the experimental space-time variogram. The 3D scatter plot is the default plot. However, 3D plots can be, especially for scientific usage, a bit problematic. Therefore the
plot function can plot a variety of 3D and 2D plots.
Parameters
• kind (str) – Has to be one of:
– scatter
– surface
– contour
– contourf
– matrix
– marginals
• ax (matplotlib.AxesSubplot, mpl_toolkits.mplot3d.Axes3D,
None) – If None, the function will create a new figure and suitable Axes. Else, the Axes
object can be passed to plot the variogram into an existing figure. In this case, one has to
pass the correct type of Axes, whether it’s a 3D or 2D kind of a plot.
• kwargs (dict) – All keyword arguments are passed down to the actual plotting function.
Refer to their documentation for a more detailed description.
Returns fig
Return type matplotlib.Figure
See also:
SpaceTimeVariogram.scatter(), SpaceTimeVariogram.surface(),
SpaceTimeVariogram.marginals()
preprocessing(force=False)
Preprocessing
Start all necessary calculation jobs needed to derive an experimental variogram. This has to be present
before the model fitting can be done. The force parameter will make all calculation functions delete all
cached intermediate results and make a clean calculation.
Parameters force (bool) – If True, all cached intermediate results will be deleted and a clean
calculation will be done.
scatter(ax=None, elev=30, azim=220, c='g', depthshade=True, **kwargs)
3D Scatter Variogram
Plot the experimental variogram into a 3D matplotlib.Figure. The two variogram axes (space, time) will
span a meshgrid over the x and y axis and the semivariance will be plotted as z value over the respective
space and time lag coordinate.
Parameters
• ax (mpl_toolkits.mplot3d.Axes3D, None) – If ax is None (default), a new
Figure and Axes instance will be created. If ax is given, this instance will be used for the
plot.
• elev (int) – The elevation of the 3D plot, which is a rotation over the xy-plane.
• azim (int) – The azimuth of the 3D plot, which is a rotation over the z-axis.
• c (str) – Color of the scatter points, will be passed to the matplotlib c argument. The
function also accepts color as an alias.
• depthshade (bool) – If True, the scatter points will change their color according to
the distance from the viewport for illustration reasons.
• kwargs (dict) – Other kwargs accepted are only color as an alias for c and
figsize, if ax is None. Anything else will be ignored.
Returns fig
Return type matplotlib.Figure
Examples
In case an ax shall be passed to the function, note that this plot requires an AxesSubplot that is capable of
creating a 3D plot. This can be done like:
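A minimal sketch, assuming STV names an existing SpaceTimeVariogram instance:

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 - registers the '3d' projection

fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(111, projection='3d')
STV.scatter(ax=ax)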
See also:
SpaceTimeVariogram.surface()
set_bin_func(bin_func, axis)
Set binning function
Set a new binning function to either the space or time axis. Both axes support the methods: [‘even’,
‘uniform’]:
• ‘even’, create even width bins
• ‘uniform’, create bins of uniform distribution
Parameters
• bin_func (str) – Specifies the function to be loaded. Can be either ‘even’ or ‘uniform’.
• axis (str) – Specifies the axis to be used for binning. Can be either ‘space’ or ‘time’,
or one of the two shortcuts ‘s’ and ‘t’
See also:
skgstat.binning.even_width_lags(), skgstat.binning.uniform_count_lags()
set_model(model_name)
Set space-time model
Set a new space-time model. It has to be either a callable of correct signature or a string identifying one of
the predefined models
Parameters model_name (str, callable) – Either a callable of correct signature or a
valid model name. Valid names are:
• sum
• product
• product-sum
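For example (STV names a SpaceTimeVariogram instance):

STV.set_model('product-sum')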
set_tdist_func(func_name)
Set new time distance function
Set a new function for calculating the distance matrix in the time dimension. At the moment only strings
are supported. Will be passed to scipy.spatial.distance.pdist as ‘metric’ attribute.
Parameters func_name (str) – The name of the function used to calculate the pairwise dis-
tances. Will be passed to scipy.spatial.distance.pdist as the ‘metric’ attribute.
:raises ValueError : in case a non-string argument is passed.:
set_values(values)
Set new values
The values should be an (m, n) array with m matching the size of the coordinates’ first dimension and n
being the time dimension.
:raises ValueError: in case n <= 1 or values is not an array of correct dimensionality
:raises AttributeError: in case values cannot be converted to a numpy.array
set_xdist_func(func_name)
Set new space distance function
Set a new function for calculating the distance matrix in the space dimension. At the moment only strings
are supported. Will be passed to scipy.spatial.distance.pdist as ‘metric’ attribute.
Parameters func_name (str) – The name of the function used to calculate the pairwise dis-
tances. Will be passed to scipy.spatial.distance.pdist as the ‘metric’ attribute.
:raises ValueError : in case a non-string argument is passed.:
surface(ax=None, elev=30, azim=220, color='g', alpha=0.5, **kwargs)
3D Surface Variogram
Plot the experimental variogram into a 3D matplotlib.Figure. The two variogram axes (space, time) will
span a meshgrid over the x and y axis and the semi-variance will be plotted as z value over the respective
space and time lag coordinate. Unlike scatter the semi-variance will not be scattered as points but rather
as a surface plot. The surface is approximated by (Delaunay) triangulation of the z-axis.
Parameters
• ax (mpl_toolkits.mplot3d.Axes3D, None) – If ax is None (default), a new
Figure and Axes instance will be created. If ax is given, this instance will be used for the
plot.
• elev (int) – The elevation of the 3D plot, which is a rotation over the xy-plane.
• azim (int) – The azimuth of the 3D plot, which is a rotation over the z-axis.
• color (str) – Color of the surface, will be passed to the matplotlib color argument.
The function also accepts c as an alias.
• alpha (float) – Sets the transparency of the surface as 0 <= alpha <= 1, with 0 being
completely transparent.
• kwargs (dict) – Other kwargs accepted are only c as an alias for color and
figsize, if ax is None. Anything else will be ignored.
Returns fig
Return type matplotlib.Figure
Notes
In case an ax shall be passed to the function, note that this plot requires an AxesSubplot that is capable of
creating a 3D plot. This can be done like:
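As for scatter, a minimal sketch assuming STV names an existing SpaceTimeVariogram instance:

import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(111, projection='3d')
STV.surface(ax=ax)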
See also:
SpaceTimeVariogram.scatter()
property tbins
Temporal binning
Returns the bin edges over the temporal axis. These can be used to align the temporal lag class grouping
to actual time lags. The length of the array matches the number of temporal lag classes.
Returns bins – Returns the edges of the current temporal binning.
Return type numpy.array
property tdistance
Time distance
Returns a distance matrix containing the distance of all observation points in time. The time ‘coordinates’
are created from the values multidimensional array, where the second dimension is assumed to be time.
The unit will be time steps.
Returns tdistance – 1D-array of the upper triangle of a squareform representation of the dis-
tance matrix.
Return type numpy.array
property values
Values
The SpaceTimeVariogram stores (and needs) the observations as a two-dimensional array. The first axis
(rows) needs to match the coordinate array, but instead of containing one value for each location, the values
shall contain a time series per location.
Returns values – Returns a two dimensional array of all observations. The first dimension
(rows) matches the coordinate array and the second axis contains the time series for each
observation point.
Return type numpy.array
property xbins
Spatial binning
Returns the bin edges over the spatial axis. These can be used to align the spatial lag class grouping to
actual distance lags. The length of the array matches the number of spatial lag classes.
Returns bins – Returns the edges of the current spatial binning.
Return type numpy.array
property xdistance
Distance matrix (space)
Return the upper triangle of the squareform pairwise distance matrix.
Returns xdistance – 1D-array of the upper triangle of a squareform representation of the dis-
tance matrix.
Return type numpy.array
Scikit-GStat implements various semi-variance estimators. These functions can be found in the skgstat.estimators
submodule. Each of these functions can be used independently of the Variogram class. In this case the estimator
expects an array of pairwise differences to calculate the semi-variance, not the values themselves.
Matheron
skgstat.estimators.matheron()
Matheron Semi-Variance
Calculates the Matheron Semi-Variance from an array of pairwise differences. Returns the semi-variance for
the whole array. In case a semi-variance is needed for multiple groups, this function has to be mapped on each
group. That is the typical use case in geostatistics.
Parameters x (numpy.ndarray) – Array of pairwise differences. These values should be the
distances between pairwise observations in value space. If xi and x[i+h] fall into the h separating
distance class, x should contain abs(xi - x[i+h]) as an element.
Returns
Return type numpy.float64
Notes
This implementation follows the original publication [1] and the notes on their application [2]. Following the 1962
publication [1], the semi-variance is calculated as:

\gamma(h) = \frac{1}{2 N(h)} \sum_{i=1}^{N(h)} x^2
with:
x = Z(x_i) - Z(x_{i+h})
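A usage sketch on made-up pairwise differences:

import numpy as np
from skgstat.estimators import matheron

diffs = np.abs(np.random.normal(0, 2, 100))  # pairwise differences of one lag class
print(matheron(diffs))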
References
Cressie
skgstat.estimators.cressie()
Cressie-Hawkins Semi-Variance
Calculates the Cressie-Hawkins Semi-Variance from an array of pairwise differences. Returns the semi-variance
for the whole array. In case a semi-variance is needed for multiple groups, this function has to be mapped on
each group. That is the typical use case in geostatistics.
Parameters x (numpy.ndarray) – Array of pairwise differences. These values should be the
distances between pairwise observations in value space. If xi and x[i+h] fall into the h separating
distance class, x should contain abs(xi - x[i+h]) as an element.
Returns
Return type numpy.float64
Notes
This implementation is done after the publication by Cressie and Hawkins from 1980 [3]:

2\gamma(h) = \frac{\left( \frac{1}{N(h)} \sum_{i=1}^{N(h)} |x|^{0.5} \right)^4}{0.457 + \frac{0.494}{N(h)} + \frac{0.045}{N^2(h)}}
with:
x = Z(x_i) - Z(x_{i+h})
References
Dowd
skgstat.estimators.dowd(x)
Dowd semi-variance
Calculates the Dowd semi-variance from an array of pairwise differences. Returns the semi-variance for the
whole array. In case a semi-variance is needed for multiple groups, this function has to be mapped on each
group. That is the typical use case in geostatistics.
Parameters x (numpy.ndarray) – Array of pairwise differences. These values should be the
distances between pairwise observations in value space. If xi and x[i+h] fall into the h separating
distance class, x should contain abs(xi - x[i+h]) as an element.
Returns
Return type numpy.float64
[3] Cressie, N., and D. Hawkins (1980): Robust estimation of the variogram. Math. Geol., 12, 115-125.
Notes
The Dowd estimator is based on the median of all pairwise differences in each lag class and is therefore robust
to extreme values at the cost of variability. This implementation follows Dowd’s publication [4]:

2\gamma(h) = 2.198 \cdot \mathrm{median}(x)^2
with:
x = Z(x_i) - Z(x_{i+h})
References
Genton
skgstat.estimators.genton()
Genton robust semi-variance estimator
Return the Genton semi-variance of the given sample x. Genton is a highly robust variogram estimator that is
designed to be location free and robust to extreme values in x. Genton is based on calculating kth order statistics
and will, for large data sets, be close or equal to the 25% quartile of all ordered point pairs in X.
Parameters x (numpy.ndarray) – Array of pairwise differences. These values should be the
distances between pairwise observations in value space. If xi and x[i+h] fall into the h separating
distance class, x should contain abs(xi - x[i+h]) as an element.
Returns
Return type numpy.float64
Notes
The Genton estimator is described in great detail in the original publication [5]. It is based on the kth order
statistic of all q pairwise differences, with

k = \binom{\lfloor N_h / 2 \rfloor + 1}{2}

and

q = \binom{N_h}{2}

where k is the kth quantile of all q point pairs. For large N, (k/q) will be close to 0.25. For N >=
500, (k/q) is close to 0.25 by two decimals and is therefore set to 0.25, and the two binomial
coefficients k, q are not calculated.
[4] Dowd, P. A. (1984): The variogram and kriging: Robust and resistant estimators. In: Geostatistics for Natural Resources Characterization.
References
Shannon Entropy
skgstat.estimators.entropy(x, bins=None)
Shannon Entropy estimator
Calculates the Shannon Entropy H as a variogram estimator. It is highly recommended to calculate the bins and
explicitly set them as a list. In case this function is called for more than one lag class in a variogram, setting
bins to None would result in different bin edges in each lag class. This would be very difficult to interpret.
Parameters
• x (numpy.ndarray) – Array of pairwise differences. These values should be the dis-
tances between pairwise observations in value space. If xi and x[i+h] fall into the h separat-
ing distance class, x should contain abs(xi - x[i+h]) as an element.
• bins (int, list, str) – List of the bin edges used to calculate the empirical distribution
of x. If bins is a list, these values are used directly. In case bins is an integer, as many
equal-width bins will be calculated between the minimum and maximum value of x. In case
bins is a string, it will be passed as the bins argument to the numpy.histogram function.
Returns entropy – Shannon entropy of the given pairwise differences.
Return type numpy.float64
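A usage sketch with explicit bin edges shared across lag classes (the data is made up):

import numpy as np
from skgstat.estimators import entropy

diffs = np.abs(np.random.normal(0, 2, 500))
edges = np.linspace(0, 8, 11)   # one common set of bin edges
print(entropy(diffs, bins=edges))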
MinMax
Warning: This is an experimental semi-variance estimator. It is heavily influenced by extreme values and outliers.
That behaviour is usually not desired in geostatistics.
skgstat.estimators.minmax(x)
Minimum - Maximum Estimator
Returns the difference of the maximum and minimum pairwise differences, normalized by the mean. MinMax
will be very sensitive to extreme values.
Only use this estimator if you know what you are doing. It is experimental and might change its behaviour
in a future version.
Parameters x (numpy.ndarray) – Array of pairwise differences. These values should be the
distances between pairwise observations in value space. If xi and x[i+h] fall into the h separating
distance class, x should contain abs(xi - x[i+h]) as an element.
Returns
Return type numpy.float64
Percentile
Warning: This is an experimental semi-variance estimator. It uses just a percentile of the given pairwise differ-
ences and does not bear any information about their variance.
skgstat.estimators.percentile(x, p=50)
Percentile estimator
Returns a given percentile as semi-variance. Only use this estimator if you know what you are doing. It is
experimental and might change its behaviour in a future version.
Parameters
• x (numpy.ndarray) – Array of pairwise differences. These values should be the dis-
tances between pairwise observations in value space. If xi and x[i+h] fall into the h separat-
ing distance class, x should contain abs(xi - x[i+h]) as an element.
• p (int) – Desired percentile. Should be given as whole numbers 0 < p < 100.
Returns
Return type np.float64
Scikit-GStat implements different theoretical variogram functions. These model functions expect a single lag value or
an array of lag values as input data. Each function has at least a parameter a for the effective range and a parameter c0
for the sill. The nugget parameter b is optional and will be set to b := 0 if not given.
Spherical model
Notes
\gamma = b + C_0 \cdot \left( 1.5 \cdot \frac{h}{r} - 0.5 \cdot \frac{h^3}{r^3} \right)

if h < r, and

\gamma = b + C_0
else. r is the effective range, which is in case of the spherical variogram just a.
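A short evaluation sketch of the spherical model, assuming it accepts an array of lags (parameter values are made up):

import numpy as np
from skgstat import models

h = np.linspace(0, 60, 100)
gamma = models.spherical(h, 40.0, 10.0)   # r=40, c0=10, nugget b defaults to 0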
References
Exponential model
Notes
a is the range parameter that can be calculated from the effective range r as: a = r / 3.
[6] Burgess, T. M., & Webster, R. (1980). Optimal interpolation and isarithmic mapping of soil properties. I. The semi-variogram and punctual kriging.
References
Gaussian model
Notes
a is the range parameter that can be calculated from the effective range r as:

a = \frac{r}{2}
References
Cubic model
• r (float) – The effective range. Note this is not the range parameter! However, for the
cubic variogram the range and effective range are the same.
• c0 (float) – The sill of the variogram, where it will flatten out. The function will not
return a value higher than C0 + b.
• b (float) – The nugget of the variogram. This is the value of the dependent variable at a
distance of zero. This is usually attributed to non-spatial variance.
Returns gamma – Unlike in most variogram function formulas, which define the function for 2 * 𝛾,
this function will return 𝛾 only.
Return type numpy.float64
Notes
\gamma = b + c_0 \cdot \left[ 7 \cdot \frac{h^2}{a^2} - \frac{35}{4} \cdot \frac{h^3}{a^3} + \frac{7}{2} \cdot \frac{h^5}{a^5} - \frac{3}{4} \cdot \frac{h^7}{a^7} \right]
a is the range parameter. For the cubic function, the effective range and range parameter are the same.
Stable model
Notes
a is the range parameter and is calculated from the effective range r as:
a = \frac{r}{3^{1/s}}
Matérn model
Notes
2.7 Changelog
• [OrdinaryKriging]: widely enhanced the class in terms of performance, code coverage and handling.
– added mode property: The class can derive exact solutions or estimate the kriging matrix for high perfor-
mance gains
– multiprocessing is supported now
– the solver property can be used to choose from three different solvers for the kriging matrix.
• [OrdinaryKriging]: calculates the kriging variance along with the estimation itself. After a call to
OrdinaryKriging.transform, the kriging variance can be accessed through the OrdinaryKriging.sigma
attribute.
• added SpaceTimeVariogram for calculating dispersion functions depending on a space and a time lag.
• [severe bug] A severe bug in Variogram.__vdiff_indexer was found and fixed. The iterator
was indexing the Variogram._diff array differently from Variogram.distance. This led to wrong
semi-variance values for all versions > 0.1.8! Fixed now.
• [Variogram] added unit tests for parameter setting
• [Variogram] fixed fit_sigma setting of 'exp': changed the formula from e^{1/x} to 1 - e^{1/x} in order to
increase with distance and, thus, give less weight to distant lag classes during fitting.