Applying Machine Learning Methods To Predict Geology Using Soil Sample Geochemistry
A R T I C L E  I N F O

Keywords:
Geological mapping
Soil geochemistry
Data science
Machine learning
Sampling method
Multiple classifier system

A B S T R A C T

In this study we compared various machine learning techniques that used soil geochemistry to aid in geologic mapping. We tested six different sampling methods (undersample, oversample, Synthetic Minority Oversampling Technique (SMOTE), Adaptive Synthetic Sampling (ADASYN), SMOTE and Edited Nearest Neighbor (SMOTEENN), and SMOTE and Tomek links (SMOTETomek)). SMOTE performed best, with ADASYN and SMOTETomek having slightly lower effectiveness. Nine machine learning algorithms (naïve Bayes, logistic regression, quadratic discriminant analysis, nearest neighbors, radial basis function support-vector machine, artificial neural network, random forest, AdaBoost classifier, and gradient boosting classifier) were compared, and the AdaBoost and gradient boosting classifiers were found to be most effective. Finally, we experimented with multiple classifier systems (MCS), testing different combinations of algorithms and various combinatorial functions. It was found that MCS can outperform individual models, and the best MCS combined nearest neighbors, radial basis function support-vector machine, artificial neural network, random forest, AdaBoost, and gradient boosting classifiers, then applied a logistic regression to the probabilities output by the models. Ultimately, we created a tool that is able to adequately predict underlying geology in the study area using soil geochemistry.
* Corresponding author. Department of Geological Sciences, Stanford University, Stanford, CA, 94305, USA.
E-mail address: [email protected] (T.C.C. Lui).
https://doi.org/10.1016/j.acags.2022.100094
Received 4 February 2021; Received in revised form 20 July 2022; Accepted 3 August 2022
Available online 11 August 2022
2590-1974/© 2022 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
T.C.C. Lui et al. Applied Computing and Geosciences 16 (2022) 100094
2. Study area

Our study location is at the Klaza deposit located in the southern tip of the Dawson Range, southcentral Yukon (Fig. 1). Basement rocks comprise Devonian and older metamorphosed volcaniclastic and sedimentary rocks of the parautochthonous Yukon-Tanana terrane (YTT). Locally at Klaza, the YTT comprises the Finlayson assemblage, Simpson Range suite, and the Snowcap assemblage. The Snowcap assemblage is the oldest assemblage in the YTT and mainly comprises calc-silicate rocks, marble, carbonaceous schist, quartzite, and psammitic schist. The Snowcap assemblage is intruded by tonalite-granodiorite arc rocks of the Devonian-Mississippian-age Finlayson assemblage (Piercey and Colpron, 2009; Piercey et al., 2006). The Simpson Range suite is predominantly comprised of biotite-hornblende granitic to granodioritic rocks of Late Devonian to Early Carboniferous age, which cut the prior volcano-sedimentary assemblages (Piercey and Murphy, 2000).

The YTT accreted to the Laurentian margin between the Late Triassic and Early Jurassic during east-dipping subduction (Allan et al., 2013; Colpron et al., 2006; Nelson et al., 2006, 2013; Piercey et al., 2006). The regionally extensive Long Lake suite and Minto suite granodiorite to monzonite plutons were emplaced during this subduction event. East-dipping subduction resumed in the mid-Cretaceous, leading to the emplacement of the Whitehorse plutonic suite (112–98 Ma) and the coeval Mount Nansen andesitic volcanic package (115–110 Ma) (Klöcking et al., 2016). The regionally extensive Dawson Range batholith consists of multiphase granodiorite to monzogranite of the Whitehorse plutonic suite. Continued east-dipping subduction in the Late Cretaceous led to the emplacement of the Casino plutonic suite (78–74 Ma), which is composed of rhyolite to dacite dikes and plugs (Allan et al., 2013; Nelson et al., 2013; Sack et al., 2021). The Casino plutonic suite is less voluminous than the Whitehorse suite, although it occurs regionally throughout the Dawson Range Gold Belt in the form of porphyritic dykes and stocks.

3. Methods

3.1. Data acquisition

Soil samples have been collected from 1986 to 2017 on the property. Soil was retrieved between 30 and 80 cm deep using hand-held augers. Collection was done in grids, and location was recorded using a hand-held Global Positioning System (GPS) or, prior to widespread, precise GPS availability, with hip chain and compass bearing to a grid baseline. These samples were dried and then screened so only fine-grained material was used, which is considered to be more geochemically representative than coarse-grained material (Grunsky and de Caritat, 2019).

Table 1
Sampling methods tested and descriptions.

Sampling Method | Description
No change | No sampling method is used. Imbalance is maintained.
Random Undersampler | Randomly and without replacement, select from the majority classes down to the size of the smallest class (Batista et al., 2004).
Random Oversampler | Randomly duplicate, with replacement, samples from minority classes up to the size of the largest class (Batista et al., 2004).
Synthetic Minority Oversampling Technique (SMOTE) | SMOTE is an algorithm that oversamples by synthetically creating data points. Apply SMOTE to all minority classes up to the size of the largest class (Batista et al., 2004).
Adaptive Synthetic Sampling (ADASYN) | ADASYN is a distinct algorithm that synthetically oversamples. Apply ADASYN to all minority classes up to the size of the largest class (He et al., 2008).
SMOTE and Edited Nearest Neighbor (SMOTEENN) | Edited Nearest Neighbor (ENN) is a cleaning technique that removes samples based on dissimilarity to nearby samples. First apply SMOTE to all minority classes up to the size of the largest class, then use ENN to remove potentially inadequate samples (Batista et al., 2004).
SMOTE and Tomek links (SMOTETomek) | Tomek links is a cleaning technique that removes samples based on distances between dissimilar samples. First apply SMOTE to all minority classes up to the size of the largest class, then use Tomek links to remove potentially inadequate samples (Batista et al., 2004).
Fig. 1. Location of study area, Klaza Property, Yukon, Canada. Georeferenced to UTM zone 8N projection using the WGS-84 datum. Each dot represents a soil sample
within the study area. The inset map shows the relative position of the study area in relation to provinces in northwestern Canada.
Then, they were analyzed in laboratories using a combination of fire assay fusion, atomic absorption spectrometry, aqua regia digestion, and inductively coupled plasma-atomic emission spectrometry. It is important to note that geochemical analyses are not able to perfectly capture the complete geochemical picture. For example, different acid digestion methods may not be able to dissolve certain minerals (Balaram and Subramanyam, 2022; Grunsky and de Caritat, 2019). However, these techniques are industry standards and incomplete data will always be present, so we are also testing the effectiveness of these methods despite the imperfect picture. Ultimately our database contains 11 832 soil samples with their respective GPS location and 51 elemental abundances.

We obtained a High-Resolution Digital Elevation Model (HRDEM) from Natural Resources Canada, which provided us with elevation data and allowed creation of slope and aspect maps of the region. We additionally obtained the most recent geologic map for the study area from the Yukon Geological Survey. The mapping took 40 days and was conducted in 2019 and 2020. This map compiles various geologic maps published by geological surveys as well as by geologic companies, and additionally considers results from geochronological data (Sack et al., 2021). Using these we spatially combined the three sources of data, creating a data file that included 11 832 soil samples, their chemical abundances for 51 elements, the elevation, slope, and aspect at those points, and the corresponding geological unit.

3.2. Data pre-processing

The data were then checked for robustness, and it was determined that some of the data was inadequate and not all of it useable. Data collected before 2010 was discarded since analytical techniques were older and detection limits were different, hence all data we used was collected between 2010 and 2017. We applied an additive log ratio transform to the data using Ti as a divisor. Some elements (B, Ce, Cs, Ge, Hf, Hg, In, Li, Nb, Rb, Re, Se, Sn, Ta, Te, Tl, W, Y, and Zr) which were rarely reported or had very low variance were also removed. Those removed due to low variance were at least 75% the exact same value, and this arose due to the detection limits, hence we wanted to avoid an artifact from the data collection process. Additionally, only one feature was kept when features were highly correlated, which we defined as having a Pearson correlation coefficient greater than 0.9 or less than −0.9. Ca, Co, Cr, and Fe were dropped for these reasons. When geochemical values were missing, these were imputed with the lowest concentration value for the element for the year in which it was collected.

To avoid soil samples that might be labelled with the wrong geology due to local errors in position of geologic contacts, we created a 50 m buffer in both directions from contacts between geologic units. Soil samples within the buffer were withdrawn from the dataset. This removed some of the uncertainty related to inferring geologic unit contacts, as well as that related to the mixing of soils between geologic units. The distribution of soil samples per geologic unit can be found in Table 3. Geologic units PDS, MSR, and LKP only had 20, 17, and 8 soil samples respectively, so we removed them from the dataset since there was not sufficient data to effectively train and test on. In total, 6605 soil samples and 23 elemental abundances were discarded. This resulted in what we called our full dataset, comprising 5227 soil samples and 31 features (concentrations of Ag, Al, As, Au, Ba, Be, Bi, Cd, Cu, Ga, K, La, Mg, Mn, Mo, Na, Ni, P, Pb, S, Sb, Sc, Sr, Th, Ti, U, V, and Zn, elevation, slope, and aspect).

Table 3
Number of soil samples available for each geologic unit.

Geologic Unit | Abbreviation | Number of Samples
Whitehorse Suite | mKW | 3194
Mount Nansen Group | mKN | 933
Minto Suite | LTrEJM | 518
Casino Suite | LKC | 370
Finlayson Assemblage | DMF | 212
Snowcap Assemblage | PDS | 20
Simpson Range Suite | MSR | 17
Prospector Mountain | LKP | 8

Additionally, we created three polygons, and samples within these formed an external testing set. This allowed us to test the performance of our models in an area that was not part of the train/test split. These external testing sets are far from each other, are on the edges of soil sample clusters to minimize spatial correlation with samples in the training set, and are near geologic contacts to test model performance around contacts. They cover four out of five of the labels in this study, since the excluded label, DMF, is all within one spatial cluster with no nearby contacts. 298 samples were in these external testing polygons, hence the dataset we used to train and test our methods was composed of 4929 soil samples and 31 features.

3.3. Machine learning methods

Analyses were done using Python 3.7.3 and packages scikit-learn 0.23 (Pedregosa et al., 2011) and imblearn 0.5 (Lemaitre et al., 2017). First, we applied a centered log-ratio transform to the geochemical data to avoid issues with spurious correlations when working with compositional data (Aitchison, 1982). In contrast to other log-ratio transforms, the centered log-ratio does not drop a feature in the process, making the transformed data easier to interpret (Grunsky et al., 2017). Before training the algorithms, the dataset was divided by applying an 80–20 train-test split. Additionally, we stratified it, which means there was an equal proportion of imbalanced class samples in the training and testing set. The training set was scaled and normalized using a z-score, and then the normalization parameters were applied to the testing set. A 5-fold cross-validation loop was applied during training, which further increased the testing rigor and reduced overfitting. For hyperparameter selection we performed a randomized search with 100 iterations over the hyperparameter space. Numerical hyperparameters were tested in exponential increments. Fig. 2 visualizes the data division processes and Table 4 shows the hyperparameters tested.

Fig. 2. Flowchart summarizing division of data and purposes. This data division process was repeated with 10 random seeds to account for randomness.

Table 4
Hyperparameters tested.

Algorithm | Hyperparameters
NB | Variance smoothing
LR | Penalty [l1, l2], C
QDA | Tolerance, regularization parameter
NN | Number of neighbors, weights [uniform, distance], algorithm [ball tree, kd tree], leaf size
RBFSVM | C, gamma, tolerance
ANN | Hidden layer sizes, alpha, activation function [identity, logistic, tanh, relu]
RF | Max depth, minimum samples to split, number of estimators, criterion [gini, entropy]
AB | Learning rate, number of estimators, base estimators [various decision tree classifiers]
GB | Number of estimators, learning rate, minimum samples to split, minimum samples in leaf, maximum depth

To compare results, models were evaluated using F1 micro scores and F1 macro scores. The F1 score is a metric that evaluates the performance of a model and takes into account the number of true positives, true negatives, false positives, and false negatives predicted. It is the harmonic mean of precision and recall. An F1 score was calculated for all classes. The F1 macro score gets the F1 score for each class and then averages the F1 score across all classes, weighing each class equally regardless of how many samples are in each class (Opitz and Burst, 2019). In contrast, the F1 micro score does not weigh classes equally. A benefit of the F1 macro score is that all geologic units are of equal importance regardless of how many soil samples were collected within them, while the F1 micro score will inform about the performance of the model over the whole study area.

Since the dataset is imbalanced, as seen in Table 3, we explored what sampling method best dealt with the imbalance problem. Six sampling techniques were tested. They are described in Table 1. We reran the code with ten random seeds. After comparing the performance of the sampling methods, it appeared that SMOTE was the best technique, so we used it for the rest of the study.

Next, to test which supervised machine learning algorithm is best for this problem, we ran nine different machine learning algorithms: logistic regression (LR), quadratic discriminant analysis (QDA), nearest neighbors (NN), radial basis function support-vector machine (RBFSVM), naïve Bayes (NB), artificial neural network (ANN), random forest (RF), AdaBoost classifier (AB), and gradient boosting classifier (GB). Simple descriptions and references can be found in Table 2. We ran code that used SMOTE as its sampling method and tested the performance of the nine distinct algorithms. This code was rerun with ten random seeds.

Table 5
Combinatorial functions tested.

Combinatorial Function | Description
SUM | Adds the probabilities of the models. Prediction is the geologic unit with the highest summed probability.
A-LR | Trains a logistic regression on the probabilities of the training set.
B-LR | Applies SMOTE to the probabilities of the training set, then trains a logistic regression on them.
C-LR | Scales the probabilities of the training set with a z-score, then trains a logistic regression.
D-LR | Scales the probabilities of the training set with a z-score, then applies SMOTE, then trains a logistic regression on them.

Additionally, we tested whether a multiple classifier system could be useful. Multiple classifier systems ensemble distinct models and form predictions that combine the results of the independent models (Ranawana and Palade, 2006). We tested multiple combinatorial functions, described in Table 5. We also tested different groups of algorithms to
combine, as described in Table 6. We then tested all the combinations between combinatorial functions and groups of algorithms, and reran the process with 10 random seeds.

Table 6
Groups of algorithms tested.

MCS Name | Algorithms Combined | Reasoning
MCS2 | RBFSVM, GB | Top two algorithms with no repetition of decision tree ensemble methods
MCS3 | RBFSVM, ANN, GB | Top three algorithms with no repetition of decision tree ensemble methods
MCS5 | RBFSVM, ANN, RF, AB, GB | Top five algorithms.
MCS6 | NN, RBFSVM, ANN, RF, AB, GB | Top six algorithms.
MCS7 | QDA, NN, RBFSVM, ANN, RF, AB, GB | Top seven algorithms.
MCS8 | LR, QDA, NN, RBFSVM, ANN, RF, AB, GB | Top eight algorithms.
MCS9 | NB, LR, QDA, NN, RBFSVM, ANN, RF, AB, GB | All algorithms

4. Results

Fig. 3 compares the performance of the different sampling methods on the 9 models tested. Additionally, Fig. 4 compares the per-seed difference between using sampling methods and not. Since the code was rerun using multiple random seeds, it is important to compare differences in performance within the same random seed so that algorithms are compared under the same training and testing sets. Looking at Fig. 3a first, for all models the Random Undersampler performed significantly worse and had on average 0.10 lower F1 micro score than not using a sampling method. Decision tree based methods (RF, AB, and GB) all tended to perform better with oversampling-style techniques. Fig. 4a shows that Random Oversampler performed 0.010 ± 0.009 better in RF when compared to not using a sampling method. SMOTE performed 0.030 ± 0.010 better in RF, 0.019 ± 0.007 in AB, and 0.006 ± 0.006 in GB. ADASYN performed 0.030 ± 0.008 better in RF and 0.021 ± 0.007 in AB. Finally, using SMOTETomek led to an increase in F1 micro score of 0.031 ± 0.010 in RF, and 0.021 ± 0.007 in AB.

Fig. 3b provides an alternate perspective on model performance where all geologic units are weighted equally instead of each being weighted by abundance of analyses. RF, AB, and GB all greatly improve F1 macro score when oversampling-style techniques are used. Notably, when compared to not using a sampling method, SMOTE increases RF's F1 macro score by 0.080 ± 0.020, AB by 0.035 ± 0.012, and GB by 0.020 ± 0.012; ADASYN increases RF by 0.079 ± 0.018, AB by 0.039 ± 0.013, and GB by 0.016 ± 0.013; and SMOTETomek increases RF by 0.080 ± 0.022, AB by 0.039 ± 0.012, and GB by 0.020 ± 0.013 (Fig. 4b).

Fig. 3. Comparison of performance of different sampling methods per algorithm, (a) using F1 micro score, (b) using F1 macro score. Error bars represent one standard deviation after rerunning using ten random seeds.

The performance of the 9 different machine learning algorithms varied significantly (Fig. 5). We can also compare each pair of algorithms and their performances within the same seed. Fig. 6 shows a heatmap comparing the average differences in performance in F1 micro score between all algorithms. The best performing models were AB and GB, with F1 micro scores of 0.946 ± 0.007 and 0.943 ± 0.007 respectively. RF, RBFSVM, and ANN were slightly lower but still comparable at 0.935 ± 0.006, 0.932 ± 0.009, and 0.922 ± 0.008 respectively. Observing Fig. 6, we can see that AB and GB perform significantly better than these algorithms. Performing worse were NN at 0.887 ± 0.008, QDA at 0.843 ± 0.011, and LR at 0.794 ± 0.011. NB performed worst, with an F1 micro score of 0.693 ± 0.009.

Fig. 7 shows a comparison of performance between the MCS we tested. Since the code was rerun using multiple random seeds, Fig. 8 compares the differences in performance within the same random seed between the various MCS architectures and GB, the best individual algorithm. Information about MCS number and architecture is found in Tables 5 and 6. We can see that MCS can outperform individual models. All MCS except for A-LRMCS2, C-LRMCS2, and SUMMCS9 were able to significantly outperform GB (Fig. 8). The highest improvement came from D-LRMCS6, with an F1 micro score of 0.956 ± 0.006, which was on average 0.013 ± 0.005 above GB. Our results suggest that the oversampling step improves the performance of the LRMCS. The two methods that did not use SMOTE, A-LRMCS and C-LRMCS, consistently perform more poorly than their SMOTE counterparts, B-LRMCS and D-LRMCS. We can also observe some general trends regarding MCS behavior. As MCS number increases, SUMMCS and LRMCS tend to improve in performance. After MCS number 6, performance decreases as MCS number increases, and SUMMCS decreases at a faster rate than LRMCS.

To help visualize the effectiveness of the predictions made by D-LRMCS6, the best performing classifier, we plotted the predictions for each soil sample on top of the geologic map (Fig. 9). Additionally, Fig. 10 shows the various predicted probabilities for each geologic unit, which helps quantify how confident the model was regarding its predictions. This figure shows random seed 15, with an F1 micro score of 0.955, which was the seed with performance closest to the average. Table 7 shows a confusion matrix for that seed's testing set. Table 8 shows a confusion matrix for the external testing sets. The F1 micro score for the external testing set was 0.507. Overall, the model predicted well, and many simple boundaries were clear. The algorithm predicted the Finlayson Assemblage and Whitehorse Suite very well, while Casino Suite had the worst performance. Casino Suite was mostly incorrectly predicted as Minto Suite and Whitehorse Suite. Nothing was misclassified as Finlayson Assemblage.

5. Discussion

5.1. Comparing sampling methods

Our results provide insights into the differences in performance between various sampling methods. Figs. 3 and 4 show that Random Undersampler is not a good technique to use when predicting underlying geology from soil sample geochemistry, and it is better to simply keep the imbalance, as with the "no change" method. For all models, performance was significantly lower than "no change", with scores far outside the error bars (on average 0.10 lower F1 micro score than "no change"). For decision tree ensembling models we see that oversampling-style techniques are useful, with SMOTE, ADASYN, and SMOTETomek working the best. These techniques help increase F1 micro score, but importantly they also considerably increase F1 macro score. The oversampling algorithms predict the less abundant units better, hence the low increase in F1 micro score but higher increase in F1 macro score. For our next highest performing models, RBFSVM and ANN, these oversampling techniques did not significantly affect the model's performance. Since three out of the five best models improve significantly due to SMOTE (RF's F1 micro score increased by 0.030 ± 0.010, AB by 0.019 ± 0.007; RF's F1 macro score increased by 0.080 ± 0.020, AB by 0.035 ± 0.012, GB by 0.020 ± 0.012), and the other two remain statistically similar (F1 micro score difference between SMOTE and no sampling method was −0.004 ± 0.006 for RBFSVM and −0.007 ± 0.010 for ANN), SMOTE was thus decided to be the best sampling method for this project. ADASYN and SMOTETomek were also suitable sampling methods, but given the increased complexity and runtime of the method and the low payoffs, SMOTE was chosen instead. Overall, we see the value sampling methods could provide based on the algorithm used, and suggest that SMOTE is most effective for similar studies (with ADASYN and SMOTETomek being good alternatives as well).

5.2. Machine learning algorithms compared

Using our previously suggested sampling method, SMOTE, we compared the performance of various algorithms in Figs. 5 and 6. RF, AB, and GB performed best, showing that ensemble decision tree methods tend to be effective algorithms for similar projects. There is no significant difference between the performance of AB (0.946 ± 0.007) and GB (0.943 ± 0.007). Both models performed significantly better than RF (by 0.011 ± 0.005 and 0.008 ± 0.007 respectively). Hence, our results suggest that using the more advanced ensemble decision tree methods can improve performance when compared to RF. The runtime of AB is only about 1.5 times longer than RF, while GB required about 5.5 times longer, and given similar performances AB might be preferred over GB. RBFSVM (0.932 ± 0.009) and ANN (0.922 ± 0.008) also performed well, but not as well as AB and GB. The remaining four algorithms performed significantly lower (NN 0.887 ± 0.008, QDA 0.843 ± 0.011, LR 0.794 ± 0.011, NB 0.693 ± 0.009). For comparison, a trivial classifier has been added which always predicts the most abundant class in the training set; its F1 micro score is 0.641. All models at least perform better than the trivial classifier, showing that the soil dataset is informative about the underlying rock.

Our results support the notion that support vector machines, neural networks, and random forests are strong algorithms. Other comparison studies that predicted geologic mapping suggested random forests was the best algorithm (Cracknell and Reading, 2014). Our results support the notion that random forests is a strong algorithm for these types of projects, and provide further evidence that more complex decision tree ensemble techniques like GB and AB are effective.

5.3. Performance of multiple classifier systems

Our results showed that MCS can perform better than one of our best singular algorithms, GB (Figs. 7 and 8). B-LRMCS and D-LRMCS applied SMOTE first to balance the data, while A-LRMCS and C-LRMCS did not. Given that B-LR and D-LR outperformed A-LR and C-LR, it is evident that applying a sampling method such as SMOTE at this stage increases the performance of the MCS. A-LR and B-LR did not scale probabilities before running the logistic regression, while C-LR and D-LR did. Differences within these groups are minimal and within each other's error bars, so it is unlikely that scaling at this stage will significantly affect the classifiers.
Fig. 4. Average difference of performance between using a sampling method versus not for each seed, (a) using F1 micro score, (b) using F1 macro score. Error bars
represent one standard deviation after rerunning using ten random seeds.
Observing the MCS number (the number of individual classifiers included in the multiple classifier system), we can find trends that inform us about the performance of our models. As MCS number increased, MCS performance tended to improve until MCS number 6. After this point, larger MCS numbers decreased model performance, and SUMMCS was affected more drastically than LRMCS. This suggests that the quality of the models combined is more important than the quantity of models combined. When MCS number is low, the increase in different algorithms and perspectives provides a positive impact on MCS performance, but once we start adding poorly performing models like LR and NB, these lower-quality models do not add to the predictive ability and have a negative impact on performance. SUMMCS is more susceptible to this negative effect since it treats all models equally and simply adds probabilities. This is seen since the SUMMCS score decreases drastically compared to the LRMCS after MCS number 6. These LRMCS successfully use a logistic regression to detect which algorithms are working best and minimize the impact of poorly performing algorithms.

In this study, D-LRMCS6 was the best performing model (0.956 ± 0.006) and outperformed one of the best individual models, GB (0.943 ± 0.007). However, the F1 micro score increase was relatively small, the run time was longer, and the increase in complexity is large. Five additional models needed to be trained to create D-LRMCS6, and training D-LRMCS6 takes about twice as long as only training GB. Nonetheless, increasing performance when performance is already high is difficult. Therefore, whether it is better to use GB or create the MCS models will depend on the needs of the user.

Fig. 5. Comparison of performance of 9 algorithms. Trivial classifier always predicts points as the most abundant geologic unit in the training set. Error bars represent one standard deviation after rerunning using ten random seeds.

Fig. 6. Heatmap comparing average algorithm F1 micro score difference between two algorithms. Uncertainty ranges represent one standard deviation after rerunning using ten random seeds.

Fig. 7. Comparison of performance of various MCS. MCS number represents the number of algorithms combined, which are explained in more detail in Table 6. Each bar represents a different MCS architecture explained in Table 5. GB is plotted for comparison since it is the best performing independent algorithm. Error bars represent one standard deviation after rerunning using ten random seeds.

Fig. 8. Average difference of performance between MCS architectures and GB for each seed. Error bars represent one standard deviation after rerunning using ten random seeds.

5.4. External testing set performance

Predictions of one seed of D-LRMCS6 are mapped in Figs. 9 and 10, which aid in the visualization of the performance of our techniques. In general, the classifier effectively predicted the underlying geology, with a testing set F1 micro score of 0.955. Through detailed investigation of the classifier's performance some observations can be made. Casino Suite was the most poorly predicted unit. It tended to be predicted as Whitehorse Suite and Minto Suite, likely because the lithologies are
similar. Casino Suite is made up of rhyolite to dacite which is very amphibolites which are significantly geochemically different to the
compositionally similar to Whitehorse Suite and Minto Suite which are igneous rocks that dominate the study area. For example, samples from
both hornblende granodiorites. The similar chemical compositions of this geologic unit tended to have significantly higher signals in Mg, Ni,
the rocks could explain why the chemistry of the soil, which formed as Cr, and Cu, and lower signals in Ca, P, S, and Na, when compared to the
products of their weathering, is not effective at distinguishing between other units. These differences were easily picked up by our algorithms.
the three lithologies. The model might have performed better if it tried Performance within the external testing set was lower than for the
to predict by lithological type and these units were grouped together, internal testing set. Note, however, that some important boundaries are
however, Casino suite is related to gold mineralization so we wanted to being detected by our models (Figs. 9 and 10). In inset A, we can see that
see if this unit could be identified and distinguished from similar rock. In the model predicts the green, mKN polygon and boundary well, despite
contrast, our model best classifies samples from the Finlayson Assem there being no nearby mKN points that were part of the training set.
blage. This is likely because the geologic unit is composed of Additionally, inset B shows a complex external testing set with many
9
T.C.C. Lui et al. Applied Computing and Geosciences 16 (2022) 100094
Fig. 9. Map comparing predictions made by D-LRMCS6 with the geologic map for the Klaza Property. The F1 micro score was 0.962 on the full data set, 0.955 on the internal testing set, and 0.507 on the external testing set. We chose to plot the random seed with performance closest to average. Each dot represents the position of a soil sample, and the color of the dot indicates the geologic unit predicted for that soil sample by D-LRMCS6. These dots lie on top of polygons from the geologic map. External testing set polygons are outlined in the map and shown in more detail in the insets. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)
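For single-label multiclass prediction, the F1 micro score reduces to overall accuracy: the sum of the confusion-matrix diagonal divided by the total number of samples. A short check of this, using the confusion matrix reported in Table 7 (assuming numpy), which also shows why a uniform random guesser over five classes is expected to score about 0.2:

```python
import numpy as np

# Confusion matrix from Table 7 (rows: actual, columns: predicted;
# class order DMF, LKC, LTrEJM, mKN, mKW).
cm = np.array([
    [37,  0,  0,   0,   5],
    [ 0, 41,  4,   2,  14],
    [ 0,  0, 92,   3,   3],
    [ 0,  1,  0, 145,   7],
    [ 0,  2,  1,   8, 621],
])

# Micro-precision and micro-recall both equal trace / total, so
# micro F1 equals accuracy for single-label multiclass problems.
micro_f1 = np.trace(cm) / cm.sum()
print(f"micro F1: {micro_f1:.3f}")  # ~0.949 for this seed

# A classifier guessing uniformly at random among k classes has an
# expected accuracy, hence expected micro F1, of 1/k.
k = cm.shape[0]
print(f"random-guess baseline: {1 / k:.1f}")  # 0.2
```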
geologic units and convoluted boundaries. Our model predicts the green mKN points fairly well, and those that it ultimately gets wrong tend to have a 0.1–0.5 probability of being mKN (Fig. 10d). It is also able to predict the boundary with the red LKC unit (Figs. 9 and 10a). As mentioned before, Casino Suite (LKC) is geochemically similar to Minto Suite (LTrEJM) and Whitehorse Suite (mKW). This could explain why the northeastern part of inset B is predicted more poorly, with samples often misclassified between these three units.
The F1 micro score for the external testing set was 0.507, while the internal testing set F1 micro score was 0.955. For comparison, we can consider an alternate form of the trivial classifier. Instead of always predicting the most abundant class in the training set (which would be inadequate since the external testing set is imbalanced), we can compare performance to a classifier that classifies through random selection, which would have an expected F1 score of 0.2. We can then see that our technique is drastically better than a trivial classifier. The difference between external and internal testing set performance is expected since data from the external testing set were spatially distant from data in the training set, and it seems our models were not able to generalize well to external soil samples. Part of the reason why internal testing set performance is so high is also that neighboring points might be part of the training set, and hence the model is likely to predict correctly due to spatial correlations. However, it is inherently difficult to predict and measure performance on a study area like this.

5.5. Additional considerations

A possible solution to the spatial correlation issue is to create polygons to split the data into regional chunks, then perform cross-validation where one region is used to test the algorithm while the other regions are used to train the algorithm. Similarly, one could do cross-validation based on east-to-west position, e.g. testing on the 10% easternmost samples, training on the 90% westernmost samples, and then repeating with other groups. Although that would help with the spatial autocorrelation, our study area is too spatially complex and imbalanced for it to work properly. There would be many spatial clusters absent of some of the geologic units, and we would not be able to train and test adequately. Our methods are able to test on areas spatially distant from our training set, cover difficult areas with geologic boundaries, and include four out of five geologic units in our study area. Although the external test set did not include any Finlayson Assemblage, our best model never improperly classified anything in the external test set as Finlayson Assemblage. Despite the external test set performance ultimately being low, our results and comparisons are still meaningful. Algorithms that performed better on the internal test set tended to also perform better on the external test set, so performance differences in the internal test set do hold value for external test set performance.
Additional sources of error include potential misclassifications in the map itself. Predicted lithologies could possibly be correct but recorded as incorrect due to slight errors where contacts have been interpreted under cover. In addition, contact areas, especially on slopes, are likely to contain contributions from multiple rock types sourced from further up-slope. This will alter the chemistry of the resulting soil, making it more difficult to predict the underlying geology. We tried to mitigate these impacts by creating the 50 m buffer on both sides of geologic contacts, but in some areas this may not have been sufficient.
The study area is inherently challenging, and geologic maps for the region have been updated multiple times throughout the past 5 years. New geologic units were introduced recently, and geochronological data
Fig. 10. Maps showing prediction probabilities made by D-LRMCS6 per geologic unit. We chose to plot the random seed with performance closest to average. Each map presents prediction probabilities for a different geologic unit. Each dot represents the position of a soil sample, and the color of the dot indicates the probability D-LRMCS6 predicted for that soil sample to be the respective geologic unit. These dots lie on top of polygons from the geologic map. External testing set polygons are outlined in the map and shown in more detail in the insets. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)
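The regional cross-validation discussed in Section 5.5, holding out one spatial chunk while training on the others, can be sketched by binning samples on easting and using grouped folds. The sketch below uses scikit-learn's GroupKFold; the coordinates, features, and labels are synthetic stand-ins, not the study's data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 10))           # stand-in geochemistry features
y = rng.integers(0, 3, size=n)         # stand-in geologic-unit labels
easting = rng.uniform(0, 10_000, n)    # stand-in easting coordinate, metres

# Bin samples into 10 west-to-east strips; each strip is a "region"
# that is held out whole, so test samples are spatially separated from
# training samples along the easting direction.
edges = np.quantile(easting, np.linspace(0.1, 0.9, 9))
groups = np.digitize(easting, edges)   # group labels 0..9

scores = cross_val_score(GradientBoostingClassifier(n_estimators=50,
                                                    random_state=0),
                         X, y, groups=groups,
                         cv=GroupKFold(n_splits=10),
                         scoring="f1_micro")
print(f"spatial-CV micro F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```

As the section notes, this scheme breaks down when some strips contain none of a rare geologic unit, which is why it was not adopted for this study area.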
has led to remapping of some units. Were the goal simply to obtain good prediction performance, we could have grouped geologic units based on lithological rock type, hence reducing the misclassification of Casino Suite, Minto Suite, and Whitehorse Suite. However, we chose to predict down to the geologic unit because only specific geologic units are associated with mineralization (Casino Suite), and the goal was to determine if our models would be able to distinguish these from geochemically similar rock.

6. Concluding remarks

Through this study we analyzed the effectiveness of multiple machine learning models at predicting underlying geology using soil properties. We conclude that machine learning can be an effective tool
Table 7
Confusion matrix for testing set predictions of D-LRMCS6 seed 15.

                           Predicted Geologic Unit
                           DMF   LKC   LTrEJM   mKN   mKW
  Actual Geologic Unit
    DMF                     37     0        0     0     5
    LKC                      0    41        4     2    14
    LTrEJM                   0     0       92     3     3
    mKN                      0     1        0   145     7
    mKW                      0     2        1     8   621

Gregory and the NSERC CGS-M to Timothy Chee Cheng Lui.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements