Article

A Primer on Clustering of Forest Management Units for Reliable Design-Based Direct Estimates and Model-Based Small Area Estimation

by Aristeidis Georgakis 1, Demetrios Gatziolis 2,* and Georgios Stamatellos 1

1 Laboratory of Forest Biometry, School of Forestry and Natural Environment, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece
2 USDA Forest Service, Pacific Northwest Research Station, Portland, OR 97204, USA
* Author to whom correspondence should be addressed.
Submission received: 14 August 2023 / Revised: 20 September 2023 / Accepted: 29 September 2023 / Published: 4 October 2023
(This article belongs to the Section Forest Inventory, Modeling and Remote Sensing)

Abstract: This study employs clustering analysis to group forest management units using auxiliary, satellite imagery-derived height metrics and past wall-to-wall tree census data from a natural, uneven-aged forest. Initially, we conducted an exhaustive exploration to determine the optimal number of clusters k, considering a wide range of clustering schemes, indices, and two specific k ranges. The optimal k is influenced by various factors, including the minimum k considered, the selected clustering algorithm, the clustering indices used, and the auxiliary variables. Specifically, the minimum k, the Euclidean distance metric, and the clustering index were instrumental in determining the optimal cluster numbers, with algorithms exerting minimal influence. Unlike traditional validation indices, we assessed the performance of these optimally defined clusters based on direct estimates and additional criteria. Subsequently, our research introduces a twofold methodology for Small Area Estimation (SAE). The first approach focuses on aggregating forest management units at the cluster level to increase the sample size, thereby yielding reliable design-based direct estimates for key forest attributes, including growing stock volume, basal area, tree density, and mean tree height. The second approach prepares area-level data for the future application of model-based estimators, contingent on establishing a strong correlation between target and auxiliary variables. Our methodology has the potential to enhance forest inventory practices across a wide range of forests where area-level auxiliary covariates are available.

1. Introduction

Accurate forest inventories are essential for supporting sustainable forest management plans and ensuring the long-term viability of forests and their products [1,2]. Forest inventories provide baseline information, such as Growing Stock Volume (GSV), basal area, tree/stem density, and tree height, all essential biometric variables for sustainable management and planning. However, typical forest management inventories often face significant challenges in providing reliable estimates for each Forest Management Unit (FMU) because of the increased sampling effort required. Small Area Estimation (SAE) techniques have emerged to address this challenge by providing estimates for small geographic subpopulations or domains within the larger population [3]; in our study, these domains are forest areas.
SAE methods offer two primary categories of estimators: design-based direct estimators, including model-assisted ones, and indirect model-based estimators [4] (p. 45). Design-based direct estimators depend exclusively on data collected from a probability sample of plots. The Horvitz–Thompson and Simple Random Sampling (SRS) estimators are typical examples [3]. The SRS estimator uses only sample survey data to estimate subpopulation totals, means, proportions, or other parameters for small areas, and is considered design-unbiased. The accuracy of design-based direct estimates depends on factors such as sample representativeness, sample size, and plot size or sampling intensity. Design-based direct estimation is the most straightforward SAE technique, but it can be cost-ineffective because it requires a large number of field sample plots per small area of interest.
Cluster analysis is an unsupervised machine learning technique that partitions observations or auxiliary data into meaningful groups with similar characteristics. Clustering methods fall into two broad categories, hierarchical and non-hierarchical [5]. Of the non-hierarchical clustering methods, the most popular are the k-Means [6] and the PAM or k-Medoids [7] algorithms. Details on clustering methods are offered in the subsequent sections of this paper. The determination of the optimal number of clusters, denoted as k, is a crucial step in the process. An ideal clustering for SAE purposes has the following characteristics: many domains (small areas or strata), each with an adequate and nearly equal sample size, and substantial correlation with the auxiliary variables.
A prerequisite for conducting cluster analysis is the availability of auxiliary variables that can describe heterogeneity across FMUs. Suitable auxiliary variables can come from past census data, remote sensing data, abiotic characteristics such as terrain aspect and slope, or data that describe the spatial contiguity between the FMUs (e.g., FMU centroid coordinates). The advent of digital aerial photogrammetry with automated image matching [8] and of high spatial and temporal resolution satellite remote sensing [9,10] has enabled the extraction of detailed descriptors ranging from local to global scales. These technologies offer tangible benefits compared to traditional remote sensing platforms, including expeditious deployment and cost-effectiveness [11].
To enhance precision, model-assisted and model-based estimators incorporate additional auxiliary information and modeling techniques. Well-known model-based SAE estimators that incorporate random area effects (mixed models), such as the Fay–Herriot area-level models [12] and unit-level models [13], are increasingly applied to forest inventories [14,15,16,17]. Particularly suited for use after our clustering analysis, the Fay–Herriot model serves as an advanced extension of composite estimators. This model enhances the accuracy and reliability of small area estimates by integrating area-specific random effects. In contrast to synthetic estimators, which do not weight estimates, or traditional composite estimators, which merely blend direct and indirect estimates without accounting for area-specific variability, the Fay–Herriot model incorporates random effects that capture each small area's unique characteristics. These random effects are instrumental in refining the estimates, rendering them more reliable and nuanced. Within model-assisted estimation, generalized regression estimators [18] are commonly employed, particularly in national forest inventories.
Cluster analysis, also known as clustering, can support the direct estimation approach by identifying and grouping relatively similar and homogeneous objects such as FMUs (forest stands and forest compartments) into clusters (domains or small areas). This aggregation relies on area-level auxiliary data, or observations such as height metrics, and effectively increases the sampling intensity per domain. Cluster analysis has been used in various forest applications, including harvest scheduling models [19], or as a spatially explicit schedule optimization that considers accessibility and proximity to the road network [20]. Recent research uses a top-down binary space partitioning algorithm to downscale estimates from a whole set of municipalities to individual municipalities. This approach recursively divides the hexagon-based municipality list into two equally sized sets, balancing a preselected statistical precision against estimation scale and generating the smallest possible groups of domains [21]. To the best of our knowledge, clustering analysis has not previously been used in SAE for design-based direct estimates. Preliminary work [22] suggests that accurate direct estimates can be obtained by grouping forest compartments using the Partitioning Around Medoids (PAM) method, the average silhouette width, and past census data.
This study introduces an innovative cluster analysis methodology for determining the optimal number of clusters (k) for SAE purposes. The initial novelty of our research stems from its comprehensive examination of various clustering variables, algorithms, and distance metrics to optimize k values. The SAE objectives of this study are twofold. Firstly, utilizing satellite-derived height metrics and census data, we aim to group FMUs into updated “small areas” or domains with larger sample sizes, where reliable direct estimates are feasible for key forest attributes like growing stock Volume, Basal Area, Tree Density, and Mean tree Height. Subsequently, in cases with large standard errors of the direct estimates or with higher demands for precision, we perform a correlation analysis between the response and the auxiliary variables to meet one of the most important model-based SAE assumptions for future estimations.

2. Materials and Methods

2.1. Study Area

The study area is in the Pertouli University Forest of Central Greece (39.54° N, 21.50° E) (Figure 1). The area features an uneven-aged, naturally regenerated coniferous forest dominated by Abies borisii-regis (Mattf.) (hybrid fir). The forested area covers 2260 ha. An additional 1037 ha are covered by meadows or bare land. The forest is characterized by complex topography, multistoried dense canopies, and variable tree height and stem diameter distributions. Hybrid fir dominates 95% of the forest land, while the remaining 5% is split between Pinus nigra (J.F. Arnold) and mixed fir-beech forest compartments or FMUs. FMU subpopulations were identified using common forest management planning techniques which, in turn, rely on landscape features like streams, ridges, and roads. The region's climate is transitional Mediterranean–Central European, with cold, rainy winters and hot, dry summers [23].

2.2. Data

2.2.1. Sampled Data and Variables of Interest

The study area comprises 174 FMUs or compartments. The survey data consist of 252 sample units (plots) of 0.1 ha each, drawn by systematic sampling at a sampling intensity/fraction of 1%. Upon the exclusion of unmanaged FMUs lacking sample plots and those with insufficient auxiliary data, 160 FMUs with 239 sample plots remained and were subsequently used in the analysis. About half of the FMUs contain one sample plot, the remaining half contain two plots, and a few contain either three or zero plots [23]. The variables of interest (with the first letter capitalized) are growing stock Volume ($m^3\,ha^{-1}$), Basal Area ($m^2\,ha^{-1}$), stem/Tree Density (trees/ha), and (Lorey's) Mean Height $h_L$ (m) (Table 1). Lorey's mean height $h_L$ [24] is the average height weighted by basal area, computed as $h_L = \sum_{i=1}^{z} n_i g_i h_i \,/\, \sum_{i=1}^{z} n_i g_i$, where $n_i$ is the number of trees, $g_i$ is the mean basal area, and $h_i$ is the mean height of trees in the $i$-th diameter class, considering trees with diameter at breast height ≥ 7 cm. Summary statistics for the variables of interest from the field sample plots are presented in Table 1 and indicate greater variability in the total mean estimates for Tree Density compared to Volume, Basal Area, and Mean Height, as evident from the standard deviation and the relative standard error.
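As a worked illustration of the $h_L$ formula, a minimal R sketch using hypothetical diameter-class summaries (the values of $n_i$, $g_i$, and $h_i$ are invented for the example):

```r
# Hypothetical diameter-class summaries for one plot (dbh >= 7 cm):
n <- c(12, 8, 5, 2)                  # n_i: trees per diameter class
g <- c(0.010, 0.045, 0.110, 0.280)   # g_i: mean basal area per tree (m^2)
h <- c(9.5, 16.2, 22.8, 28.4)        # h_i: mean height per class (m)

# Lorey's mean height: basal-area-weighted average height
h_L <- sum(n * g * h) / sum(n * g)
h_L  # ~22.3 m
```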
Systematic sampling was selected because it is a widely used method in forest inventories, offering inherent advantages such as representative sampling, reliable population estimates, and cost-effectiveness. It ensures uniform sampling intensity and evenly distributed coverage across the surveyed forest area. Additionally, it allows for straightforward variance estimation using SRS estimators and eliminates the need for a random selection process for every sample unit, making the survey faster and more efficient [24] (pp. 318–324).

2.2.2. Clustering Variables

Given the low sampling intensity (1%) and the limited number of only 0–3 sample plots per FMU, a clustering scheme based solely on the field data would be insufficient to capture the internal heterogeneity/structure of the FMU objects. To address this, we utilized wall-to-wall auxiliary variables known to correlate with the variables of interest. Specifically, a 2 m Canopy Height Model (CHM) (Figure 2) was utilized to calculate tree height distribution moments for each FMU, including the outlier-robust L-moments [25] (Table 2). This CHM was derived by subtracting the cell elevation values of a Digital Terrain Model (DTM) computed using LiDAR data from those of a co-registered Digital Surface Model (DSM) generated using a WorldView-2 satellite stereo pair. The 2 m LiDAR-based DTM was provided courtesy of the Aristotle University Forest Administration and Management Fund [23]. The DSM generation process capitalizes on the parallax between two overlapping images of the study area, acquired seconds apart during the same satellite pass, and on the corresponding locations of the satellite platform, to extract three-dimensional information quantized as a raster of elevations. The registration process, detailed in [26], uses a rational polynomial coefficient file that describes the relationship between object and image coordinates. The DSM generation algorithm is described in [27].
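A sketch of the CHM derivation and the per-FMU height-metric extraction follows, assuming the terra and lmom R packages and hypothetical file names; the actual processing chain of [26,27] is more involved:

```r
library(terra)  # raster algebra and zonal extraction
library(lmom)   # sample L-moments (robust distribution descriptors)

dsm <- rast("worldview_dsm_2m.tif")  # hypothetical photogrammetric DSM
dtm <- rast("lidar_dtm_2m.tif")      # hypothetical LiDAR ground model
chm <- dsm - dtm                     # canopy height = surface minus terrain

fmus <- vect("fmu_polygons.shp")     # hypothetical FMU boundaries
vals <- extract(chm, fmus)           # data.frame: FMU ID + CHM cell heights
names(vals)[2] <- "h"
vals <- vals[!is.na(vals$h), ]

# Conventional height metrics per FMU (hmean, hsd, percentiles)
hstats <- do.call(rbind, lapply(split(vals$h, vals$ID), function(v)
  c(hmean = mean(v), hsd = sd(v),
    h25 = quantile(v, 0.25, names = FALSE),
    h50 = quantile(v, 0.50, names = FALSE),
    h75 = quantile(v, 0.75, names = FALSE))))

# Robust L-moments per FMU (samlmu returns l_1, l_2, t_3, t_4)
lmoms <- do.call(rbind, lapply(split(vals$h, vals$ID), samlmu))
```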
Additional auxiliary data comprised FMU-specific records of tree density and volume from complete enumerations of trees in past (1988 and 1997) censuses. Although dated, those data remain relevant thanks to the sustainable management of the uneven-aged forest, which ensures a similar distribution of volume and tree density across the FMUs and their strong correlation with recent height metrics. A third auxiliary variable quantified FMU proximity as either “hard” or “soft”, based on polygon centroids calculated in a Geographic Information System (GIS). The soft constraint was applied when incorporating other auxiliary variables, allowing for a balance between attribute similarity and spatial proximity and thereby permitting some non-contiguity within the clusters. The hard constraint was used in the absence of auxiliary variables to enforce strict spatial FMU contiguity within a cluster. A summary of the wall-to-wall area-level auxiliary variables, encompassing height-related statistics and census data, is provided in Table 2.
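The two proximity constraints can be encoded as clustering inputs in the following way; this is a sketch, and the objects (hstats, census, centroids) and the weight w are illustrative assumptions rather than the exact implementation used:

```r
fmu_attr <- scale(cbind(hstats, census))  # assumed attribute matrix (heights + census)
fmu_xy   <- scale(centroids)              # assumed FMU centroid coordinates (X, Y)

# Hard constraint: cluster on centroids alone -> strictly contiguous groups
hard_input <- fmu_xy

# Soft constraint: append down-weighted coordinates to the attributes,
# trading attribute similarity against spatial proximity
w <- 0.5                                  # illustrative proximity weight
soft_input <- cbind(fmu_attr, w * fmu_xy)
```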

2.3. Methodology

2.3.1. Clusterability and Variable Selection

We evaluated the suitability of auxiliary variables for clustering through both visual and statistical methods. Utilizing the Hartigans’ Dip test [28,29], we confirmed that the FMU-specific aggregated height mean (hmean) is at least bimodal, suggesting the presence of multiple clusters. The Hopkins statistic [30], which prioritizes statistical over spatial nearest neighbors, yielded a result of 0.89—significantly higher than 0.5 and close to the optimal value of 1—thereby confirming that the distances in our dataset are variable and not random, and that distinct clusters do exist [28]. Additionally, we employed the Visual Assessment of Cluster Tendency (VAT) approach [31] to further validate clusterability. The VAT algorithm reorders the scaled dissimilarity matrix of observations and generates a two-dimensional map where similar observations are grouped and dissimilar ones are separated. In the resultant two-color map [32], red indicates high similarity and blue denotes low similarity (Figure 3). Further evidence of effective clustering was observed in boxplots, as illustrated in Figure A8 (Appendix A), where FMUs, represented by hmean values, were closely centered around a central value within each cluster and were well-separated from other clusters. These findings collectively provide compelling evidence for the existence of distinct clusters within our dataset.
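The three checks can be reproduced with standard R tooling; a minimal sketch, assuming the diptest, hopkins, and factoextra packages and the FMU-level matrix fmu_attr from the previous sketches:

```r
library(diptest)     # Hartigans' dip test of unimodality
library(hopkins)     # Hopkins clustering-tendency statistic
library(factoextra)  # VAT-style ordered dissimilarity image

# Dip test on the aggregated mean height: a small p-value rejects unimodality
dip.test(fmu_attr[, "hmean"])

# Hopkins statistic: values near 1 indicate a clusterable structure
set.seed(1)
hopkins(fmu_attr, m = nrow(fmu_attr) %/% 10)

# Visual Assessment of cluster Tendency: ordered dissimilarity heat map
fviz_dist(dist(scale(fmu_attr)), show_labels = FALSE)
```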
The selection of clustering variables was guided by three key approaches: prior knowledge, brute force, and dimensionality reduction through Principal Component Analysis (PCA). Initially, prior knowledge guided our choices, as we examined the relationship between the variables of interest and potential clustering variables, detailed in Table 2. The brute-force approach involved an exhaustive investigation to identify the most effective variables or combinations for partitioning the population. PCA was applied to perform an orthogonal transformation of the multidimensional data. The contributions of variables to the orthogonal axes, known as “loadings”, served as strong indicators of their utility for clustering [33]. Two separate PCAs were conducted: the first encompassed all available data, including census and height metrics, while the second focused solely on height data. The first PCA indicated that various height metrics (mean hmean, median h50, 25th percentile h25, 75th percentile h75, and mode hmode) each contributed more than 5% and were thus included in the clustering process. The second PCA highlighted the standard deviation of the height (hsd) as having the highest clustering potential. The third principal component revealed that census data, specifically stem density and volume, were the primary contributors, and these were therefore selected as the final auxiliary variables for clustering. Additional information on PCA is provided in Appendix A and Figure A1 and Figure A2. For the reader's convenience, Appendix A also includes the clustering terminology used in this study.
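A minimal sketch of the PCA screening with base R's prcomp; the 5% contribution cut-off follows the text, while the variable layout of fmu_attr is an assumption:

```r
pca <- prcomp(fmu_attr, scale. = TRUE)  # PCA on the scaled auxiliary variables
summary(pca)                            # variance explained per component

# Contribution (%) of each variable to each component (squared loadings)
load2   <- pca$rotation^2
contrib <- 100 * sweep(load2, 2, colSums(load2), "/")
round(contrib[, 1:3], 1)

# Retain variables contributing more than 5% on the leading components
rownames(contrib)[contrib[, 1] > 5]
```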

2.3.2. Optimal Number of Clusters and Sample Size per Cluster

In cluster analysis for SAE purposes, the minimum number of sample plots per cluster, plot size, and sampling intensity are key considerations. In this study, each cluster functions similarly to a stratum. The minimum number of plots per stratum is determined by the variable of interest and the desired level of precision. To ensure stable means and standard errors, at least 10 sample units per stratum are recommended [34,35]. According to the USDA Forest Service Forest Inventory and Analysis Program protocol [36], a minimum of 4 (Phase 2) cluster plots or 12 forested plots per stratum is required [37]. As a result, we aimed for an average of 10 plots per cluster (239 plots / 10 ≈ 24 clusters), yielding clusters with an average of 7 FMUs (160 FMUs / 24 ≈ 7).
When registering sample plots with remote sensing data, the planar plot size (area) may be less important than the resolution of the auxiliary data or registration errors. Various plot sizes have been used in models, often less than 600 m², with 400 m² being the most common [38]. For instance, one study tessellated the forest population into 250 m² cells and used national forest inventory circular plots of equal size on a 3 km × 3 km grid [34]. Additionally, many model-based forest inventories with simple, single-stage sampling designs ignore the sampling design [39,40]. The optimal number of strata or clusters may vary depending on the study's specific context and objectives. While some suggest that precision does not improve significantly with more than 6–8 strata [41], others recommend up to 10 strata for post-stratification [42]. In our study, we use strata not as a means to improve estimates, but as areas of interest. With relatively large sample plots (1000 m²) and a high sampling fraction (1% of the population), we aim to increase the sample size, or equivalently to define larger “small areas” through clustering, to enhance the efficiency of direct estimates.

2.3.3. Optimal and Best Clustering Schemes

In this section, we elaborate on the methodology employed to derive “optimal” and “best” clustering schemes, with a particular focus on identifying the optimal number of clusters, denoted as k. Our primary objective was to discern the factors that most significantly influence the determination of the optimal k. A “clustering scheme” is defined as a specific strategy that includes the selection of variables, algorithms, and the similarity or distance metrics used (Table 3). Different clustering schemes can yield varying clustering solutions and might not always produce an optimal k. We identified three primary factors that play a crucial role in determining the optimal k: (i) the clustering index, (ii) the minimum number of clusters, and (iii) the maximum number of clusters. Traditionally, the minimum number of clusters is automatically set to k = 2. To avoid ineffective clustering, we aimed for a k value substantially larger than this minimum, defining a range for k by setting minimum and maximum thresholds.
We used various clustering indices to determine the optimal k (Table 3). Each index provided a specific value, and the ideal clustering scheme emerges from the most effective partitioning of the data into clusters. This partitioning is determined by a particular clustering index, and its optimal value identifies the optimal k. We conducted an exhaustive exploration for the optimal k, covering a wide array of clustering schemes, indices, and two specific k ranges. We categorized the clustering schemes into two distinct methodologies (1st and 2nd), detailed in Table 3 and sketched in code after the list:
  • First Methodology: This focused on 13 distinct clustering schemes defined using the PAM algorithm, the Euclidean distance metric, and 13 specific clustering variables. For this methodology, k ranged from 2 to 50.
  • Second Methodology: This broader approach evaluated a combination of 468 individual clustering schemes. Each scheme was optimized using one of four clustering indices (Table 3). Thus, we identified 4 × 468 = 1872 optimal clustering scheme variants. The initial k range of 2 to 50 was narrowed to 8 to 50.
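A condensed sketch of both methodologies with the cluster and NbClust packages; X stands for the scaled matrix of the selected clustering variables, and the single NbClust call shown (k-means with the Krzanowski–Lai index) is one of the 468 scheme combinations:

```r
library(cluster)  # PAM and average silhouette width (1st methodology)
library(NbClust)  # many indices over many schemes (2nd methodology)

X <- scale(fmu_attr)  # assumed matrix of selected clustering variables

# 1st methodology: PAM + Euclidean distance; choose the k in 2-50 that
# maximizes the average silhouette width
sil <- sapply(2:50, function(k)
  pam(X, k, metric = "euclidean")$silinfo$avg.width)
k_opt1 <- (2:50)[which.max(sil)]

# 2nd methodology: one NbClust run per scheme, with the raised minimum k = 8
nb <- NbClust(X, distance = "euclidean", min.nc = 8, max.nc = 50,
              method = "kmeans", index = "kl")
nb$Best.nc  # optimal k and the corresponding index value
```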
We hypothesized that the first methodology, based on the PAM algorithm, might more consistently identify the optimal k for SAE purposes when compared to the second. This supposition is based on the streamlined focus of the first methodology in contrast to the comprehensive breadth of the second, which encompasses a significantly larger set of clustering schemes.
A subset of clustering schemes, based on their best index values for the second methodology, is presented in Table 3. The best index was determined by ranking the indices of the optimal clustering schemes and selecting those that are globally optimal, represented by the highest index values. These schemes were designated as the “best”, distinct from the “optimal” ones, and were used for subsequent SAE applications. Note that each clustering scheme, even when using a consistent k, can yield diverse cluster compositions or FMU groups.

2.3.4. Preprocessing and Dissimilarity Metrics

Before applying the clustering algorithms, we scaled the data to ensure equal weight across all variables in the distance calculations. The dissimilarity matrix, representing the distances between observations of the auxiliary variables (Distance = 1 − Similarity), was computed using four distance metrics: Euclidean, maximum, Manhattan, and Minkowski, as detailed in Table 3 [42]. The specific clustering algorithms and indices employed are elaborated in subsequent sections. The workflow for cluster analysis, which aggregates similar FMUs by finding the optimal k, and the procedure for selecting the best clustering schemes for both direct and model-based small area estimates are depicted in Figure 4.
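In base R, this preprocessing amounts to a call to scale() followed by dist(); a brief sketch covering the four metrics of Table 3 (the Minkowski order p is an assumption, since p = 2 reduces to the Euclidean case):

```r
X <- scale(fmu_attr)  # zero mean, unit variance per variable

d_euc  <- dist(X, method = "euclidean")
d_max  <- dist(X, method = "maximum")
d_man  <- dist(X, method = "manhattan")
d_mink <- dist(X, method = "minkowski", p = 3)  # illustrative order p
```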

2.3.5. Clustering Algorithms

Clustering algorithms are central to cluster analysis. Various algorithms exist for this purpose, including partitioning, hierarchical, density-based, grid-based, model-based, and constraint-based methods [5]. These algorithms can be broadly categorized into “hard” and “soft” clustering methods. Hard clustering methods provide a definitive, exclusive cluster assignment for each object—in our case, an FMU [7]. In contrast, soft clustering methods, such as fuzzy and model-based techniques, allow for more flexible assignment of objects to specific clusters [5]. Selecting the most appropriate clustering methodology can be challenging due to the inherent strengths and weaknesses of each algorithm and the specific requirements of the research question. One way to choose the best clustering technique is to test multiple algorithms and evaluate their performance using internal or external validation criteria. Alternatively, domain knowledge can serve as an external evaluation metric.
In this study, we employed both partitioning (non-hierarchical) and hierarchical algorithms, which are considered hard clustering methods. Several agglomerative hierarchical algorithms, including Ward.D, Ward.D2, single, complete, average, McQuitty, median, and centroid [43,44], as well as non-hierarchical alternatives, including k-means [5] and k-medoids, were explored. The k-medoids method, implemented here with the PAM algorithm, is considered an advancement of the k-means clustering method. Unlike k-means, k-medoids uses actual data objects as cluster centers, handles outliers more effectively, and thus provides better discretization into specific clusters [45]. However, in normally distributed data, these methods are unlikely to exhibit significant performance differences.
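For reference, the three algorithm families can be run on the same scaled input; a sketch with an arbitrary k = 15:

```r
set.seed(1)
km <- kmeans(X, centers = 15, nstart = 25)  # k-means: centers need not be FMUs
pm <- pam(dist(X), k = 15, diss = TRUE)     # PAM: medoids are actual FMUs
hc <- hclust(dist(X), method = "ward.D2")   # agglomerative alternative
cl <- cutree(hc, k = 15)                    # hard assignment from the dendrogram

table(km$cluster, pm$clustering)            # cross-tabulate the two partitions
```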

2.3.6. Clustering Indices

A clustering index serves as a quantitative measure of clustering performance. A clustering scheme is considered optimal when k is determined using such an index. Larger k values tend to promote homogeneity within clusters but may compromise meaningfulness and reduce correlation with the auxiliary variables. Given the varying evaluation criteria imposed by different indices, it is advisable to experiment with multiple optimal clustering schemes. In this study, four indices were identified as most appropriate: Caliński and Harabasz [46], Krzanowski and Lai [47], and the average Silhouette width index [48] determine the optimal k from the index's maximum value; the fourth index, Friedman and Rubin [49], uses the maximum difference between hierarchy levels of the index. Additional methods like the elbow method, the gap statistic, the Dunn index, and the Hubert index were also explored but yielded a small optimal number of clusters (2–4), rendering them unsuitable for this study's objectives. Further details on the clustering index metrics can be found in [43] and in Appendix A (Table A1).

2.3.7. Evaluation of Design-Based Direct Estimates

The evaluation of the optimal clustering schemes involves cluster validation, an important task in assessing their effectiveness. Unlike supervised learning algorithms, evaluating the performance of unsupervised machine learning clustering algorithms poses unique challenges. Both the Silhouette and Dunn indices were excluded from the evaluation. The Silhouette index's value tends to decrease as the number of variables increases, thereby complicating its interpretive utility. Moreover, both the Silhouette and Dunn [50] indices are internal evaluation metrics, which may not directly align with this study's primary objectives. These objectives include evaluating the direct estimates obtained after clustering the FMUs and assessing the feasibility of model-based SAE through robust correlations among variables.
In this context, clustering functions as a preprocessing step, and its performance is evaluated based on its utility for SAE. Our evaluation focused on external metrics that directly assessed the reliability of the direct small area estimates within each cluster and their correlation with the selected auxiliary covariates. After identifying optimal clustering schemes, we selected a few specific “best” clustering scheme solutions based on the following evaluation criteria.
The evaluation of the “best” clustering schemes for SAE was conducted at both the cluster and population levels. At the cluster level, we focus on assessing the performance of individual clusters or subpopulations. For this purpose, we calculate the Volume for each cluster using the SRS estimator (Equation (1)). The effectiveness of the SRS method for each cluster is quantified through the standard unbiased sampling variance estimator (Equation (2)). Additionally, the relative variability of our mean estimates is expressed through the Relative Standard Error (RSE). To assess the reliability of these estimates, we adhere to the widely accepted 10–15% forestry threshold [51,52]. Conversely, the evaluation at the population level synthesizes performance metrics across clusters to assess overall effectiveness and to identify schemes suitable for both direct and model-based estimations.
The population-level evaluation criteria for direct estimates were based on several factors. First, we calculated the mean of the RSEs (Equation (3)) across all clusters to provide an average precision of the clusters' variability. Lower mean RSEs may suggest either fewer clusters or a larger sample size per cluster. Additionally, we assessed variability and distribution by computing the standard deviation (StD) and the 90th percentile (p90) for both the RSEs and the number of sample plots (nPlots). Lower StD values indicate more consistent clusters in terms of RSEs. The p90 RSE metric focuses on the majority (90%) of the estimates, offering several advantages, including capturing variation within the main distribution and excluding outliers, particularly those with small sample sizes. Accordingly, the mean and StD of nPlots represent the average number of sample plots and their variability across all clusters in a given clustering scheme. We also excluded clusters with only one plot, as a single plot prevents variance estimation and RSE calculation for the corresponding small area. Lastly, we employed the relative efficiency (RE) (Equation (4)) as an internal-like cluster evaluation criterion to compare the sampling variance of means within clusters to that of SRS for the entire population.
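A sketch of this population-level screening, assuming per-cluster vectors rse (as proportions) and n_plots obtained from the estimators defined below:

```r
keep   <- n_plots > 1   # single-plot clusters: no variance estimate possible
rse_ok <- rse[keep]

scheme_stats <- c(
  mean_RSE = mean(rse_ok),
  sd_RSE   = sd(rse_ok),
  p90_RSE  = quantile(rse_ok, 0.90, names = FALSE),
  mean_n   = mean(n_plots),
  sd_n     = sd(n_plots),
  one_plot = sum(!keep)  # clusters excluded from direct SAE
)
round(scheme_stats, 3)
```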
To illustrate the notation used for direct estimation, we consider a finite population $U$ of size $N$, partitioned into $K$ clusters or domains $U_1, U_2, \ldots, U_K$ of sizes $N_1, N_2, \ldots, N_K$. Here, $k$ refers to the $k$-th cluster, with $k = 1, \ldots, K$, and $j$ refers to the $j$-th sample plot, with $j = 1, \ldots, N_k$, within a cluster. Sample data $s_k$ are obtained from this population by drawing a sample of $n$ plots, with $n_1, \ldots, n_K$ plots per domain. The target variable is denoted by $y_{kj}$, and the sample mean for the $k$-th cluster $U_k$ is obtained using the following:
$$\hat{\bar{Y}}_k = \bar{y}_k = \frac{1}{n_k} \sum_{j=1}^{n_k} y_{kj} \qquad (1)$$
The variance of the means used in Equation (3) is assumed to be equal to that of SRS with replacement, without the finite population correction (FPC), because the sampling intensity of 1% is smaller than 5% [41] (p. 39) and thus has a negligible influence on the variance estimates.
We estimated the variance of the mean estimator $\hat{V}(\bar{y}_k) = s_k^2 / n_k$ with sample variance

$$s_k^2 = \frac{\sum_{j=1}^{n_k} \left( y_{kj} - \bar{y}_k \right)^2}{n_k - 1} \qquad (2)$$
The RSE of the mean is equal to

$$RSE_k = \frac{SE_k(\bar{y}_k)}{\bar{y}_k} \qquad (3)$$

where the standard error of the mean $SE_k$ for domain $k$ is $SE_k(\bar{y}_k) = \sqrt{\hat{V}(\bar{y}_k)}$.
The RE serves as a metric to evaluate the efficiency of individual clusters in reducing internal variance relative to the variance derived through SRS from the overall population. The formula for calculating $RE_k$ for a specific cluster $k$ is

$$RE_k = \frac{\hat{V}(\bar{y}_{SRS})}{\hat{V}(\bar{y}_k)} \qquad (4)$$
where V ^ y ¯ S R S represents the variance obtained through SRS from the entire population. An R E k > 1 for a specific cluster k indicates that the cluster is more efficient, as it yields a lower internal variance compared to the variance obtained through SRS from the population. Conversely, an R E k < 1 implies that the cluster is less efficient, as its internal variance is greater than that derived from the SRS of the population. An aggregated distribution of R E k values can offer insights into the overall effectiveness of a clustering scheme. Note that we use RE to confirm the effectiveness of a clustering scheme, not to compare different SAE techniques.
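Equations (1)–(4) translate directly into base R; a sketch assuming a plot-level data frame plots with columns cluster and volume:

```r
by_cl <- split(plots$volume, plots$cluster)

n_k    <- sapply(by_cl, length)     # sample size per cluster
ybar_k <- sapply(by_cl, mean)       # Equation (1): cluster means
v_k    <- sapply(by_cl, var) / n_k  # Equation (2): variance of the mean
rse_k  <- sqrt(v_k) / ybar_k        # Equation (3): relative standard error

# Equation (4): relative efficiency against SRS over the whole population
v_srs <- var(plots$volume) / nrow(plots)
re_k  <- v_srs / v_k
mean(re_k > 1, na.rm = TRUE)        # share of clusters more efficient than SRS
```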
The statistical analysis was implemented using the open-source statistical software R [53], the R packages “cluster” [54] for the first clustering methodology, and “NbClust” [43] for the second clustering methodology.

2.3.8. Model-Based SAE: Correlation Analysis

Transitioning from design-based to model-based estimation necessitates additional data preprocessing steps. Initially, after clustering the FMUs, we calculated the auxiliary variables at the cluster level using a weighted mean approach. In this approach, the FMU area size and the CHM cells' coverage percentage served as weights for the height metrics, ensuring more accurate estimates of the auxiliary variables at the cluster level. Subsequently, we selected the best clustering schemes based on specific criteria: no more than 16% of the k clusters should contain only one plot, and a minimum of k = 15 clusters was considered acceptable [55]. Finally, we conducted a correlation analysis between all variables of interest and the auxiliary covariates. The goal was to identify strong correlations among the variables, a prerequisite assumption for constructing reliable area-level SAE models [3]. Additional emphasis was placed on identifying strong correlations for the extended multivariate Fay–Herriot models, with a focus on ensuring robust correlations between the response variables, a critical consideration for potential future research [56]. To quantify the linear relationship between the response and predictor auxiliary variables, we estimated the Pearson correlation coefficient as the measure of association. We also assessed the strength and statistical significance of the observed correlations using p-values. Significance levels were denoted as follows: “***” for p-value < 0.001, “**” for p-value < 0.01, “*” for p-value < 0.05, and “.” for p-value < 0.10. In our study, correlations with p-values up to 0.01 (denoted by “**”) were considered statistically significant.
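A sketch of the area-level preparation and the correlation screen; the column names (area_ha, chm_cover, hmean, h50, hsd) and the data frame fmu are assumptions, and ybar_k is the vector of direct estimates from the previous sketch, ordered by the same cluster labels:

```r
# Aggregate FMU-level height metrics to the cluster level, weighting by
# FMU area times CHM coverage fraction
w <- fmu$area_ha * fmu$chm_cover
aux_cl <- apply(fmu[, c("hmean", "h50", "hsd")], 2, function(v)
  tapply(v * w, fmu$cluster, sum) / tapply(w, fmu$cluster, sum))

# Pearson correlation and p-value between a direct estimate and a covariate
ct <- cor.test(ybar_k, aux_cl[, "hmean"], method = "pearson")
c(r = unname(ct$estimate), p = ct$p.value)
```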

3. Results

3.1. Analysis of Optimal Clustering Schemes

In this section, we present the analysis of the optimal number of clusters k and their distribution, information conducive to detecting the underlying patterns and variations in the FMUs. With k in the 2–50 range, the first clustering methodology (Table 3) consistently yielded an optimal number of clusters larger than 8 for the different variables, thereby obviating the need to increase the minimum threshold for k (Figure 5). The second clustering methodology produced 468 clustering schemes and four times as many indices (4 × 468), with the most frequently occurring “optimal number of clusters” also falling within the 2–50 range. It is worth noting that the smaller optimal k values arose from different clustering schemes, with no apparent influence from the algorithm, distance, or clustering variable. k = 2 is meaningless for this study, as it leads to a coarse grouping of the data, unlikely to capture the underlying patterns or variations in the FMUs. Conversely, while k = 50 indicates an increasing trend toward the optimal number of clusters, it inflates the sampling variance and leaves, on average, only 5 sample plots per cluster. To balance capturing underlying patterns and minimizing variance, we set the minimum threshold to k = 8 for the second clustering methodology. This choice is smaller than the frequently advocated reference value of 10 strata/clusters and corresponds, on average, to 30 plots per cluster. Applying the first methodology to the Volume (sampled data), the optimal number of clusters was found to be 21, substantially larger than the minimum considered (k = 8).
The first clustering methodology is more straightforward, as it relies solely on the PAM algorithm, Euclidean distance, and the Silhouette index, without the need to set a minimum k threshold. In contrast, the Silhouette index in the second methodology often yielded a k value greater than 40, frequently reaching k = 50. The Caliński–Harabasz index produced similar outcomes, with k often equaling 50 (Figure 6). Compared to the Silhouette (Figure A3 and Figure A4) and Caliński–Harabasz indices (Figure A5 and Figure A6), the Friedman and Rubin index generated more favorable outcomes, although extreme values (k = 50) were still observed.
The Krzanowski and Lai index proved to be a more suitable choice for determining the optimal number of clusters k (Figure 7), consistently yielding values in the 8–50 range.
The variable hmode consistently yielded an optimal k of 18 clusters, leading to its exclusion from Figure 6 and Figure 7. Overall, the choice of index and algorithm were the most influential factors in determining the optimal k, while the distance metric and the clustering variables had minimal impact. In summary, the optimal k can be readily ascertained using the 1st clustering methodology with PAM and the Silhouette index, or solely with the Krzanowski and Lai index in the case of the 2nd methodology.

3.2. Design-Based Direct SAE

Utilizing the “Best Index Value” criteria detailed in the methodology (Table 3), we identified 51 “best” optimal clustering schemes from the results of the 1st and 2nd clustering methodologies and used them as candidates for SAE applications. These schemes were evaluated for their suitability for direct estimation of Volume. Of these, 34 were selected for further direct and model-based estimation. According to the criteria we set, the top 8 clustering schemes in Table 4 provided reliable direct estimates. After excluding clusters that had only one plot (on average, one occurrence per scheme), these top 8 schemes partitioned the forest into an average of k = 15.3 clusters. The clustering schemes had mean RSEs of 8.86% and an average of 17.56 plots per cluster, with each cluster covering an average of 13.3 FMUs or 172.7 hectares of forest land. For example, one specific scheme (SN 1, Table 4) achieved excellent direct estimates with k = 13, a mean of RSEs of 8% (0.08), p90 RSEs of 10%, and an average of 18 plots per cluster.
Another noteworthy scheme is SN 6 (Table 4), which employed soft spatial consistency and utilized the PAM algorithm, the Silhouette index, and Euclidean distance. This scheme had k = 30, mean RSEs of 10%, p90 RSEs of 12%, and an average of 8 plots per cluster. The spatial distribution of inventory variables for this scheme is illustrated in Figure 8. In contrast, hard constraint clustering, which enforces spatial contiguity without considering auxiliary variables, consistently produced larger RSEs and was therefore excluded from Table 4. This suggests its unsuitability for SAE estimates. However, soft constraint clustering appeared in 5 out of 8 instances among the 34 best clustering solutions in Table 4, achieving smaller mean RSEs of up to 10%. In summary, the data in Table 4 reveal a trend: as the mean of the RSEs increases, so does the value of k. With a higher k, more clusters consisting of just one sample plot must be excluded for SAE purposes, as they cannot provide sampling variance estimates. However, these single-plot clusters can help identify outliers. Figure 9 illustrates the RSE distribution of the SN 25 scheme for the four variables of interest. Among these, Tree Density displayed the largest RSEs, while Mean Height had the lowest.
Among the dissimilarity measures, Euclidean distance emerged as the most prevalent, appearing in 85.3% of the 34 selected optimal clustering schemes listed in Table 4. The algorithms k-means and PAM were the most frequently used, each accounting for 26% of the schemes, followed by Ward.D at 15%. Specifically, for the 1st clustering method, the PAM algorithm occurred in 9 of the initial 13 schemes (Table 3), indicating a near 70% likelihood of suitability for SAE purposes. For the 2nd clustering method, the most common algorithms were k-means (36%), Ward.D (20%), single (12%), median (12%), complete (8%), and average (8%), with McQuitty appearing only 4% of the time. The Ward.D2 and centroid algorithms were not represented in these schemes at all.
Certain variables, such as hmean, h50_X_Y, h50, and hLcv, were more frequently featured in the best-performing schemes. Specifically, the single variables h50 and hmean appeared 13 and 9 times, respectively, out of the 34 schemes (Table 4), either used individually or in combination with other variables. The Silhouette index was the most common among the clustering indices at 44% for both clustering methodologies, followed by CH, KL, and FR at 24%, 21%, and 12%, respectively. Notably, the value of the Silhouette index is higher when only one variable is used and decreases as the number of variables increases. This observation suggests that the Silhouette index should be evaluated carefully, taking into account both its value and the number of variables involved; otherwise, clustering schemes that incorporate multiple variables may be unduly penalized.
Figure 10 presents the RE for all variables of interest within clustering scheme SN 25. The figure validates the effectiveness of the partitioning approach. The majority of clusters exhibit an RE greater than 1, indicating reduced intra-cluster variability relative to the population’s simple random sample variability.

3.3. Correlation Analysis for Model-Based SAE

We conducted a correlation analysis to examine the relationship between response and predictor variables, using scatterplot matrices and Pearson correlation coefficients, as illustrated in Figure 11. One key observation was that small areas with limited sample sizes—particularly those with only one sample plot, followed by those with two or three—negatively impacted the correlation and thus the validity of the estimators. These outliers could be identified through both design-based standard errors and model-based correlation analysis. Moreover, the high standard deviation of RSEs often indicated the presence of outliers or domains characterized by extreme variability, which in turn adversely affects the correlation. Clustering schemes with a higher number of clusters k were more likely to contain outliers, thereby weakening the correlation among variables. While the conventional Pearson correlation was not robust in the presence of outliers, employing a weighted correlation approach that accounted for sample size allowed us to effectively quantify the impact of outliers, as demonstrated in Figure A7.
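The sample-size-weighted correlation mentioned above can be sketched with stats::cov.wt; weighting by per-cluster plot counts is an assumption consistent with the text:

```r
# Weighted Pearson correlation, down-weighting small-sample clusters
wc <- cov.wt(cbind(ybar_k, aux_cl[, "hmean"]),
             wt = n_k / sum(n_k), cor = TRUE)
wc$cor[1, 2]
```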
Nine of the thirty-four selected best clustering schemes with k ≥ 15 were particularly well-suited to model-based estimation, as detailed in Table 4. We noted excellent correlations with k < 15, yet k = 15 is considered a modest number of small areas for SAE [54]. For instance, the clustering scheme SN 25, which initially had a k of 42 but was reduced to 32 after outlier removal, showed strong correlations across multiple domains, as evidenced in Figure 11. This scheme retained 219 of the 239 plots and 139 of the 160 domains, with an average of 6.36 sample plots per cluster and a mean RSE of 0.12 (12%).
The target variables that displayed the highest Pearson correlation coefficients with the height-related covariates were, in decreasing order, Mean Height, Volume, Tree Density, and Basal Area (Figure 11). We observed weak or no correlation between Basal Area and the height metrics. Volume exhibited a strong correlation with Basal Area and Mean Height but no correlation with Tree Density. Tree Density showed a good negative correlation only with Mean Height. Lastly, Mean Height demonstrated a good correlation with Tree Density and no correlation with Basal Area. In summary, Volume and Mean Height displayed strong correlations with two of the three predictor variables, whereas Basal Area and Tree Density correlated with only one.
Our results confirmed that an increase in k or in the mean of the RSEs led to weaker correlations between auxiliary variables and direct estimates. Conversely, when the sample size within each cluster increased, the correlations between the target variables and auxiliary covariates improved. In summary, our findings suggest that the type of clustering, the sample size, and the presence of outliers have a significant influence on the correlation between target variables and auxiliary covariates. Soft spatial constraint clustering appeared to negatively impact correlation, while hard constraint clustering was not suitable for direct SAE. Given that our study was conducted in an uneven-aged forest, likely the hardest-case scenario, we anticipate that applying our methodology to even-aged forests could yield improved correlations.

4. Discussion

The proposed clustering methodology makes a significant contribution to SAE in forest inventory. It offers both direct and model-based estimates for aggregated FMUs. This approach is particularly valuable as a pre-processing step when small sample sizes are encountered or when a model-based approach is not feasible. A model-based approach is not feasible for area-level models when the correlation between auxiliary and target variables is weak, and for unit-level models when there are registration errors. In area-level models, constraints emerge when the small sample size per domain negatively impacts the correlation. This study addresses these issues through cluster analysis and area-level covariates, producing reliable design-based direct or model-based estimates for aggregated FMUs. Understanding the trade-off between sample size and the number of clusters is essential for the effective application of the proposed methodology. A clustering scheme with a low mean and a low standard deviation of the RSEs indicates precise and consistent estimates. One limitation of this methodology is that it enlarges the initial small areas to increase the sample size, resulting in estimates of coarser spatial resolution. However, this is mitigated by clustering these areas into larger, yet similar, areas. Importantly, the validation of this methodology extends beyond internal clustering metrics. It also includes the efficiency of the direct small area estimates, which we measure in terms of RSEs.
To our knowledge, no previous studies have used cluster analysis for downscaling design-based direct estimates to a finer spatial resolution or for defining small areas for model-based SAE. We propose two solutions for SAE, which involve either aggregating the data at the cluster level for direct estimation or using area-level models after clustering. The latter approach becomes valuable when direct estimates are not precise enough but strong correlations between the variables of interest and the auxiliary covariates exist. Correlation analysis becomes crucial when dealing with small sample sizes per cluster. It is important to remember that the correlation between auxiliary variables and variables of interest can vary based on the clustering scheme used. Generally, the correlation weakens as the number of clusters k increases or as the sample size within each cluster decreases. Therefore, balancing the number of clusters with the minimum sample size is critical for ensuring robust direct estimates and a strong correlation with the auxiliary variables. Particular attention should be paid to clusters with fewer than 10 sample plots [34,35], as these may be unstable, thereby leading to unreliable direct estimates.
Our study incorporated soft and hard spatial contiguity clustering using FMU centroids, with and without auxiliary variables. The results indicate that soft spatial clustering can enhance the accuracy of direct estimates. We suggest further research to explore more advanced spatial clustering methods, such as assigning varying weights to quantify similarity and spatial proximity. In cases where there are spatially contiguous clusters with a strong correlation between auxiliary and target variables, the use of the spatial Fay–Herriot methodology is advisable [3,57,58]. Different estimators can be considered when area-level auxiliary data are aggregated and show correlation with the variables of interest. Additionally, the widely used model-based Fay–Herriot model has been incorporated in recent studies to explore the application of model-assisted SAE within different sampling schemes [59,60,61,62,63,64].
Based on this methodology, the defined clusters can also serve as strata in post-stratification estimators or for optimal sample allocation in stratified sampling. Linear regression models can treat post-strata as categorical variables [42]. Post-stratification generally outperforms pre-stratification methods for multi-purpose inventories like national forest inventories, yielding more satisfactory results for various variables [65].
Cluster information can be incorporated into the SAE area-level models either as a categorical variable [66] or as an additional parameter to improve the estimates’ efficiency [67,68]. In some studies, employing hierarchical clustering based on covariates has been shown to effectively yield more accurate predictions of small area means. This is achieved by considering whether variance components are equal or unequal across different clusters [69]. Lastly, cluster information can be included as an average area random effects value in both unit-level [70] and area-level models [68,71,72] for predicting non-sampled areas.
In this study, the primary focus was on exploring and determining the optimal number of clusters k within predefined FMUs through cluster analysis. Future research could benefit from applying cluster analysis to more precisely delineated FMUs. The approach of automated forest stand delineation, which utilizes height metrics and imagery derivatives to create smaller, more homogeneous forest patches, holds promise. Object-Based Image Analysis (OBIA) has been a prevalent method in this field, using clustering algorithms that consider spatial, spectral, and textural properties [73]. The advent of Airborne Laser Scanning (ALS) technology opens new avenues for exploration. For instance, cellular automaton algorithms, when applied to a CHM, could provide valuable insights into stand homogeneity in terms of GSV, stand area, and shape [74]. Other methods, like the self-organizing map (SOM) or simulated annealing metaheuristic, have also shown promising results in forest stand delineation using ALS data [75,76]. Additionally, a recent approach involving mixed integer programming to analyze ALS-derived raster information also presents an interesting FMU generation perspective [77]. By applying cluster analysis to these more homogeneous FMUs, we anticipate enhancing the sampling representativeness within the newly defined clusters, leading to more precise direct estimates for small areas within a given forest region.
Our methodology has the potential to enhance forest inventory practices across a wide range of forests, particularly when area-level auxiliary covariates are available. We have successfully applied it in a dense, multi-layer, uneven-aged forest with sparse sample plots per FMU. Although the current study focused on direct estimates, we also lay the groundwork for model-based approaches by establishing strong correlations between target and auxiliary variables. Initial results indicate that this methodology can be effectively used in univariate Fay–Herriot models [67]. Its application could also extend to multivariate Fay–Herriot models [78]. Furthermore, the methodology may find even greater success in homogeneous, even-aged forests.
Beyond forestry, this methodology has potential applications in environmental science, agriculture, and epidemiology. It can be useful for poverty mapping and other relative indicators when dealing with extremely small sample sizes or weak correlations under the area-level models’ framework. Future research will continue to explore the optimal number of clusters, k. This will include investigating the influence of new auxiliary clustering variables or clustering indices, as our study primarily used height metrics and tree census data. Researchers in various fields can further validate the importance of the minimum number of clusters, k, and assess the impact of different distance or clustering algorithms, as well as the effectiveness of the PAM algorithm.
In summary, our clustering methodology not only enhances the reliability of SAE in forest inventory but also offers promising avenues for application in diverse fields. Its adaptability to various data types and effectiveness in addressing challenges like small sample sizes and weak correlations make it a valuable tool for future research.

5. Conclusions

The main objective of this study was to examine the interrelationship between the reliability of direct estimates and the identification of an appropriate number of small areas or clusters. The optimal number of clusters k was determined based on the small sampling errors observed in selected optimal clustering schemes. In most cases, care should be taken in setting an appropriate minimum threshold for the number of clusters. The minimum threshold of k = 8 was the primary factor influencing the clustering results for all algorithms except PAM (1st clustering method). The optimal number of clusters was significantly impacted by the choice of clustering index, while the clustering algorithm, auxiliary variable selections, and distance metric employed were less consequential (for the 2nd clustering method). However, when many clustering variables are used, attention must be paid to the decreasing cluster homogeneity and the increasing expected number of clusters. Our results suggest that effective clustering can be achieved with a selection of one to three variables.
The Euclidean distance metric and the Silhouette index were the most prominent in identifying optimal cluster numbers. However, caution is advised when using the Silhouette index with an increasing number of variables due to the observed decrease in the index value. The Krzanowski and Lai index was best suited for assessing the optimal number of clusters under the 2nd clustering methodology. The k-medoids or PAM algorithm, in conjunction with the average Silhouette index, Euclidean distance, and height metrics, offers a pragmatic and efficient clustering scheme that generates optimal clusters suitable for Small Area Estimation (SAE) purposes without requiring a specific minimum threshold other than k = 2 (1st clustering methodology).
The proposed methodology offers promising potential for reliable direct estimates of forest inventory attributes at the aggregated FMU level, particularly when faced with challenges such as small sample sizes or a lack of unit-level covariates. This study successfully utilized satellite remote sensing-based height metrics and census data as wall-to-wall area-level covariates for clustering homogeneous FMUs. Moreover, a strong linear relationship was found between the target and auxiliary variables in this uneven-aged forest, indicating the potential for improved estimation precision with the use of model-based small area estimators. Our findings suggest promising opportunities for enhancing forest inventory SAE, thereby strengthening forest management practices and decision-making processes.

Author Contributions

A.G. and G.S. conceptualized the study; methodology, A.G.; software, A.G. and D.G.; validation, A.G.; formal analysis, A.G.; investigation, A.G.; resources, A.G., D.G. and G.S.; data curation, A.G. and D.G.; writing—original draft preparation, A.G.; writing—review and editing, A.G. and D.G.; visualization, A.G. and D.G.; supervision, D.G. and G.S.; project administration, A.G.; funding acquisition, A.G. and D.G. All authors have read and agreed to the published version of the manuscript.

Funding

This study is part of A.G.'s doctoral thesis, which has been financially supported by the Hellenic Scholarship Foundation (IKY) and the European Social Fund (ESF) through the Operational Programme ‘Human Resources Development, Education and Lifelong Learning’, in the context of the Act ‘Enhancing Human Resources Research Potential by undertaking a Doctoral Research, Sub-action 2: IKY Scholarship Programme for PhD candidates in the Greek Universities’.

Data Availability Statement

Sample survey data and digital maps used in the study are available upon request from the University Forest Administration and Management Fund at Aristotle University of Thessaloniki (https://www.auth.gr/en/university_unit/tameio-uniforest-en/, accessed on 13 August 2023). The Digital Surface Model extracted from the WorldView imagery is available from the authors upon request.

Acknowledgments

The WorldView stereo imagery was obtained via the End User License Agreement (EULA) for the U.S. Federal Civil Government (Title 5 U.S. Code).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Appendix A.1. Terminology

  • Clustering/Clustering Analysis: The process of grouping similar objects or FMUs into clusters, based on shared attributes or characteristics (observations).
  • Cluster: A group of similar objects or an aggregation of FMUs that are similar to each other within a clustering. Clusters are non-overlapping subsets of the population that can be considered small areas or forest subpopulations.
  • Object: The individual elements that are subject to clustering, specifically referring to homogeneous FMUs or forest patches within the population.
  • Observations: The measurable characteristics or attributes (clustering variables) that describe the objects and are used to form clusters
  • Clustering (auxiliary) variables: Clustering variables describe the objects or FMUs using specific attributes such as height metrics, past census data, centroids, or a combination thereof.
  • Clustering Scheme: A specific approach used in clustering, including the selection of variables, algorithms, and similarity or distance metrics.
  • Minimum and maximum number of k: The predetermined minimum and maximum number of clusters k that are considered in the clustering analysis.
  • Optimal k : The ideal number of clusters to divide the data into, determined using specific clustering index criteria.
  • Optimal Clustering Scheme: This is the specific clustering approach that, using a defined set of variables, algorithms, and similarity or distance metrics, divides data into the optimal k clusters. The division is determined by the best values from a selected clustering index and its associated optimal index value, all within a predefined range for the number of clusters k .
  • Clustering Indices: Metrics or (internal) measures used to evaluate the quality or suitability of a particular clustering scheme. Clustering indices determine the optimal number of clusters k based on their optimal index value.
  • Optimal Index Value: The best value of a clustering index for determining the optimal number of clusters k for a clustering scheme.
  • Best Index Value: Refers to a selection of optimal index values, used to further refine the optimal clustering schemes to identify the most suitable ones regarding the number of clusters k .
  • Best Clustering Schemes: The most effective or suitable clustering schemes are selected from among the optimal ones, based on the best index value.

Appendix A.2. Clustering Indices

The methodology for determining the optimal number of clusters has been explored in depth in the literature, notably by [43]. Various indices have been proposed, but for this study, the most useful are presented below and summarized in Table A1.
Table A1. Indices used to define the optimal number of clusters.
Table A1. Indices used to define the optimal number of clusters.
Index Abbreviation.Index in LiteratureOptimal Number of Clusters Defined byEquation
CHCaliński and Harabasz [45]Maximum value of the index C H q = t r a c e B q / q 1 t r a c e W q / n q
FRFriedman and Rubin [48]Maximum difference between hierarchy levels of the index F R = t r a c e W q 1 B q
KLKrzanowski and Lai [46]Maximum index value K L q = D I F F q D I F F q + 1
SSilhouette [47] Maximum index value S = i = 1 n   S i n , S 1,1
Consider a data matrix X with p variables measured across n independent observations. The matrix can be represented as X n × p = x i j , i = 1 , 2 , , n , j = 1 , 2 , , p . The centroid of the data matrix X is denoted from x ¯ . Let n k to be the number of objects in cluster C k , c k be its centroid, and x i be the observation p -dimensional vector of the i t h object.
The within-group dispersion matrix, when the data are clustered into q clusters, is given by:
W q = k = 1 q   i C k   x i c k x i c k
while the between-group dispersion matrix is:
B q = k = 1 q   n k c k x ¯ c k x ¯
One of the most widely used indices for determining the optimal number of clusters is the silhouette index S i , introduced by [48]. Its computation aids in the interpretation and validation of data clusters [5,48]. The index value is calculated for every single object i as:
S i = b i a i m a x a i ; b i
Here a i represents the average distance of the i t h object to all other objects within the same cluster C r and is given by:
a i = j C r i   d i j n r 1
The term b i denotes the smallest average distance (dissimilarity) from the i t h object to objects in another cluster C s , formulated as:
b i = m i n s r   d i C s = m i n s r   j C s   d i j n s
Two specific distances are of interest:
  • The within-cluster distance α is the mean distance between each observation and its nearest neighbors within the same cluster.
  • The nearest-neighbor distance b represents the mean distance between each observation and the nearest observation from a different cluster.
The optimal number of clusters is typically identified by the maximum silhouette index value [7]. The index S i is valid for k > 1 (i.e., more than one cluster) and ranges between -1 and 1. Values near 1 suggest that an object is well-clustered, while those approaching -1 suggests the observation is in the wrong cluster. A value close to zero indicates that the observation lies between two clusters.
Another notable index is the K L index proposed by [47] as
D I F F q = ( q 1 ) 2 / p   trace   W q 1 q 2 / p trace W q
The D I F F q index is a criterion for choosing the optimal number of clusters in a dataset. The idea is to identify the value of q that maximizes D I F F q , as this value is considered to specify the optimal number of clusters. This index is based on comparing the within-cluster dispersion for q clusters to that of q 1 clusters. Specifically, it looks at the difference between the within-cluster dispersions when there are q clusters and when there are q 1 clusters, adjusting for the number of variables and the number of clusters.
The Caliński and Harabasz ( C H ) index [46] represented by the equation C H q , is a measure used to determine the optimal number of clusters in a dataset:
C H q = trace B q / q 1 trace W q / n q
where trace B q is the sum of the diagonal elements of B q , representing the total between-cluster variance, and trace W q is the sum of the diagonal elements of W q , representing the total within-cluster variance. The numerator measures the between-cluster variance normalized by q 1 (i.e., the degrees of freedom for the between-cluster variance). It quantifies how different the clusters are from each other. The denominator measures the within-cluster variance normalized by n q (i.e., the degrees of freedom for the within-cluster variance). It quantifies how compact each cluster is. The ratio (i.e., the CH index) essentially compares between-cluster variance to within-cluster variance. A higher CH index indicates that the clusters are well-separated from each other and compact. The optimal number of clusters is the value of q that maximizes the C H index. This is because a high value of C H q implies that the between-cluster variance is much greater than the within-cluster variance, which is a desirable quality for clustering.
The Friedman-Rubin ( F R ) index is formulated as:
F R = t r a c e W q 1 B q
The F R index was introduced by [48] as a foundational metric for non-hierarchical clustering methods. The index evaluates the ratio of the between-cluster dispersion to the within-cluster dispersion. According to [79], the optimal number of clusters is indicated by the maximum difference in consecutive values of the F R index. This means, that rather than simply looking for a maximum value, one would analyze the differences between values for consecutive cluster numbers (e.g., the difference between the value for 4 clusters and 5 clusters). The largest jump or change in these differences would indicate the optimal number of clusters. This methodology is based on the idea that as one adds more clusters, the benefit (or difference in the index value) will start to decrease after a certain point. The most significant difference would therefore highlight where the increase in cluster count has the most substantial impact on the separation to dispersion ratio, indicating an optimal cluster count.

Appendix A.3. PCA

In Figure A1, the top section showcases a factor map, often referred to as a variable correlation plot. The relationships between clustering variables, namely height and census data, are demonstrated via a correlation circle. Within the circle, variables that are positively correlated are represented by vectors forming small angles between them.
Figure A1. PCA eigenvalue-variance of height and census data (top) and percentage explained variance for ten principal components (bottom).
Figure A1. PCA eigenvalue-variance of height and census data (top) and percentage explained variance for ten principal components (bottom).
Forests 14 01994 g0a1
Negatively correlated variables have diametrically opposed vectors. The length of each vector from the origin indicates the quality of a variable’s representation on the factor map [33]. Specifically, vectors closer to the circumference of the circle suggest that the variable is well-represented by the first two principal components (PCs) or loadings. The bottom section of Figure A1 presents a scree plot, offering a visual assessment of the importance of each PC in the PCA. The first PC holds the highest relative importance, and the initial three components together explain 86.3% of the total variance. Given the nature of L-moments—specifically L1, L2, L3, and L4—which are analogous to traditional statistical moments of central tendency (mean), dispersion (standard deviation), asymmetry (skewness), and peakedness (kurtosis), their influence on the PCs in a PCA analysis is expected to be comparable to that of central tendency. For this reason, they cannot be used as distinctive clustering variables.
Figure A2 presents the loadings for the first three PCs, derived from an analysis of height and census data. Each plot corresponds to a specific PC: the top for the first component, the middle for the second, and the bottom for the third. A 5% threshold line is drawn on each plot to identify variables with notable contributions to the respective component. Variables clearing this threshold are deemed to have a significant influence on the associated PC and are thus selected as clustering variables. Notably, within this figure, the combined variance explained by the first two PCs amounts to 72.7%, predominantly attributed to height variables (with a single exception) for the specified threshold. Census variables are derived from the third PC.
Figure A2. PCA loading plots for the first three principal components based on height and census data. Top: first PC; Middle: second PC; Bottom: third PC. The 5% threshold (red dash) line indicates significant variable contributions to each component.
Figure A2. PCA loading plots for the first three principal components based on height and census data. Top: first PC; Middle: second PC; Bottom: third PC. The 5% threshold (red dash) line indicates significant variable contributions to each component.
Forests 14 01994 g0a2

Appendix A.4. Optimum Number of Clusters (k) as Influenced by Indices

Figure A3. Impact of the clustering variable on the optimal number of clusters k using the Silhouette index for all the algorithms, and excluding PAM (2nd clustering methodology, without the hmode variable). The minimum number of clusters k is set to 2.
Figure A3. Impact of the clustering variable on the optimal number of clusters k using the Silhouette index for all the algorithms, and excluding PAM (2nd clustering methodology, without the hmode variable). The minimum number of clusters k is set to 2.
Forests 14 01994 g0a3
Figure A3 and Figure A4 support the hypothesis that the Silhouette index is suitable for providing the optimal number of clusters k solely for the PAM algorithm and not for others.
Figure A4. Distribution of the optimal number of clusters k using the Silhouette index for all the algorithms, excluding PAM, illustrating the impact of distance on the optimal k . The minimum k was set to 8 (2nd clustering methodology, without the hmode variable).
Figure A4. Distribution of the optimal number of clusters k using the Silhouette index for all the algorithms, excluding PAM, illustrating the impact of distance on the optimal k . The minimum k was set to 8 (2nd clustering methodology, without the hmode variable).
Forests 14 01994 g0a4
Figure A5. Impact of the algorithm on the optimum number of clusters with the Friedman and Rubin (FR) index (2nd clustering methodology, without the hmode variable).
Figure A5. Impact of the algorithm on the optimum number of clusters with the Friedman and Rubin (FR) index (2nd clustering methodology, without the hmode variable).
Forests 14 01994 g0a5
Figure A6. Impact of clustering variables on the optimum number of clusters with the Friedman and Rubin (FR) index (2nd clustering methodology, without the hmode variable).
Figure A6. Impact of clustering variables on the optimum number of clusters with the Friedman and Rubin (FR) index (2nd clustering methodology, without the hmode variable).
Forests 14 01994 g0a6

Appendix A.5. Correlation

Figure A7. Comparison of linear correlation (blue line) with a weighted correlation by sample size (red line) for Volume (left) and mean height (right).
Figure A7. Comparison of linear correlation (blue line) with a weighted correlation by sample size (red line) for Volume (left) and mean height (right).
Forests 14 01994 g0a7

Appendix A.6. Clustering of the Aggregated Mean Height in FMUs

Figure A8. Distribution of auxiliary data (Mean Height) after clustering. Dots represent the values of the forest management units documenting the well-clustered results. Clusters are ordered by increasing Mean Height (hmean).
Figure A8. Distribution of auxiliary data (Mean Height) after clustering. Dots represent the values of the forest management units documenting the well-clustered results. Clusters are ordered by increasing Mean Height (hmean).
Forests 14 01994 g0a8

References

  1. Chukwu, O.; Dau, J.H. Forest Inventory: Challenges, Trend, and Relevance on Conservation and Restoration of Tropical Forests. In Handbook of Research on the Conservation and Restoration of Tropical Dry Forests; IGI Global: Hershey, PA, USA, 2020; pp. 306–322. [Google Scholar]
  2. Dau, J.H.; Mati, A.; Dawaki, S.A. Role of Forest Inventory in Sustainable Forest Management: A Review. Int. J. For. Hortic. 2015, 1, 33–40. [Google Scholar]
  3. Rao, J.N.; Molina, I. Small Area Estimation; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2015. [Google Scholar]
  4. Rahman, A.; Harding, A. Small Area Estimation and Microsimulation Modeling; Chapman and Hall/CRC: New York, NY, USA, 2017. [Google Scholar]
  5. Giordani, P.; Ferraro, M.B.; Martella, F. An Introduction to Clustering with R; Springer: Berlin/Heidelberg, Germany, 2020. [Google Scholar]
  6. Hartigan, J.A.; Wong, M.A. Algorithm AS 136: A k-means clustering algorithm. J. R. Stat. Society. Ser. C (Appl. Stat.) 1979, 28, 100–108. [Google Scholar] [CrossRef]
  7. Kaufman, L.; Rousseeuw, P.J. Partitioning around Medoids (Program PAM). In Finding Groups in Data: An Introduction to Cluster Analysis; Kaufman, L., Rousseeuw, P.J., Eds.; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 1990; pp. 68–125. [Google Scholar]
  8. Næsset, E. Determination of Mean Tree Height of Forest Stands by Digital Photogrammetry. Scand. J. Forest Res. 2002, 17, 446–459. [Google Scholar] [CrossRef]
  9. Immitzer, M.; Stepper, C.; Böck, S.; Straub, C.; Atzberger, C. Use of WorldView-2 stereo imagery and National Forest Inventory data for wall-to-wall mapping of growing stock. Forest Ecol. Manag. 2016, 359 (Suppl. C), 232–246. [Google Scholar] [CrossRef]
  10. Ullah, S.; Dees, M.; Datta, P.; Adler, P.; Saeed, T.; Khan, M.S.; Koch, B. Comparing the potential of stereo aerial photographs, stereo very high-resolution satellite images, and TanDEM-X for estimating forest height. Int. J. Remote Sens. 2020, 41, 6976–6992. [Google Scholar] [CrossRef]
  11. Strunk, J.L.; Bell, D.M.; Gregory, M.J. Pushbroom Photogrammetric Heights Enhance State-Level Forest Attribute Mapping with Landsat and Environmental Gradients. Remote Sens. 2022, 14, 14. [Google Scholar] [CrossRef]
  12. Fay, R.E.; Herriot, R.A. Estimates of Income for Small Places: An Application of James-Stein Procedures to Census Data. J. Am. Stat. Assoc. 1979, 74, 269–277. [Google Scholar] [CrossRef]
  13. Battese, G.E.; Harter, R.M.; Fuller, W.A. An Error-Components Model for Prediction of County Crop Areas Using Survey and Satellite Data. J. Am. Stat. Assoc. 1988, 83, 28–36. [Google Scholar] [CrossRef]
  14. Breidenbach, J.; Magnussen, S.; Rahlf, J.; Astrup, R. Unit-level and area-level small area estimation under heteroscedasticity using digital aerial photogrammetry data. Remote Sens. Environ. 2018, 212, 199–211. [Google Scholar] [CrossRef]
  15. Goerndt, M.E. Comparison and Analysis of Small Area Estimation Methods for Improving Estimates of Selected Forest Attributes. Ph.D. Thesis, Oregon State University, Oregon, CA, USA, 2010. [Google Scholar]
  16. Magnussen, S.; Mauro, F.; Breidenbach, J.; Lanz, A.; Kändler, G. Area-level analysis of forest inventory variables. Eur. J. For. Res. 2017, 136, 839–855. [Google Scholar] [CrossRef]
  17. Chandra, H.; Chandra, G. Small Area Estimation for Total Basal Cover in The State of Maharashtra in India. In Statistical Methods and Applications in Forestry and Environmental Sciences. Forum for Interdisciplinary Mathematics; Chandra, G., Nautiyal, R., Chandra, H., Eds.; Springer: Singapore, 2020. [Google Scholar]
  18. McConville, K.S.; Moisen, G.G.; Frescino, T.S. A Tutorial on Model-Assisted Estimation with Application to Forest Inventory. Forests 2020, 11, 244. [Google Scholar] [CrossRef]
  19. Newnham, R.M. Cluster analysis: An application in forest management planning. For. Chron. 1992, 68, 628–633. [Google Scholar] [CrossRef]
  20. Smaltschinski, T.; Seeling, U.; Becker, G. Clustering Forest harvest stands on spatial networks for optimized harvest scheduling. Ann. For. Sci. 2012, 69, 651–657. [Google Scholar] [CrossRef]
  21. Vega, C.; Renaud, J.-P.; Sagar, A.; Bouriaud, O. A new small area estimation algorithm to balance between statistical precision and scale. Int. J. Appl. Earth Obs. Geoinf. 2021, 97, 102303. [Google Scholar] [CrossRef]
  22. Georgakis, A. Stratification of Forest Stands as a Basis for Small Area Estimations. In Proceedings of the 33rd PanHellenic statistics conference, Statistics in the Economy and Administration, Larissa, Greece, 23–26 September 2021. [Google Scholar]
  23. University Forest Administration and Management Fund. Pertouli University Forest Management Plan 2019–2028; University Forest Administration and Management Fund: Thessaloniki, Greece, 2018. [Google Scholar]
  24. Kershaw Jr, J.A.; Ducey, M.J.; Beers, T.W.; Husch, B. Forest Mensuration, 5th ed.; John Wiley & Sons: Hoboken, NJ, USA, 2016. [Google Scholar]
  25. Hosking, J.R.M. L-Moments: Analysis and Estimation of Distributions Using Linear Combinations of Order Statistics. J. R. Stat. Soc. Ser. B Stat. Methodol. 1990, 52, 105–124. [Google Scholar] [CrossRef]
  26. Dolloff, J.T.; Theiss, H.J. Temporal correlation of metadata errors for commercial satellite images. Presentation and effects on stereo extraction accuracy. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2012, XXXIX-B1, 215–223. [Google Scholar]
  27. Neigh, C.S.R.; Carroll, M.L.; Montesano, P.M.; Slayback, D.A.; Wooten, M.R.; Lyapustin, A.I.; Shean, D.E.; Alexandrov, O.; Macander, M.J.; Tucker, C.J. An API for Spaceborne Sub-Meter Resolution Products for Earth Science. In IGARSS 2019–2019 IEEE International Geoscience and Remote Sensing Symposium; IEEE: Piscataway, NJ, USA, 2019; pp. 5397–5400. [Google Scholar]
  28. Adolfsson, A.; Ackerman, M.; Brownstein, N.C. To cluster, or not to cluster: An analysis of clusterability methods. Pattern Recognit. 2019, 88, 13–26. [Google Scholar] [CrossRef]
  29. Maechler, M. Diptest: Hartigan’s dip Test Statistic for Unimodality-Corrected. R package Version 0.75-7. 2015. Available online: https://CRAN.R-project.org/package=diptest (accessed on 13 August 2023).
  30. Hopkins, B.; Skellam, J.G. A new method for determining the type of distribution of plant individuals. Ann.Bot. 1954, 18, 213–227. [Google Scholar] [CrossRef]
  31. Bezdek, J.C.; Hathaway, R.J. VAT: A Tool for Visual Assessment of (Cluster) Tendency. In Proceedings of the 2002 International Joint Conference on Neural Networks, IJCNN’02 (Cat. No. 02CH37290). Honolulu, HI, USA, 12–17 May 2002. [Google Scholar]
  32. Kassambara, A. Practical Guide To Cluster Analysis in R: Unsupervised Machine Learning; Sthda.com, 2017; Volume 1. [Google Scholar]
  33. Kassambara, A. Practical Guide To Principal Component Methods in R: PCA, M (CA), FAMD, MFA, HCPC, Factoextra; Sthda.com, 2017; Volume 2. [Google Scholar]
  34. McRoberts, R.E.; Gobakken, T.; Næsset, E. Post-stratified estimation of forest area and growing stock volume using lidar-based stratifications. Remote Sens. Environ. 2012, 125, 157–166. [Google Scholar] [CrossRef]
  35. Westfall, J.A.; Patterson, P.L.; Coulston, J.W. Post-stratified estimation: Within-strata and total sample size recommendations. Can. J. For. Res. 2011, 41, 1130–1139. [Google Scholar] [CrossRef]
  36. Scott, C.; Bechtold, W.; Reams, G.; Smith, W.; Hansen, M.; Moisen, G. Sample-based estimators used by the forest inventory and analysis national information management system. In Proceedings of the Enhanced Forest Inventory and Analysis Program—National Sampling Design and Estimation Procedures, Denver, CO, USA, 21–24 September 2004; Bechtold, W.A., Patterson, P.L., Eds.; USDA Forest Service, Southern Research Station: Asheville, NC, USA, 2005; pp. 43–67. [Google Scholar]
  37. Bechtold, W.; Scott, C. The Enhanced Forest Inventory and Analysis Program—National Sampling Design and Estimation Procedures. In Proceedings of the Enhanced Forest Inventory and Analysis Program—National Sampling Design and Estimation Procedures, Denver, CO, USA, 21–24 September 2004; Bechtold, W.A., Patterson, P.L., Eds.; USDA Forest Service, Southern Research Station: Asheville, NC, USA, 2005; pp. 27–42. [Google Scholar]
  38. Ruiz, L.; Hermosilla, T.; Mauro, F.; Godino, M. Analysis of the Influence of Plot Size and LiDAR Density on Forest Structure Attribute Estimates. Forests 2014, 5, 936. [Google Scholar] [CrossRef]
  39. Chambers, R.; Clark, R. An Introduction To Model-Based Survey Sampling With Applications; OUP Oxford: Oxford, UK, 2012; Volume 37. [Google Scholar]
  40. Magnussen, S. Arguments for a model-dependent inference? For. Int. J. For. Res. 2015, 88, 317–325. [Google Scholar] [CrossRef]
  41. Cochran, W.G. Sampling Techniques, 3rd ed.; Wiley: New York, NY, USA, 1997. [Google Scholar]
  42. Strunk, J.; Packalen, P.; Gould, P.; Gatziolis, D.; Maki, C.; Andersen, H.-E.; McGaughey, R.J. Large Area Forest Yield Estimation with Pushbroom Digital Aerial Photogrammetry. Forests 2019, 10, 397. [Google Scholar] [CrossRef]
  43. Charrad, M.; Ghazzali, N.; Boiteau, V.; Niknafs, A. NbClust: An R package for determining the relevant number of clusters in a data set. J. Stat. Softw. 2014, 61, 1–36. [Google Scholar] [CrossRef]
  44. Ward, J.H. Hierarchical Grouping to Optimize an Objective Function. J. Am. Stat. Assoc. 1963, 58, 236–244. [Google Scholar] [CrossRef]
  45. Xu, D.; Tian, Y. A Comprehensive Survey of Clustering Algorithms. Ann. Data Sci. 2015, 2, 165–193. [Google Scholar] [CrossRef]
  46. Caliński, T.; Harabasz, J. A dendrite method for cluster analysis. Commun. Stat. 1974, 3, 1–27. [Google Scholar]
  47. Krzanowski, W.J.; Lai, Y.T. A Criterion for Determining the Number of Groups in a Data Set Using Sum-of-Squares Clustering. Biometrics 1988, 44, 23–34. [Google Scholar] [CrossRef]
  48. Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53–65. [Google Scholar] [CrossRef]
  49. Friedman, H.P.; Rubin, J. On Some Invariant Criteria for Grouping Data. J. Am. Stat. Assoc. 1967, 62, 1159–1178. [Google Scholar] [CrossRef]
  50. Dunn, J.C. Well-separated clusters and optimal fuzzy partitions. J. Cybern. 1974, 4, 95–104. [Google Scholar] [CrossRef]
  51. Georgakis, A.; Diamantopoulou, M.J.; Trigkas, M. Methodology for the Establishment of Sample Plots and Estimation of Growing Stock Volume In Greek Forest Stands. In Proceedings of the 20th Panhellenic Forestry Conference, Trikala, Greece, 3–6 October 2021. [Google Scholar]
  52. Mauro, F.; Molina, I.; García-Abril, A.; Valbuena, R.; Ayuga-Téllez, E. Remote sensing estimates and measures of uncertainty for forest variables at different aggregation levels. Environmetrics 2016, 27, 225–238. [Google Scholar] [CrossRef]
  53. Team, R. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2022; Available online: http://www.r-project.org/index.html (accessed on 13 August 2023).
  54. Maechler, M.; Rousseeuw, P.; Struyf, A.; Hubert, M.; Hornik, K. Package Cluster: Cluster Analysis Basics and Extensions. R Package Version 2.1.4. 2022. Available online: https://CRAN.R-project.org/package=cluster (accessed on 13 August 2023).
  55. Molina, I.; Rao, J.; Datta, G.S. Small area estimation under a Fay–Herriot model with preliminary testing for the presence of random area effects. Surv. Methodol. 2015, 41, 1–19. [Google Scholar]
  56. Benavent, R.; Morales, D. Multivariate Fay–Herriot models for small area estimation. Comput. Stat. Data Anal. 2016, 94, 372–390. [Google Scholar] [CrossRef]
  57. Pratesi, M.; Salvati, N. Small area estimation: The EBLUP estimator based on spatially correlated random area effects. Stat. Methods Appt. 2008, 17, 113–141. [Google Scholar] [CrossRef]
  58. Ver Planck, N.R.; Finley, A.O.; Kershaw, J.A.; Weiskittel, A.; Kress, R.M.C. Hierarchical Bayesian models for small area estimation of forest variables using LiDAR. Remote Sens. Environ. 2018, 204, 287–295. [Google Scholar] [CrossRef]
  59. Georgakis, A.; Stamatellos, G. Sampling Design Contribution to Small Area Estimation Procedure in Forest Inventories. Mod. Concep. Dev. Agrono. 2020, 7, 694–697. [Google Scholar] [CrossRef]
  60. Hill, A. Integration of Small Area Estimation Procedures in Large-Scale Forest Inventories. Doctoral Dissertation, ETH Zurich, Zürich, Switzerland, 2018. Available online: http://hdl.handle.net/20.500.11850/305920 (accessed on 13 August 2023).
  61. Hill, A.; Mandallaz, D.; Langshausen, J. A Double-Sampling Extension of the German National Forest Inventory for Design-Based Small Area Estimation on Forest District Levels. Remote Sens. 2018, 10, 1052. [Google Scholar] [CrossRef]
  62. Mandallaz, D. Design-based properties of some small-area estimators in forest inventory with two-phase sampling. Can. J. For. Res. 2013, 43, 441–449. [Google Scholar] [CrossRef]
  63. Molefe, W.B. Sample Design for Small Area Estimation. Doctoral Thesis, University of Wollongong, Wollongong, Australia, 2011. Available online: https://ro.uow.edu.au/theses/3495 (accessed on 13 August 2023).
  64. Zimmermann, T. The Interplay between Sampling Design and Statistical Modelling in Small Area Estimation. Ph.D. Thesis, Trier University, Trier, Germany, 2018. [Google Scholar]
  65. Haakana, H.; Heikkinen, J.; Katila, M.; Kangas, A. Efficiency of post-stratification for a large-scale forest inventory—Case Finnish NFI. Ann. For. Sci. 2019, 76, 9. [Google Scholar] [CrossRef]
  66. You, Y.; Chapman, B. Small area estimation using area level models and estimated sampling variances. Surv. Methodol. 2006, 32, 97. [Google Scholar]
  67. Georgakis, A. Further Improvements of Growing Stock Volume Estimations at Stratum-Level with the Application of Fay-Herriot Model. In Proceedings of the 33rd PanHellenic Statistics Conference, Statistics in the Economy and Administration, Larissa, Greece, 23–26 September 2021. [Google Scholar]
  68. Zulkarnain, R.; Jayanti, D.; Listianingrum, T. Improving the quality of disaggregated SDG indicators with cluster information for small area estimates. Stat. J. IAOS 2020, 36, 955–961. [Google Scholar] [CrossRef]
  69. Torkashvand, E.; Jozani, M.J.; Torabi, M. Clustering in small area estimation with area level linear mixed models. J. R. Stat. Soc. Ser. A Stat. Soc. 2017, 180, 1253–1279. [Google Scholar] [CrossRef]
  70. Anisa, R.; Kurnia, A.; Indahwati, I. Cluster Information of Non-Sampled Area In Small Area Estimation. IOSR J. Math. 2014, 10, 15–19. [Google Scholar] [CrossRef]
  71. Desiyanti, A.; Ginanjar, I.; Toharudin, T. Application of an Empirical Best Linear Unbiased Prediction Fay-Herriot (EBLUP-FH) Multivariate Method with Cluster Information to Estimate Average Household Expenditure. Mathematics 2022, 11, 135. [Google Scholar] [CrossRef]
  72. Ginanjar, I.; Wulandary, S.; Toharudin, T. Empirical Best Linear Unbiased Prediction Method with K-Medoids Cluster for Estimate Per Capita Expenditure of Sub-District Level. IAENG Int. J. Appl. Math. 2022, 52, 1–7. [Google Scholar]
  73. Blaschke, T. Object based image analysis for remote sensing. ISPRS J. Photogramm. Remote Sens. 2010, 65, 2–16. [Google Scholar] [CrossRef]
  74. Jia, W.; Sun, Y.; Pukkala, T.; Jin, X. Improved Cellular Automaton for Stand Delineation. Forests 2020, 11, 37. [Google Scholar] [CrossRef]
  75. Pukkala, T. Can Kohonen networks delineate forest stands? Scand. J. For. Res. 2021, 36, 198–209. [Google Scholar] [CrossRef]
  76. Sun, Y.; Wang, W.; Pukkala, T.; Jin, X. Stand delineation based on laser scanning data and simulated annealing. Eur. J. For. Res. 2021, 140, 1065–1080. [Google Scholar] [CrossRef]
  77. Pascual, A.; Tóth, S.F. Using mixed integer programming and airborne laser scanning to generate forest management units. J. For. Res. 2022, 33, 217–226. [Google Scholar] [CrossRef]
  78. Georgakis, A.; Papageorgiou, V.E.; Stamatellos, G. Bivariate Fay-Herriot Model for Enhanced Small Area Estimation of Growing Stock Volume. In Proceedings of the International Conference on Applied Mathematics & Computer Science, IEEE Computer Society, Lefkada, Greece, 8–10 August 2023. [Google Scholar]
  79. Milligan, G.W.; Cooper, M.C. An Examination of Procedures for Determining the Number of Clusters in a Data Set. Psychometrika 1985, 50, 159–179. [Google Scholar] [CrossRef]
Figure 1. The Pertouli Forest study area and sampling design.
Figure 1. The Pertouli Forest study area and sampling design.
Forests 14 01994 g001
Figure 2. Map of the canopy height model with white color denoting no height data. The distribution of canopy height means across forest management units (area-level data) is shown in the inset.
Figure 2. Map of the canopy height model with white color denoting no height data. The distribution of canopy height means across forest management units (area-level data) is shown in the inset.
Forests 14 01994 g002
Figure 3. Visual illustration of mean height clusterability (left) compared to random numbers (right). Red indicates high similarity and blue low.
Figure 3. Visual illustration of mean height clusterability (left) compared to random numbers (right). Red indicates high similarity and blue low.
Forests 14 01994 g003
Figure 4. Flowchart of cluster analysis for optimal number of clusters and small area estimation.
Figure 4. Flowchart of cluster analysis for optimal number of clusters and small area estimation.
Forests 14 01994 g004
Figure 5. Average Silhouette with values for different numbers of clusters (k) using the Partitioning Around Medoids (PAM) algorithm and various clustering variables (1st clustering methodology). Circled values indicate the optimal k.
Figure 5. Average Silhouette with values for different numbers of clusters (k) using the Partitioning Around Medoids (PAM) algorithm and various clustering variables (1st clustering methodology). Circled values indicate the optimal k.
Forests 14 01994 g005
Figure 6. Impact of the distance metric on the frequency of the optimal number of clusters, determined using the Caliński and Harabasz (CH) index without the hmode variable (2nd methodology).
Figure 6. Impact of the distance metric on the frequency of the optimal number of clusters, determined using the Caliński and Harabasz (CH) index without the hmode variable (2nd methodology).
Forests 14 01994 g006
Figure 7. The impact of the clustering variable on the frequency of the optimal number of clusters using the Krzanowski and Lai (KL) index (top panel). The impact of the algorithm on the frequency of the optimal number of clusters using the KL index (bottom panel) (2nd clustering methodology, without the hmode variable).
Figure 7. The impact of the clustering variable on the frequency of the optimal number of clusters using the Krzanowski and Lai (KL) index (top panel). The impact of the algorithm on the frequency of the optimal number of clusters using the KL index (bottom panel) (2nd clustering methodology, without the hmode variable).
Forests 14 01994 g007
Figure 8. Direct estimates for Volume, Basal Area, Tree Density, and Mean Height were obtained using spatial consistency, the PAM algorithm, the Silhouette index, and Euclidean distance in 30 clusters. Each cluster had an average of 8 sample plots and a 10% average relative standard error (SN 6).
Figure 8. Direct estimates for Volume, Basal Area, Tree Density, and Mean Height were obtained using spatial consistency, the PAM algorithm, the Silhouette index, and Euclidean distance in 30 clusters. Each cluster had an average of 8 sample plots and a 10% average relative standard error (SN 6).
Forests 14 01994 g008
Figure 9. Relative Standard Error (RSE) distribution for inventory variables in SN 25 scheme. Utilizes Silhouette index, PAM algorithm, and Euclidean distance with h50 and ForestDensity97 clustering variables. Variables are ordered by decreasing median RSEs.
Figure 9. Relative Standard Error (RSE) distribution for inventory variables in SN 25 scheme. Utilizes Silhouette index, PAM algorithm, and Euclidean distance with h50 and ForestDensity97 clustering variables. Variables are ordered by decreasing median RSEs.
Forests 14 01994 g009
Figure 10. Relative Efficiency (RE) for inventory variables in clustering scheme SN 25. The scheme comprises 33/42 clusters, omitting 3 with RE > 10 and 6 single-plot clusters. Constructed using the PAM algorithm, Euclidean distance, Silhouette index, and variables h50 and ForestDensity97.
Figure 10. Relative Efficiency (RE) for inventory variables in clustering scheme SN 25. The scheme comprises 33/42 clusters, omitting 3 with RE > 10 and 6 single-plot clusters. Constructed using the PAM algorithm, Euclidean distance, Silhouette index, and variables h50 and ForestDensity97.
Forests 14 01994 g010
Figure 11. Scatterplot matrix for SN 25, using Silhouette index, Euclidean distance, and PAM algorithm (1st methodology). Includes vegetation height h50 clustering variable and Tree Density from past census data. Removed one-plot clusters and some outliers; 32 of 42 clusters remained. “***”, “**”, and “*” indicate p-value < 0.001, < 0.01, and < 0.05, respectively.
Figure 11. Scatterplot matrix for SN 25, using Silhouette index, Euclidean distance, and PAM algorithm (1st methodology). Includes vegetation height h50 clustering variable and Tree Density from past census data. Removed one-plot clusters and some outliers; 32 of 42 clusters remained. “***”, “**”, and “*” indicate p-value < 0.001, < 0.01, and < 0.05, respectively.
Forests 14 01994 g011
Table 1. Summary statistics of variables of interest from the field sample plots: unit-level data.
Table 1. Summary statistics of variables of interest from the field sample plots: unit-level data.
Variable of Interest
(Measurement Unit)
Mean ± Standard Error
(SE)
Relative SE % of
the Mean
MinimumMaximum
Volume (m3/ha)303.96 ± 6.572.1613.99842.30
Basal Area (m2/ha)32.55 ± 0.641.953.7784.16
Tree Density (Trees/ha)582.20 ± 1.622.78170.001770.00
Mean Height (m)20.16 ± 0.171.087.8727.04
Table 2. Wall-to-wall aggregated, area-level clustering auxiliary variables (height and census) of forest management units for clustering, along with unit centroid coordinates.
Table 2. Wall-to-wall aggregated, area-level clustering auxiliary variables (height and census) of forest management units for clustering, along with unit centroid coordinates.
Data TypeDescriptive StatisticsAbbreviationDescriptionUnit Metric
HeightQuantilesh25; h50; h75; h95Percentiles of canopy heightMeters (m)
HeightCentral tendencyhmean; hmodeCell height mean and mode
(most frequent height in a cell)
m
HeightDispersionhsd; hcvCell height standard deviation and coefficient of variationm; ratio
HeightL-MomentsL1; L2; L3; L4L1: mean height of all points in sample distribution; L2: similar to hsd; L3, L4: analogous to skewness/kurtosis (a measure of distribution shape)m; m;
ratio; ratio
HeightL-ratioshLcv; hLskewhLcv = L2/L1, similar to hcv;
hLskew = L3/L2
ratio; ratio
CensusCentral tendencyFirTreeDensity88haHybrid fir mean tree density (1988 census)Trees/ha
CensusCentral tendencyFirGSV88haHybrid fir mean volume (1988 census) m 3 h a
CensusCentral tendencyForestDensity97haMean all-species tree density (1997 census)Trees/ha
CensusCentral tendencyForestGSV97haMean all-species volume (1997 census) m 3 h a
GeolocationSpatialXY centroid
coordinates
FMU centroid coordinatesm
Combined variables for clustering
(“_” used as variable delimiter)
hmean_hsd; hLcv_hLskew; ForestDensity97_hmean; h50_ForestDensity97; h50_ForestDensity97_X_Y; h50_X_Y
Table 3. Summary of optimal clustering schemes: methodologies, k -ranges, and best indices for small area estimation.
Table 3. Summary of optimal clustering schemes: methodologies, k -ranges, and best indices for small area estimation.
Methodology 1
k
-Range
Clustering SchemesClustering
Index 2
Number of
Optimal
Clustering
Schemes
Best Index Value 3
Method/AlgorithmDistance metricVariables
1st2–50PAM or k-medoidsEuclidean13S1313
2nd8–50Ward.D, Ward.D2, single, complete, average, McQuitty, median, centroid, k-meansEuclidean, maximum, Manhattan, Minkowski13CH4688
FR4687
KL46815
S4688
Sum1; 9 (1st; 2nd)1; 4 (1st; 2nd)13 (1st; 2nd)1; 4188551
1: Methodology: 1st: 13 Clustering Schemes = PAM algorithm × Euclidean distance × 13 Variables; 2nd: 468 Clustering Schemes = 9 Algorithms × 4 Distances × 13 Variables. 2: The clustering index suggests the optimal clustering scheme or the optimal number of clusters k, after setting the minimum and maximum threshold k-Range. More details in Appendix A and Table A1. 3: The Best Index Value represents a selection of the most “optimal” indices values, which guide the identification of the most suitable clustering schemes for small area estimation applications.
Table 4. Evaluation of best clustering schemes for direct small area estimates based on growing stock volume. The evaluation criteria for clustering include the Mean of Relative Standard Errors (RSEs), the 90th percentile (p90) of RSEs, the standard deviation (StD) of the mean of RSEs, and the average number of sample plots (nPlots) per cluster within each clustering scheme solution (rows).
Table 4. Evaluation of best clustering schemes for direct small area estimates based on growing stock volume. The evaluation criteria for clustering include the Mean of Relative Standard Errors (RSEs), the 90th percentile (p90) of RSEs, the standard deviation (StD) of the mean of RSEs, and the average number of sample plots (nPlots) per cluster within each clustering scheme solution (rows).
SNDistAlgorithmClustering
Variables
IndexBest Index Value k
Clusters
Mean of
RSEs
StD
of RSEs
p90 of
RSEs
Mean of
nPlots
StD
of nPlots
* 1 plot
Clusters
1Manward.Dh50_Dens_X_YCH50.47130.080.030.1018.3810.950
2Euclk-meansh50_X_YCH81.5780.070.030.1029.8813.270
3Euclk-meansHmeanS0.61140.100.100.1117.078.950
4EuclPAM (1)h95S0.60140.080.050.1218.007.191
5Manward.Dhmean_DensKL2002.22140.100.100.1217.0710.310
6EuclPAM (1)h50_Dens_X_YS0.28300.100.040.128.074.051
7EuclPAM (1)h25S0.59130.090.060.1218.089.370
8Euclward.Dh50KL141.49190.080.030.1214.757.043
9Maxk-meansh50_X_YS0.31150.090.040.1315.938.400
10Euclaverageh50_Dens_X_YKL261.42250.090.040.1311.198.184
11Euclward.DhLcv_hLskewKL2013.08170.090.030.1314.888.681
12Euclk-meansh50CH715.99190.110.080.1313.226.611
13Euclk-meanshLcvCH600.54170.100.040.1514.885.521
14EuclPAM (1)hmeanS0.60220.090.040.1611.657.462
15Euclk-meansh75S0.58260.120.110.179.524.341
16Euclk-meanshLcv_hLskewCH188.71340.120.070.177.213.701
17EuclMcQuittyh75FR13781.53360.120.100.177.774.736
18EuclPAM (1)h75S0.58330.100.040.188.214.525
19EuclPAM (1)h50S0.60310.110.050.188.564.434
20Euclk-meanshmeanFR2864.02310.130.130.188.174.872
21Maxcompleteh50_X_YCH77.08330.120.060.197.653.822
22Euclcompleteh50S0.64350.130.070.198.034.356
23Euclaveragehmean_hsdKL574.46260.110.070.1912.8311.758
24EuclPAM (1)hmean_DensS0.40410.120.050.206.543.996
25EuclPAM (1)h50_DensityS0.43420.120.070.206.363.916
26Maxmedianhmean_DensKL540.85280.130.070.2312.7818.029
27Euclk-meansh50_X_YS0.32300.120.080.237.973.360
28Euclsingleh95KL168.80230.140.120.2313.7122.486
29EuclsinglehmeanFR6240.10320.130.120.249.638.728
30Euclmedianh50_X_YCH41.68260.140.090.2413.5311.049
31EuclPAM (1)hLcvS0.65340.140.090.257.283.532
32EuclsinglehmeanCH1265.23360.140.120.278.258.118
33Euclward.DhLcvS0.67390.160.100.276.413.692
34EuclmedianhLcvFR7872.32390.160.120.326.564.953
(1): PAM is accompanied always by Euclidean distance and Silhouette index. * Clusters containing only a single plot were excluded, as they cannot produce sampling variance, rendering them incompatible with both direct and model-based small area estimates. Shortcuts: “Dens” or ForestDensity97 refers to the 1997 census Tree Density variable, “Dist” for Distance, “Eucl” for Euclidean, “Man” for Manhattan, and “Max” for Maximum.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Georgakis, A.; Gatziolis, D.; Stamatellos, G. A Primer on Clustering of Forest Management Units for Reliable Design-Based Direct Estimates and Model-Based Small Area Estimation. Forests 2023, 14, 1994. https://doi.org/10.3390/f14101994

AMA Style

Georgakis A, Gatziolis D, Stamatellos G. A Primer on Clustering of Forest Management Units for Reliable Design-Based Direct Estimates and Model-Based Small Area Estimation. Forests. 2023; 14(10):1994. https://doi.org/10.3390/f14101994

Chicago/Turabian Style

Georgakis, Aristeidis, Demetrios Gatziolis, and Georgios Stamatellos. 2023. "A Primer on Clustering of Forest Management Units for Reliable Design-Based Direct Estimates and Model-Based Small Area Estimation" Forests 14, no. 10: 1994. https://doi.org/10.3390/f14101994

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop