1.1. Interpolation and Extrapolation
Interpolation and extrapolation serve as two primary frameworks in supervised learning algorithms ranging from function approximation to deep learning [1], with applications spanning engineering [2], science [3], economics [4], and statistics [5]. Interpolation predicts a new sample’s target value based on known data points within a specified range [6]. Minda et al. [7] compared the most common interpolation methods. It is vital to note that for interpolation to be applicable, the new observation must lie within the known sample space. For instance, k-nearest neighbor (kNN) is a good example of an interpolation method. It finds a sample’s nearest neighbors in a local subspace centered around the sample under a defined distance metric (e.g., the Euclidean distance). As shown in Figure 1, the target value of a new observation x_0, which falls within the known data range, can be approximated as the average of its neighbors’ targets, ŷ_0 = (y_1 + y_2 + y_3)/3, where x_1, x_2, and x_3 are the three nearest training points to x_0. For interpolation, a critical consideration is selecting the appropriate interpolation function. Over the years, various functions have emerged, including linear interpolation, polynomial interpolation, and spline interpolation [8], which have varying properties in terms of accuracy, computational cost, data point requirements, and functional smoothness. Linear interpolation, while simple and fast, may not capture complex relationships between features and targets [9]. Polynomial interpolation, which uses the lowest-degree polynomial that fits all data points, includes methods such as Newton and Hermite interpolation [10]. Challu et al. [11] proposed neural hierarchical interpolation for time series (NHITS), a model integrating hierarchical interpolation and multi-rate data sampling. Sekulić et al. [12] investigated the significance of incorporating observations from the nearest locations, along with their distances from the target prediction site, through random forest spatial interpolation (RFSI). Although polynomial interpolation can offer higher precision than linear interpolation, it is computationally intensive and may exhibit oscillations. Nearest-neighbor interpolation, a zero-order polynomial interpolation, assigns the value of an interpolated point based on its nearest existing data point(s).
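As a concrete illustration of the kNN averaging scheme, the following minimal, self-contained Python sketch (a toy example of our own, not tied to any dataset discussed here) assumes equal weighting of the k nearest targets:

```python
import math

def knn_predict(x_new, X, y, k=3):
    """Average the targets of the k training points nearest to x_new
    under the Euclidean distance."""
    ranked = sorted(zip(X, y), key=lambda t: math.dist(x_new, t[0]))
    return sum(yi for _, yi in ranked[:k]) / k

# Toy training set on a plane; targets follow y = x1 + x2.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 2.0)]
y = [0.0, 1.0, 1.0, 2.0, 4.0]

# (0.9, 0.9) lies inside the known data range, so interpolation applies.
print(knn_predict((0.9, 0.9), X, y))  # ≈ 1.33, averaging targets 2, 1, 1
```

With k = 3 this reproduces the three-neighbor average ŷ = (y_1 + y_2 + y_3)/3; the true value x_1 + x_2 = 1.8 differs because the neighbors’ targets are averaged without distance weighting.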
Extrapolation is inherently more challenging than interpolation, as it predicts outside the known data space. Linear extrapolation posits a linear relationship between features and targets, offering simplicity but sometimes missing underlying distribution complexities. As shown in Figure 2a, using a function f fitted to the known points via linear extrapolation techniques, the target value of a new observation x_0, which falls outside the known data space, can be approximated as ŷ_0 = f(x_0). Polynomial extrapolation can fit non-linear data effectively, as shown in Figure 2b. Selecting the appropriate extrapolation method requires understanding the data’s inherent characteristics, e.g., whether they are continuous, smooth, or periodic. Incorporating domain knowledge often proves valuable for extrapolation [13]. Webb et al. [14] addressed the challenge of learning representations that facilitate extrapolation and proposed a novel visual analogy benchmark that enables a graded assessment of extrapolation based on the distance from the convex domain defined by the training dataset. Zhu et al. [15] systematically explored the extrapolation behavior of deep operator networks through rigorous quantification of extrapolation complexity and proposed a novel bias–variance trade-off strategy for extrapolation.
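To make the linear case concrete, the following sketch (a generic illustration, not a method from this paper) fits y = a·x + b by ordinary least squares and evaluates it outside the observed range:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (yv - my) for x, yv in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.1, 7.9]   # roughly y = 2x with noise
a, b = fit_line(xs, ys)

# x = 6 lies outside the observed range [1, 4]: this is extrapolation.
print(a * 6.0 + b)          # ≈ 11.86
```

The estimate is trustworthy only insofar as the linear trend continues beyond x = 4, which is precisely the assumption extrapolation must make.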
The effectiveness of extrapolation relies on the assumption about the functional form [16]. In Figure 3, which illustrates three known data points (from x_1 to x_3), the true curve is a third-order polynomial (solid black line), but the polynomial extrapolation wrongly assumes a quadratic curve (black dashed line). This underscores that extrapolation is inherently uncertain, with a heightened risk of yielding misleading results. Such issues are best mitigated when the functional form assumed by the extrapolation technique closely mirrors the underlying nature of the data.
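The risk illustrated in Figure 3 can be reproduced numerically. In this sketch (toy data of our own, with a cubic standing in for the unknown true curve), a quadratic fitted exactly through three points tracks the cubic well inside the data range but diverges rapidly outside it:

```python
def quad_through(pts, x):
    """Evaluate at x the unique quadratic through three points (Lagrange form)."""
    (x0, y0), (x1, y1), (x2, y2) = pts
    return (y0 * (x - x1) * (x - x2) / ((x0 - x1) * (x0 - x2))
            + y1 * (x - x0) * (x - x2) / ((x1 - x0) * (x1 - x2))
            + y2 * (x - x0) * (x - x1) / ((x2 - x0) * (x2 - x1)))

def cubic(x):
    return x ** 3                            # the (unknown) true curve

pts = [(x, cubic(x)) for x in (0.0, 1.0, 2.0)]

for x in (1.5, 3.0, 4.0):                    # inside, then outside, [0, 2]
    print(x, quad_through(pts, x), cubic(x))
```

At x = 1.5 the quadratic misses by 0.375; at x = 3 by 6; at x = 4 by 24. The error grows with the distance from the data precisely because the assumed functional form (quadratic) does not match the true one (cubic).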
Interpolation and extrapolation can be viewed as linear approximation methods within the unit disk of the complex plane [17]. The most effective methods identified for interpolation and extrapolation include widely adopted techniques such as cubic spline interpolation and Gaussian process regression [18]. Rosenfeld et al. [19] provided a rigorous demonstration that extrapolation poses significantly greater computational challenges than interpolation based on reweighting of sub-group likelihoods, while the statistical complexity remains relatively unchanged.
Interpolation and extrapolation, while serving distinct roles, are both crucial for making predictions from data. Interpolation is primarily employed to fill gaps in existing records, acting as a bridge to seamlessly integrate missing data within known boundaries, and kNN serves as a predictive model with good interpolation abilities. On the other hand, extrapolation goes beyond these bounds, making predictions for entirely new observations based on the trends and patterns identified in the existing dataset, and linear regression serves as a predictive model with good extrapolation abilities. The accuracy and efficacy of these methods, however, are heavily influenced by the context in which they are used.
When working with a univariate feature variable, classifying a new data point as either interior or exterior to the known dataset is relatively straightforward. If the point falls within the dataset’s range, interpolation is the method of choice. Conversely, if it lies outside this range, extrapolation should be employed. However, the task becomes more difficult with a multivariate feature vector. In such cases, the task of determining whether a new data point is interior or exterior to the existing feature space grows complex. The presence of multiple dimensions can lead to scenarios where a point might be deemed interior in one feature dimension but exterior in another. Consequently, this complexity gives rise to a pressing research question: How can the intricacies of multivariate data be effectively dealt with by leveraging the strengths of both interpolation and extrapolation while mitigating their limitations?
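This ambiguity can be demonstrated directly. The sketch below (a hypothetical 2D special case, not the optimization model developed in this paper) contrasts a naive per-dimension range check with an exact convex-hull test for a triangle of training points:

```python
def in_bounding_box(p, X):
    """Per-dimension range check: interior in every coordinate separately."""
    return all(min(x[d] for x in X) <= p[d] <= max(x[d] for x in X)
               for d in range(len(p)))

def in_hull_2d(p, a, b, c):
    """Exact interior test for the triangle a-b-c via signed areas."""
    def cross(o, u, v):
        return (u[0] - o[0]) * (v[1] - o[1]) - (u[1] - o[1]) * (v[0] - o[0])
    s1, s2, s3 = cross(a, b, p), cross(b, c, p), cross(c, a, p)
    return (s1 >= 0 and s2 >= 0 and s3 >= 0) or (s1 <= 0 and s2 <= 0 and s3 <= 0)

X = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0)]   # training points: a thin triangle
p = (0.5, 3.5)                              # inside the box, outside the hull
print(in_bounding_box(p, X))   # True  -- looks interior per dimension
print(in_hull_2d(p, *X))       # False -- actually exterior to the data
print(in_hull_2d((3.0, 1.0), *X))  # True -- a genuinely interior point
```

The point (0.5, 3.5) is “interior” in each coordinate taken separately, yet exterior to the convex hull of the data, so a per-dimension check would wrongly recommend interpolation; in higher dimensions, deciding membership requires solving a feasibility problem.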
1.2. Contributions and Organization
To address the above research question, we establish a mathematical programming model to classify whether a new multivariate data point is interior or exterior to the known dataset. By solving the established optimization model, we obtain the defined centrality coefficient of the new data point. Accordingly, we propose a novel hybrid prediction framework that integrates both interpolation and extrapolation methods by taking advantage of the centrality coefficient. If the new observation is an interior point to the known dataset, we can use prediction methods with good interpolation abilities, such as kNN. Otherwise, we can use prediction methods with good extrapolation abilities, such as linear regression. Consequently, our hybrid prediction framework takes advantage of both interpolation and extrapolation abilities.
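The resulting decision rule can be sketched in the univariate case, where the interior/exterior classification reduces to a range check. The kNN average and least-squares line below are illustrative stand-ins for the framework’s components; the centrality coefficient obtained from the optimization model replaces the simple range test in the multivariate setting:

```python
def hybrid_predict(x_new, xs, ys, k=3):
    """Univariate sketch of the hybrid idea: interpolate (kNN average)
    when x_new lies inside the data range, extrapolate (least-squares
    line) when it lies outside."""
    if min(xs) <= x_new <= max(xs):                      # interior: kNN
        nearest = sorted(zip(xs, ys), key=lambda t: abs(t[0] - x_new))[:k]
        return sum(yv for _, yv in nearest) / k
    n = len(xs)                                          # exterior: OLS line
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (yv - my) for x, yv in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a * (x_new - mx) + my

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 4.0, 9.0, 16.0, 25.0]    # y = x^2
print(hybrid_predict(2.5, xs, ys))  # interior point -> kNN interpolation
print(hybrid_predict(7.0, xs, ys))  # exterior point -> linear extrapolation
```

On this convex trend the exterior branch returns 35.0 while the true value is 7² = 49, illustrating once more that the accuracy of each branch still depends on how well its assumed functional form matches the data.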
Our framework distinguishes itself from the existing interpolation and extrapolation methods in several ways:
It can handle both interior and exterior data points without prior knowledge or assumptions;
It flexibly selects the optimal prediction strategy by considering the centrality coefficient obtained from the optimization model;
It enhances the precision of predictions by harnessing the collective power of both interpolation and extrapolation abilities.
As a practical application, we harness our framework to address the ship deficiency prediction problem using the port state control (PSC) inspection dataset for the port of Hong Kong. A comparative analysis against the simple uses of kNN and linear regression reveals that our model excels in specific scenarios. This paper, therefore, stands as a valuable addition to the literature, offering a refreshed and effective method that melds the advantages of both interpolation and extrapolation.
The remainder of this paper is organized as follows. Section 2 presents our optimization model for classifying exterior or interior data points and describes our hybrid framework combining kNN and linear regression. Section 3 describes the numerical experiments within the considered case study, focusing on ship deficiency prediction. Section 4 concludes our paper and suggests future research directions.