Reconciling Functional Data Regression with Excess Bases

Tomoya Wakayama, Graduate School of Economics, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan
Hidetoshi Matsui, Faculty of Data Science, Shiga University, 1-1-1 Banba, Hikone, Shiga, Japan

(Date: July 7, 2024, Contact: [email protected])

Abstract.

As the development of measuring instruments and computers has accelerated the collection of massive amounts of data, functional data analysis (FDA) has experienced a surge of attention. The FDA methodology treats longitudinal data as a set of functions on which inference, including regression, is performed. Functionalizing data typically involves fitting the data with basis functions. Conventionally, the number of basis functions is chosen to be smaller than the sample size. This paper casts doubt on this convention. Recent statistical theory has revealed the so-called double-descent phenomenon, in which excess parameters overcome overfitting and lead to precise interpolation. Applying this idea to the choice of the number of bases used for functional data, we show that choosing an excess number of bases can lead to more accurate predictions. Specifically, we explore this phenomenon in a functional regression context and examine its validity through numerical experiments. In addition, we introduce two real-world datasets to demonstrate that the double-descent phenomenon goes beyond theory and numerical experiments, confirming its importance in practical applications.

Keywords. Basis expansion; Double-descent; Functional data regression; Minimum norm interpolator

1. Introduction

Functional data analysis (FDA) has emerged as a powerful tool for analyzing longitudinal data across diverse fields, including biology, medicine, economics, and the social sciences (Ramsay and Silverman, 2005; Horváth and Kokoszka, 2012; Kokoszka and Reimherr, 2017; Wang et al., 2016). The fundamental concept of FDA is to represent the longitudinally measured data for each individual as a smooth function and then analyze the collection of functions using various statistical techniques (Hsing and Eubank, 2015). This approach offers several advantages, such as reducing observational errors through smoothing and accommodating varying time points and numbers of observations for different subjects (e.g., Wakayama and Sugasawa, 2024).

In FDA, basis expansion is a widely used technique for transforming longitudinal data into functional data (Fujii and Konishi, 2006; Araki et al., 2009). Basis expansion is known for its ability to smooth noisy data and reveal the underlying structure (Green and Silverman, 1993; Hastie et al., 2009). In numerous FDA methodologies, such as functional regression and time series analysis, selecting the number of basis functions is a pivotal issue due to its substantial impact on prediction accuracy. The number of bases is selected from a range of values smaller than the number of observation points using information criteria (Akaike, 1973; Schwarz, 1978; Konishi and Kitagawa, 1996) or by employing cross-validation (Stone, 1974). This practice aims to avoid overfitting, i.e., it seeks to mitigate the explosion of interpolated values between observation points. However, recent developments in statistical theory suggest that this approach may need to be reconsidered to achieve better prediction performance.

Overfitting has long been a challenge in FDA; however, recent statistical theory has begun to reconcile this issue. Indeed, Zhang et al. (2021) empirically showed that deep neural network models with a large number of parameters that perfectly fit the training data can yield near-optimal accuracy for the test data. This phenomenon is referred to as the double-descent phenomenon (Belkin et al., 2018, 2019), where the interpolation error follows a conventional U-shaped curve up to a threshold, but decreases after reaching a peak at the threshold. In addition, Hastie et al. (2022) and Belkin et al. (2020) theoretically revealed that the double-descent phenomenon can occur for linear regression models in several situations and showed the phenomenon empirically. For more detailed explanations, see James et al. (2021), Schaeffer et al. (2024), Misiakiewicz and Montanari (2023), and references therein. Further, James et al. (2021) demonstrated the double-descent phenomenon through a simple spline fitting. Figure 1 illustrates the phenomenon by fitting curves to measurement points. The panels on the left depict $15$ numerically generated data points and the spline curves fitted with the minimum norm interpolator (Hastie et al., 2022; Bartlett et al., 2020), which is used to estimate the model parameters, for four different numbers of basis functions. A detailed description of the methodology is given in Section 2. The right panel displays the mean squared errors in relation to the number of basis functions. When the number of bases equals the number of measurements, the spline curve appears overly undulating, which causes the mean squared error to explode. However, as the number of bases increases further, the fitted curve becomes less undulating and the mean squared error decreases again. This suggests that using a large number of basis functions, especially a number larger than the sample size, may improve the accuracy of functional data analysis techniques.

Figure 1. Left: Curve fits when the number of bases is $4$ (upper left), $20$ (upper right), $40$ (lower left), and $120$ (lower right). Right: MSE for varying number of bases.

In this paper, we advocate the use of a large number of basis functions, in combination with the minimum norm interpolator, to transform observed longitudinal data into functional data. Additionally, we apply the minimum norm interpolator to estimate functional regression models, which represent relationships between predictors and responses, either or both of which are given as functional data. We discuss four representative functional data regression scenarios where double descent is particularly relevant. We examine the effectiveness of the proposed approach within the four scenarios through simulation studies and applications to real-world datasets.

The remainder of the paper is organized as follows. Section 2 introduces functionalization with an excess number of basis functions. In Section 3, we discuss regression methods for functional data and their relation to the double-descent phenomenon. We validate our approach through numerical experiments in Section 4. Section 5 demonstrates the practical relevance of our proposal through applications to real datasets. Finally, we summarize our main points and suggest future research directions in Section 6.

2. Functionalization

Functionalization is a crucial first step in functional data analysis. Without appropriate functionalization, extracting meaningful descriptive statistics or reaching accurate inferential conclusions becomes challenging in regression and classification. The process of functionalization involves transforming discrete, noise-corrupted observations into smooth functions that capture the underlying patterns and trends in the data (Ramsay and Silverman, 2005).

Suppose we have $N$ sets of time-course observations, where the $i$ -th subject has $M_{i}$ observations $\{x_{i1},x_{i2},\ldots,x_{iM_{i}}\}$ at time points $\{t_{i1},t_{i2},\ldots,t_{iM_{i}}\}$ $(i=1,2,\ldots,N)$ , respectively, and $t_{ij}$ are elements of a domain $\mathcal{T}\subset\mathbb{R}$ . We then consider transforming the time-course data into functions using the basis expansions (Ramsay and Silverman, 2005; Wang et al., 2016). Let $\{\phi_{k}:\mathcal{T}\to\mathbb{R}\}_{k=1}^{K}$ be a set of $K$ basis functions. We assume that each observation $x_{ij}$ can be expressed by the following regression form:

$$x_{ij}=\sum_{k=1}^{K}w_{ik}\phi_{k}(t_{ij})+\varepsilon_{ij}=\bm{w}_{i}^{\top}\bm{\phi}(t_{ij})+\varepsilon_{ij}\quad(j=1,\ldots,M_{i}), \tag{1}$$

where $\bm{w}_{i}=(w_{i1},w_{i2},\ldots,w_{iK})^{\top}$ is a vector of coefficients, $\bm{\phi}(t)=(\phi_{1}(t),\phi_{2}(t),\ldots,\phi_{K}(t))^{\top}$ is a vector of basis functions, and $\varepsilon_{i1},\ldots,\varepsilon_{iM_{i}}$ are independent noise terms with mean $0$ and variance $\sigma_{i}^{2}$ . Common choices for basis functions include the Fourier basis, spline basis, and wavelet basis (Ramsay and Silverman, 2005).

We then calculate the optimal coefficient vector $\bm{w}_{i}$. Using the notation $\bm{x}_{i}=(x_{i1},x_{i2},\ldots,x_{iM_{i}})^{\top}$, $\Phi=(\bm{\phi}(t_{i1}),\bm{\phi}(t_{i2}),\ldots,\bm{\phi}(t_{iM_{i}}))^{\top}$, and $\bm{\varepsilon}_{i}=(\varepsilon_{i1},\varepsilon_{i2},\ldots,\varepsilon_{iM_{i}})^{\top}$, the regression model (1) can be expressed as $\bm{x}_{i}=\Phi\bm{w}_{i}+\bm{\varepsilon}_{i}$. We estimate $\bm{w}_{i}$ using the minimum norm interpolator (Hastie et al., 2022; Bartlett et al., 2020):

$$\operatornamewithlimits{argmin}_{\bm{w}_{i}\in\mathbb{R}^{K}}\|\bm{w}_{i}\|\quad{\rm s.t.}\quad\bm{w}_{i}\quad{\rm minimizes}\quad\|\bm{x}_{i}-\Phi\bm{w}_{i}\|,$$

where $\|\cdot\|$ denotes the Euclidean norm. The solution to the above optimization problem is explicitly given by

$$\widehat{\bm{w}}_{i}=(\Phi^{\top}\Phi)^{\dagger}\Phi^{\top}\bm{x}_{i}, \tag{2}$$

where $(\Phi^{\top}\Phi)^{\dagger}$ denotes the Moore-Penrose pseudo-inverse matrix (e.g., Banerjee and Roy, 2014) of $\Phi^{\top}\Phi$ . Using the estimated coefficients $\widehat{\bm{w}}_{i}$ , we express the functional representation of the $i$ -th subject’s data as $x_{i}(t)=\widehat{\bm{w}}_{i}^{\top}\bm{\phi}(t)$ .
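As a rough illustration of the functionalization step in Equation (2), the following Python sketch fits more basis functions than observation points to a single noisy curve. Everything specific in it is an assumption made for illustration only: the cosine basis on $[0,1]$ (the experiments in this paper use natural splines), the toy sine signal, and the helper names `cosine_basis` and `min_norm_coefficients`. The sketch uses the identity $\Phi^{\dagger}=(\Phi^{\top}\Phi)^{\dagger}\Phi^{\top}$ so that the pseudo-inverse is applied to $\Phi$ directly.

```python
import numpy as np


def cosine_basis(t, K):
    """Evaluate K cosine basis functions on [0, 1] at points t; returns an (M, K) matrix.

    Hypothetical stand-in basis, chosen only because it needs no extra libraries.
    """
    t = np.asarray(t, dtype=float)
    k = np.arange(K)
    Phi = np.sqrt(2.0) * np.cos(np.pi * k[None, :] * t[:, None])
    Phi[:, 0] = 1.0  # constant function for k = 0
    return Phi


def min_norm_coefficients(Phi, x):
    """Minimum norm least-squares coefficients, cf. Equation (2).

    pinv(Phi) @ x equals pinv(Phi.T @ Phi) @ Phi.T @ x mathematically,
    but avoids squaring the condition number of Phi.
    """
    return np.linalg.pinv(Phi) @ x


rng = np.random.default_rng(0)
M, K = 15, 40  # more bases than observation points
t = np.sort(rng.uniform(0.0, 1.0, size=M))
x = np.sin(2.0 * np.pi * t) + 0.1 * rng.standard_normal(M)  # toy noisy curve

Phi = cosine_basis(t, K)
w_hat = min_norm_coefficients(Phi, x)

# Values of the fitted function at the observation points.
fitted = Phi @ w_hat

# Functional representation x_i(t) = w_hat^T phi(t), evaluated on a fine grid.
grid = np.linspace(0.0, 1.0, 200)
x_fun = cosine_basis(grid, K) @ w_hat
```

When $K\geq M_{i}$ and $\Phi$ has full row rank, the fitted function passes through every observation, and among all such interpolating coefficient vectors the pseudo-inverse returns the one with the smallest Euclidean norm.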

Regarding the choice of the number of basis functions $K$ , traditional approaches often select $K$ to be smaller than the number of observations $M_{i}$ to avoid overfitting (Ramsay and Silverman, 2005). However, recent theoretical evaluations by Hastie et al. (2022) suggest that using a larger number of parameters (bases, in this context) can be beneficial in cases where the noise level is low and the model is misspecified. In light of these insights, we propose using an excess number of basis functions, combined with the minimum norm interpolator, for functionalization in FDA. This approach has the potential to capture more complex patterns in the data and improve the accuracy of interpolations or subsequent analyses, especially in low-noise settings or when the true underlying function does not perfectly align with the chosen basis.

3. Functional Regression Model

In this section, we construct estimators through basis expansions for two standard models: scalar-on-function regression and function-on-function regression.

3.1. Scalar on Function Regression

Consider an independently and identically distributed dataset $\mathcal{D}:=\{x_{i},y_{i}\}_{i=1}^{N}$ , with explanatory function $x_{i}(\cdot)\in L_{2}(\mathcal{S})$ on domain $\mathcal{S}\subset\mathbb{R}$ and scalar response variable $y_{i}\in\mathbb{R}$ . Suppose that predicting the response $y$ when a new $x$ is observed is of interest. We employ the following scalar-on-function regression model (SonF, Hastie and Mallows, 1993; Müller, 2005; Araki et al., 2009):

$$y_{i}=\int_{\mathcal{S}}x_{i}(s)\beta(s)ds+\varepsilon_{i}, \tag{3}$$

where $\beta\in L_{2}(\mathcal{S})$ is a functional coefficient and $\varepsilon_{i}$ is an error term with mean zero and finite variance. This model assumes a linear relationship between the functional predictor $x_{i}$ and the scalar response $y_{i}$ , mediated by the functional coefficient $\beta$ .

We can represent $x_{i}(s)$ and $\beta(s)$ using basis expansions:

$$x_{i}(s)=\sum_{k=1}^{K}w_{ik}\phi_{k}(s),~~\mathrm{and}~~\beta(s)=\sum_{k=1}^{K}b_{k}\phi_{k}(s),$$

where $\phi_{k}$ are the basis functions, $w_{ik}$ and $b_{k}$ are corresponding coefficients for $x_{i}$ and $\beta$ , respectively, and $K$ is the number of basis functions. The coefficients $w_{ik}$ are obtained using the minimum norm interpolator (2); therefore, the $w_{ik}$ are known here. For notational simplicity, we write the above expansion in vector form as

$$x_{i}(s)=\bm{w}_{i}^{(K)\top}\bm{\phi}^{(K)}(s),~~\mathrm{and}~~\beta(s)=\bm{b}^{(K)\top}\bm{\phi}^{(K)}(s), \tag{4}$$

where $\bm{\phi}^{(K)}(s):=(\phi_{1}(s),\ldots,\phi_{K}(s))^{\top}$, $\bm{w}_{i}^{(K)}:=(w_{i1},\ldots,w_{iK})^{\top}$, and $\bm{b}^{(K)}:=(b_{1},\ldots,b_{K})^{\top}$. The superscripts of the vectors are added to explicitly represent the number of bases.

Using the above expansion, we can rewrite (3) as

$$\begin{aligned} y_{i}&=\bm{w}_{i}^{(K)\top}\Phi^{(K)}\bm{b}^{(K)}+\varepsilon_{i}\\ &=\bm{z}_{i}^{\top}\bm{b}^{(K)}+\varepsilon_{i}, \end{aligned} \tag{5}$$

where $\Phi^{(K)}$ denotes the $K\times K$ matrix, whose $(i,j)$ -th entry is $\int_{\mathcal{S}}\phi_{i}(s)\phi_{j}(s)ds$ , and $\bm{z}_{i}=\Phi^{(K)}\bm{w}_{i}^{(K)}$ . Then, the joint equation for all observations can be written as

$$\bm{y}=Z\bm{b}^{(K)}+\bm{\varepsilon}, \tag{6}$$

where $\bm{y}=(y_{1},y_{2},\ldots,y_{N})^{\top}$, $Z=(\bm{z}_{1},\bm{z}_{2},\ldots,\bm{z}_{N})^{\top}$ is the $N\times K$ matrix whose $i$-th row is $\bm{z}_{i}^{\top}$, and $\bm{\varepsilon}=(\varepsilon_{1},\varepsilon_{2},\ldots,\varepsilon_{N})^{\top}$.

When $K<N$ , the ordinary least squares estimator $(Z^{\top}Z)^{-1}Z^{\top}\bm{y}$ can be used to estimate $\bm{b}^{(K)}$ . However, we are interested in the case where $K$ can be larger than $N$ , and $Z^{\top}Z$ is not invertible. Then, we introduce the minimum norm interpolator:

$$\operatornamewithlimits{argmin}_{\bm{b}^{(K)}}\|\bm{b}^{(K)}\|\quad{\rm s.t.}\quad\bm{b}^{(K)}\quad{\rm minimizes}\quad\|\bm{y}-Z\bm{b}^{(K)}\|,$$

which is equivalent to

$$\widehat{\bm{b}}^{(K)}=(Z^{\top}Z)^{\dagger}Z^{\top}\bm{y}. \tag{7}$$

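In other words, for a new observation with feature vector $\bm{z}_{\mathrm{new}}$, constructed in the same way as $\bm{z}_{i}$ in (5), we adopt $\bm{z}_{\mathrm{new}}^{\top}\widehat{\bm{b}}^{(K)}$ as the predictor of the response.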
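The estimator in Equation (7) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the cosine basis on $[0,1]$, the midpoint-rule Gram matrix, the random stand-in coefficient matrix `W`, and the helper names `gram_matrix` and `fit_sonf` are all hypothetical.

```python
import numpy as np


def cosine_basis(s, K):
    """Hypothetical stand-in basis: K cosine functions on [0, 1], returned as (len(s), K)."""
    s = np.asarray(s, dtype=float)
    k = np.arange(K)
    B = np.sqrt(2.0) * np.cos(np.pi * k[None, :] * s[:, None])
    B[:, 0] = 1.0
    return B


def gram_matrix(K, grid_size=2000):
    """Approximate the K x K matrix of integrals of phi_i * phi_j by the midpoint rule."""
    s = (np.arange(grid_size) + 0.5) / grid_size
    B = cosine_basis(s, K)
    return B.T @ B / grid_size


def fit_sonf(W, y, gram):
    """Minimum norm estimate of b in y = Z b + eps, cf. Equations (5)-(7)."""
    Z = W @ gram  # row i is z_i^T = w_i^T Phi^(K); the Gram matrix is symmetric
    b_hat = np.linalg.pinv(Z) @ y  # equals pinv(Z.T @ Z) @ Z.T @ y
    return Z, b_hat


rng = np.random.default_rng(1)
N, K = 10, 30  # more bases than training samples
W = rng.standard_normal((N, K))  # stand-in for the coefficients obtained from (2)
b_true = rng.standard_normal(K) / np.arange(1, K + 1)
gram = gram_matrix(K)
y = W @ gram @ b_true + 0.1 * rng.standard_normal(N)

Z, b_hat = fit_sonf(W, y, gram)

# Prediction for a new curve with (hypothetical) coefficient vector w_new.
w_new = rng.standard_normal(K)
y_pred = (gram @ w_new) @ b_hat
```

With $K>N$ and $Z$ of full row rank, the fitted values $Z\widehat{\bm{b}}^{(K)}$ reproduce the training responses exactly, which is the interpolating regime discussed in this section.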

Since, in real measurements, data are observed at a finite number of discrete time points, we need to take that number into account. Here, for brevity, the number of observation points is assumed to be common across all individuals. Let $M$ be the number of $x$ observation points (it should be noted that the following discussion can be extended in a straightforward way to the case in which the number of observations is heterogeneous). Since $M$ controls the information contained in the regression model, it will have a significant impact on prediction accuracy.

Now, for precise prediction, we explore the way to select the number of bases, which is the only value that the analysts can control. To investigate the relationship between the number of basis functions $K$ , the sample size $N$ , and the number of observation points $M$ , and their impact on the double-descent phenomenon, we consider two scenarios:

(A) $N<M$: If $1\leq K<M$, the model in (6) is a regression problem with sample size $N$ and number of parameters $K$. As $K$ gradually increases from $1$, a double-descent phenomenon with a peak at $K=N$ will be observed. This can be understood by regarding the original regression as an over-parameterized linear regression.

(B) $M<N$: In this case, since $\mathrm{rank}\,Z~(\leq M)$ is less than $N$, the double-descent with respect to $N$ does not occur. Since the expressive power of the model in (6) is limited to less than the number of observation points if $M$ is small, accuracy will reach a ceiling even when the number of bases is increased.
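The rank argument behind scenarios (A) and (B) can be checked numerically. The sketch below is an illustration under assumptions: all subjects share the same $M$ observation points, the basis is a hypothetical cosine basis on $[0,1]$ (orthonormal, so the Gram matrix in (5) is the identity and $Z$ coincides with the coefficient matrix), and the observations are pure noise. The helper `design_rank` is not part of the paper.

```python
import numpy as np


def cosine_basis(s, K):
    """Hypothetical stand-in basis: K cosine functions on [0, 1], returned as (len(s), K)."""
    s = np.asarray(s, dtype=float)
    k = np.arange(K)
    B = np.sqrt(2.0) * np.cos(np.pi * k[None, :] * s[:, None])
    B[:, 0] = 1.0
    return B


def design_rank(N, M, K, rng, rel_tol=1e-8):
    """Numerical rank of the design matrix Z in (6) for given N, M, and K."""
    t = (np.arange(M) + 0.5) / M  # common observation points (assumption)
    X = rng.standard_normal((N, M))  # toy observations, one row per subject
    Phi = cosine_basis(t, K)  # M x K basis matrix
    W = X @ np.linalg.pinv(Phi).T  # row i holds the coefficients from Equation (2)
    Z = W  # Gram matrix is the identity for this orthonormal basis
    sv = np.linalg.svd(Z, compute_uv=False)
    return int(np.sum(sv > rel_tol * sv[0]))


rng = np.random.default_rng(2)

# Scenario (B): M < N, so the rank is capped by M however large K becomes.
ranks_B = {K: design_rank(N=50, M=5, K=K, rng=rng) for K in (4, 10, 50)}

# Scenario (A): N < M, so the rank can reach N and K = N is the interpolation threshold.
ranks_A = {K: design_rank(N=10, M=75, K=K, rng=rng) for K in (4, 10, 50)}
```

In this setup the rank of $Z$ is bounded by $\min(N,M,K)$: in scenario (B) the bound $M$ binds before $N$ is reached, whereas in scenario (A) the rank grows with $K$ until it hits $N$.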

The model considered here is a simple linear regression model, and the concern in such a case is model misspecification. In real data analysis, the true functional data (i.e., the data generating process) is unknown, and there are features that cannot be captured by a finite set of basis functions chosen arbitrarily by the analyst. For example, approximating a function with a few dozen spline bases may not describe periodicity or the variation of spikes. In a rough sense, Equation (5) is considered a misspecified model. However, as stated in Section 5 of Hastie et al. (2022), even if the model is misspecified, increasing the dimension of the parameters will contribute to improved prediction accuracy. This implies that increasing the number of basis functions is also robust to model misspecification, providing further motivation for the use of excess basis functions in functional regression.

3.2. Function on Function Regression

Consider an independent and identically distributed dataset $\mathcal{D}:=\{x_{i},y_{i}\}_{i=1}^{N}$ , where $x_{i}(\cdot)\in L_{2}(\mathcal{S})$ is an explanatory function on domain $\mathcal{S}\subset\mathbb{R}$ , and $y_{i}(\cdot)$ is a response function on domain $\mathcal{T}\subset\mathbb{R}$ . Our goal is to predict the response function $y$ when a new function $x$ is observed. We adopt the following function-on-function regression model (FonF, Ramsay and Dalzell, 1991; Matsui et al., 2009):

$$y_{i}(t)=\int_{\mathcal{S}}\beta(s,t)x_{i}(s)ds+\varepsilon_{i}(t), \tag{8}$$

where $\beta(s,t)$ is a bivariate functional coefficient, and $\varepsilon_{i}(t)$ is an error process with mean zero and constant variance $\sigma^{2}$ . This model assumes a linear relationship between the functional predictor $x_{i}$ and the functional response $y_{i}$ , mediated by the bivariate functional coefficient $\beta$ .

Using basis expansion, as in Equation (4), we can represent the functional predictor, the bivariate functional coefficient, and the functional response as

$$x_{i}(s)=\bm{w}_{i}^{(K_{1})\top}\bm{\phi}^{(K_{1})}(s),\ \ \ \beta(s,t)=\bm{\phi}^{(K_{1})\top}(s)B\bm{\psi}^{(K_{2})}(t),\ \ \ y_{i}(t)=\bm{v}_{i}^{(K_{2})\top}\bm{\psi}^{(K_{2})}(t),$$

where $\bm{v}_{i}^{(K_{2})}=(v_{i1},\ldots,v_{iK_{2}})^{\top}$ is the coefficient vector of the bases $\bm{\psi}^{(K_{2})}(t)=(\psi_{1}(t),\ldots,\psi_{K_{2}}(t))^{\top}$ , and $B$ is the coefficient matrix of $\bm{\phi}^{(K_{1})}(s)$ and $\bm{\psi}^{(K_{2})}(t)$ . Here the coefficients $w_{ik}$ $(k=1,2,\ldots,K_{1})$ and $v_{il}$ $(l=1,2,\ldots,K_{2})$ are obtained using the minimum norm interpolator, as described in Equation (2). Substituting the basis function expansions into Equation (8), we obtain

$$\bm{v}_{i}^{(K_{2})\top}\bm{\psi}^{(K_{2})}(t)=\bm{w}_{i}^{(K_{1})\top}\Phi^{(K_{1})}B\bm{\psi}^{(K_{2})}(t)+\varepsilon_{i}(t). \tag{9}$$

To estimate the coefficient matrix $B$ , we consider solving the following minimization problem:

$$\operatornamewithlimits{argmin}_{B\in\mathbb{R}^{K_{1}\times K_{2}}}\|\operatorname{vec}(B)\|\quad{\rm s.t.}\quad B\quad{\rm minimizes}\quad\|V\bm{\psi}^{(K_{2})}(t)-ZB\bm{\psi}^{(K_{2})}(t)\|_{L_{2}},$$

where $V=(\bm{v}_{1}^{(K_{2})},\bm{v}_{2}^{(K_{2})},\ldots,\bm{v}_{N}^{(K_{2})})^{\top}$, $\operatorname{vec}(\cdot)$ is the vectorization operator of a matrix, and $\|\cdot\|_{L_{2}}$ is the $L_{2}$ norm. Then, minimizing the least-squares error yields

$$\operatorname{vec}(\widehat{B})=(\Psi\otimes Z^{\top}Z)^{\dagger}\operatorname{vec}(Z^{\top}V\Psi), \tag{10}$$

where $\Psi$ is a $K_{2}\times K_{2}$ matrix whose $(i,j)$ -th entry is $\int_{\mathcal{T}}\psi_{i}(t)\psi_{j}(t)dt$ . We consider this to be an estimator for the FonF problem.
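Equation (10) can be sketched directly with a Kronecker product. The code below is an illustration under assumptions rather than the authors' implementation: `Z`, `V`, and `Psi` are random stand-ins, `fit_fonf` is a hypothetical helper, and the cutoff `rcond=1e-10` is an arbitrary choice that keeps round-off-level singular values of the rank-deficient matrix $\Psi\otimes Z^{\top}Z$ from being inverted. The vectorization stacks columns, which matches the identity $(A\otimes B)\operatorname{vec}(X)=\operatorname{vec}(BXA^{\top})$.

```python
import numpy as np


def fit_fonf(Z, V, Psi, rcond=1e-10):
    """Minimum norm estimate of the K1 x K2 coefficient matrix B, cf. Equation (10)."""
    K1, K2 = Z.shape[1], V.shape[1]
    lhs = np.kron(Psi, Z.T @ Z)  # (K1*K2) x (K1*K2)
    rhs = (Z.T @ V @ Psi).reshape(-1, order="F")  # column-stacking vec(.)
    vec_B = np.linalg.pinv(lhs, rcond=rcond) @ rhs
    return vec_B.reshape((K1, K2), order="F")


rng = np.random.default_rng(3)
N, K1, K2 = 8, 12, 5  # K1 > N, so Z.T @ Z is singular
Z = rng.standard_normal((N, K1))  # stand-in for the design matrix of Section 3.1
V = rng.standard_normal((N, K2))  # stand-in for the response coefficients v_i
A = rng.standard_normal((K2, K2))
Psi = A @ A.T / K2 + np.eye(K2)  # stand-in symmetric positive definite Gram matrix

B_hat = fit_fonf(Z, V, Psi)

# When Psi is invertible, (10) reduces to applying the pseudo-inverse of Z to V.
B_direct = np.linalg.pinv(Z) @ V
```

The comparison with `B_direct` reflects that, for an invertible $\Psi$, the factor $\Psi$ cancels in (10); forming the Kronecker product explicitly is therefore mainly of expository value and becomes expensive as $K_{1}K_{2}$ grows.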

In practice, the functional predictor and response are observed at a finite number of discrete time points. Let $M_{1}$ and $M_{2}$ be the number of time points for $x$ and $y$ , respectively, assumed, for simplicity, to be the same across individuals. The dimensions of the observed data can affect the properties of the estimator. There are many possible combinations of the sample size $N$ , the number of observation points $M_{1}$ and $M_{2}$ , and the number of basis functions $K_{1}$ and $K_{2}$ . However, two scenarios are particularly relevant to the double-descent phenomenon:

(C) $M_{2}$ and $K_{2}$: The parameter $K_{2}$ directly influences the prediction of the function $y$. Based on the idea that a function can be predicted with good accuracy if the unobserved parts are properly interpolated, increasing $K_{2}$ beyond $M_{2}$ may lead to the double-descent phenomenon in terms of prediction accuracy. In other words, the phenomenon can be attributed to the accuracy of the functionalization of the response.

(D) $N$ and $K_{1}$: Following the same principle as (A) in the previous section, by increasing the number of basis functions for $x$ beyond the sample size $N$, a double-descent phenomenon can be observed as long as $M_{1}>N$. This corresponds to interpolating unobserved parts of the functional predictor using excess basis functions.

The double-descent phenomenon in the FonF model can manifest in two ways: through the functionalization of the response (Scenario C) and through the interpolation of the functional predictor (Scenario D). By using excess basis functions in both the predictor and response expansions, we may be able to capture more complex patterns in the functional data and improve the accuracy of the functional regression model, even when the number of basis functions exceeds the number of observation points or the sample size. This further motivates the use of excess basis functions in functional regression settings.

4. Numerical Experiments

4.1. SonF Regression

As discussed at the end of Section 3.1, the accuracy of our predictions in SonF regression can be influenced by the various interrelationships among the sample size $N$ , the number of observation points $M$ , and the number of basis functions $K$ . We investigated the prediction performances for scenarios (A) and (B) as described in Section 3.1. Table 1 summarizes the simulation settings. Although multiple criteria have been devised for basis selection, we conduct experiments with the number of bases selected through five-fold cross-validation (CV, Stone, 1974), selected by corrected AIC (cAIC, Sugiura, 1978; Bedrick and Tsai, 1994), and fixed at a value of $50$ . Note that when cAIC is used, the error terms of the regression model are assumed to be independent Gaussian.

Table 1. Summary of simulation settings and representations for scalar-on-function regression.

Symbol	Description	Scenario (A)	Scenario (B)
$N$	Size of training dataset	Variable	Fixed ( $50$ )
$N_{\text{test}}$	Size of test dataset	Fixed ( $150$ )	Fixed ( $150$ )
$M$	Number of measurements for $x$	Fixed ( $75$ )	Variable
$K$	Number of bases for $x$	Variable	Variable

Scenario (A)

Consider the situation where the number of observation points $M$ is larger than the sample size $N$ , discussed in Section 3.1. First, we present the data-generating process. The functions $x_{i}(s)$ and $\beta(s)$ are produced by Gaussian processes (GPs) with the radial basis function kernel (RBF, Rasmussen and Williams, 2006) $k(x_{1},x_{2})=\theta^{2}\exp(-\|x_{1}-x_{2}\|^{2}/h^{2})$ , whose hyperparameters are set to $(\theta,h)=(10,10)$ and $(15,10)$ , respectively. The generated $x_{i}(s)$ are then centered to have a mean of $0$ . We then generate $y_{i}$ by adding a standard normal noise to the integral of the product of $x_{i}(s)$ and $\beta(s)$ . The observation vectors $\{\bm{x}_{i}\}$ are derived by selecting $M=75$ random points from the functions plus a standard normal noise $N(\bm{0},I_{M})$ . We set the training data size to $N=5,10$ and $20$ .
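The data-generating process described above can be sketched as follows. Several details are not specified in the text and are assumptions here: the domain $[0,50]$ and its grid, the use of the same $M$ observation points for every subject, centering by subtracting the pointwise mean across curves, and the Riemann-sum approximation of the integral. The helpers `rbf_kernel` and `sample_gp` are hypothetical names.

```python
import numpy as np


def rbf_kernel(s, theta, h):
    """RBF kernel matrix k(s1, s2) = theta^2 * exp(-|s1 - s2|^2 / h^2) on a 1-D grid."""
    d = s[:, None] - s[None, :]
    return theta**2 * np.exp(-(d**2) / h**2)


def sample_gp(s, theta, h, n_samples, rng):
    """Draw zero-mean GP paths on the grid s via an eigendecomposition of the kernel."""
    vals, vecs = np.linalg.eigh(rbf_kernel(s, theta, h))
    vals = np.clip(vals, 0.0, None)  # drop tiny negative eigenvalues caused by round-off
    root = vecs * np.sqrt(vals)  # kernel matrix is approximately root @ root.T
    return rng.standard_normal((n_samples, s.size)) @ root.T


rng = np.random.default_rng(4)
s = np.linspace(0.0, 50.0, 300)  # hypothetical domain and grid
N, M = 10, 75

x_fun = sample_gp(s, theta=10.0, h=10.0, n_samples=N, rng=rng)
x_fun = x_fun - x_fun.mean(axis=0, keepdims=True)  # one reading of "centered"
beta = sample_gp(s, theta=15.0, h=10.0, n_samples=1, rng=rng)[0]

# Scalar responses: integral of x_i(s) * beta(s) plus standard normal noise.
ds = s[1] - s[0]
signal = (x_fun * beta).sum(axis=1) * ds  # Riemann-sum approximation
y = signal + rng.standard_normal(N)

# Noisy observations of each curve at M randomly selected points.
idx = np.sort(rng.choice(s.size, size=M, replace=False))
t_obs = s[idx]
X_obs = x_fun[:, idx] + rng.standard_normal((N, M))
```

The eigendecomposition is used instead of a Cholesky factorization because the RBF kernel matrix on a fine grid is numerically singular.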

For each $N$ , we used the above procedure to generate $50$ datasets, each with $N$ observations as a training set and $150$ data points as a test set, and then analyzed each dataset using natural splines (Wood, 2017; R Core Team, 2024) and (7). Specifically, for $N$ observations, we calculated (7), varying the number of bases $K$ from $4$ to $50$ . To assess the performance of the model, we computed the mean squared error (MSE) of the predictions from the true signal for the $150$ test data and analyzed the changes in MSE as $K$ increased.
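The evaluation loop described above, in which $K$ is varied and the test MSE against the true signal is recorded, might look like the sketch below. It is a toy version under assumptions: a cosine basis on $[0,1]$ replaces the natural splines (it is orthonormal, so the Gram matrix in (5) is the identity and $Z$ equals the coefficient matrix), the curves and coefficient function are simple synthetic stand-ins rather than Gaussian process draws, and all helper names are hypothetical.

```python
import numpy as np


def cosine_basis(s, K):
    """Hypothetical stand-in basis: K cosine functions on [0, 1], returned as (len(s), K)."""
    s = np.asarray(s, dtype=float)
    k = np.arange(K)
    B = np.sqrt(2.0) * np.cos(np.pi * k[None, :] * s[:, None])
    B[:, 0] = 1.0
    return B


def functionalize(X, t, K):
    """Coefficients from Equation (2) for each row of X observed at common points t."""
    return X @ np.linalg.pinv(cosine_basis(t, K)).T


def sweep_test_mse(X_tr, y_tr, X_te, signal_te, t, K_grid):
    """Test MSE of the estimator (7) against the true signal for each K in K_grid."""
    mses = []
    for K in K_grid:
        Z_tr = functionalize(X_tr, t, K)  # Gram matrix is the identity, so Z = W
        Z_te = functionalize(X_te, t, K)
        b_hat = np.linalg.pinv(Z_tr) @ y_tr
        mses.append(float(np.mean((Z_te @ b_hat - signal_te) ** 2)))
    return np.array(mses)


rng = np.random.default_rng(5)
grid = (np.arange(1000) + 0.5) / 1000  # fine grid on [0, 1]
beta = np.exp(-grid) * np.sin(3.0 * np.pi * grid)  # toy coefficient function
N, N_test, M = 10, 150, 75
idx = np.sort(rng.choice(grid.size, size=M, replace=False))
t = grid[idx]


def make_set(n):
    """Toy curves, their true signals, and noisy observations at the points t."""
    a = rng.standard_normal((n, 3))
    curves = (a[:, [0]] * np.sin(2.0 * np.pi * grid)
              + a[:, [1]] * np.cos(2.0 * np.pi * grid)
              + a[:, [2]] * grid)
    signal = curves @ beta / grid.size  # midpoint rule for the integral in (3)
    X = curves[:, idx] + rng.standard_normal((n, M))
    return X, signal


X_tr, signal_tr = make_set(N)
y_tr = signal_tr + rng.standard_normal(N)
X_te, signal_te = make_set(N_test)

K_grid = np.arange(4, 51)
mses = sweep_test_mse(X_tr, y_tr, X_te, signal_te, t, K_grid)
```

Plotting `mses` against `K_grid` is the natural way to look for a peak near $K=N$; whether a clear double-descent curve appears in this toy setup depends on the seed and noise level and is not guaranteed.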

Table 2. MSEs of scalar-on-function regressions for different basis selection methods, averaged over $50$ simulated datasets.

	Scenario (A)			Scenario (B)
Method	$N=5$	$N=10$	$N=20$	$M=5$	$M=10$	$M=20$
CV	21.992	9.650	8.654	27.549	18.852	4.833
Fixed	39.805	20.473	8.853	27.549	18.874	5.120
cAIC	54.452	22.387	8.950	28.608	18.882	5.490

The left panel in Figure 2 illustrates, for one representative dataset, how the number of bases $K$ affects predictions when $M$ is large. Initially, the MSE increases rapidly as $K$ approaches the sample size $N$ ; however, it peaks and begins to decrease when $K$ becomes larger than $N$ , exhibiting the double-descent phenomenon. Next, observe the quantitative evaluation in Table 2, whose entries represent the average values of the MSEs over $50$ datasets. Note that since cAIC assumes a situation where the degrees of freedom are smaller than $N$ , the optimal number of basis functions selected is found before the peak. However, the prediction accuracy of the predictor with a fixed number of basis functions ( $K=50$ ) is superior to the case in which the number of basis functions is selected using cAIC. For CV, which solely considers the goodness of fit of the predictions, the prediction accuracy after the peak is better than before the peak, as can be seen in the right panel of Figure 2. These findings indicate that choosing a number of basis functions that is larger than the sample size is preferable in this scenario.

Scenario (B)

Next, we focus on the situation where the number of observation points $M$ is smaller than the sample size $N$. The functions $x_{i}$ and $\beta$ and the response $y_{i}$ were generated in the same manner as in the previous scenario. In this setting, we generated $N=50$ data points for the training dataset and $150$ data points for the test dataset, with $M$ taking on the values $5$, $10$, and $20$.

We produced $50$ datasets through the above procedure and analyzed each. For each value of $M$ , in each of the datasets, we trained the parameters using (7), with $K$ natural spline bases ( $K$ varied from $4$ to $50$ ), on the training data and then calculated the MSE on the test data to examine how the MSE values changed as the number of basis functions $K$ increases.

The results are displayed in Figure 3 and Table 2, where the reported values are averaged over the $50$ datasets. In this scenario, the rank of the design matrix in (6) is low, which implies that the degrees of freedom of the model remain unchanged even as the number of basis functions increases. The right panel in Figure 3 shows that CV did not choose an excess number of bases. As a result, the observed MSEs ceased to decrease at around $K=M$, suggesting that increasing the number of basis functions beyond this point is not particularly advantageous. Hence, if the number of observation points restricts the expressive power of the regression model, the double-descent phenomenon does not occur.

These simulation studies demonstrate the potential benefits of using excess basis functions in SonF regression when the number of observation points is sufficiently large (Scenario A). The double-descent phenomenon is clearly observed, with the prediction accuracy improving as the number of basis functions increases beyond the sample size. However, when the number of observation points is limited (Scenario B), increasing the number of basis functions beyond the number of observation points does not lead to further improvements in prediction accuracy, and the double-descent phenomenon is not observed. These findings highlight the importance of considering the interplay between the sample size, the number of observation points, and the number of basis functions when applying scalar-on-function regression in practice. The use of excess basis functions, combined with the minimum norm interpolator, can be a valuable approach for improving prediction accuracy in scenarios where the number of observation points is sufficiently large relative to the sample size.

4.2. FonF Regression

As discussed in Section 3.2, the prediction accuracy in FonF regression is influenced by the interplay between the sample size $N$ , the number of observation points for the predictor and response functions ( $M_{1}$ and $M_{2}$ ), and the number of basis functions for the predictor and response functions ( $K_{1}$ and $K_{2}$ ). We now demonstrate Scenarios (C) and (D) through the following numerical experiments. The settings are summarized in Table 3.

Table 3. Summary of simulation settings and symbols for function-on-function regression.

Symbol	Description	Scenario (C)	Scenario (D)
$N$	Size of training dataset	Fixed ( $50$ )	Variable
$N_{\text{test}}$	Size of test dataset	Fixed ( $150$ )	Fixed ( $150$ )
$M_{1}$	Number of measurements for $x$	Fixed ( $75$ )	Fixed ( $75$ )
$M_{2}$	Number of measurements for $y$	Variable	Fixed ( $75$ )
$K_{1}$	Number of bases for $x$	Fixed ( $10$ )	Variable
$K_{2}$	Number of bases for $y$	Variable	Fixed ( $10$ )

Scenario (C)

Here, we investigate the relationship between $K_{2}$ (number of basis functions for the response function $y$) and $M_{2}$ (number of observation points for $y$). We consider the scenario where both the predictor $x$ and the response $y$ are functions. Specifically, we sampled $x$ from a GP whose kernel is an RBF having hyperparameters $(\theta,h)=(10,10)$ and centered it to be zero-mean. For every $t$, we sampled $\beta(\cdot,t)$ from a GP with an RBF kernel having hyperparameters $(\theta,h)=(15,10)$. The true response function was generated by integrating the product of $\beta(s,t)$ and $x_{i}(s)$ over $s$, as in (8), and the observations $\{\bm{y}_{i}\}$ were given by adding standard normal noise to $M_{2}$ points extracted from the function. Moreover, the observation vectors $\{\bm{x}_{i}\}$ were derived by randomly selecting $M_{1}=75$ points from the functions and adding standard normal noise. For each of $M_{2}=5$, $10$, and $20$, we generated $N=50$ observations as a training set and $150$ observations as a test set.

For each value of $M_{2}$, we generated $50$ datasets using the above procedure and analyzed each dataset using natural splines and (10) on the training sample of size $N$, fixing $K_{1}$ at $10$ and varying $K_{2}$ from $4$ to $50$. We then examined the relationship between the number of basis functions $K_{2}$ of the response function and the MSE for the test data.

Table 4. MSEs of function-on-function regressions for different basis selection methods, averaged over $50$ simulated datasets.

	Scenario (C)			Scenario (D)
Method	$M_{2}=5$	$M_{2}=10$	$M_{2}=20$	$N=5$	$N=10$	$N=20$
CV	9.263	8.652	10.744	8.021	5.257	3.500
Fixed	9.976	9.018	10.881	8.033	5.399	3.502
cAIC	337.620	99.037	11.761	18.746	9.602	8.214

The left panel in Figure 4 illustrates the change in the MSE values with the increasing number of bases for a representative dataset. The results show that the MSE reaches its maximum when the number of bases of the response function equals the number of observation points $M_{2}$ of the response, after which the MSE decreases. As the individual prediction targets are functions, the prediction (interpolation of the predicted function) improves with an increase in the number of bases. Table 4 shows the results when $K_{2}$ is selected by CV, fixed, or selected by cAIC. As indicated, basis selection via cAIC results in poor prediction performance. In particular, for $M_{2}=5$ and $M_{2}=10$, the cAIC-selected models fail to predict the response function, either because the number of bases is too small to represent the function or because of overfitting. This poor performance can be attributed to the fact that the basis of the response function itself is considered, suggesting that the choice of basis is particularly sensitive in this scenario. In contrast, choosing a large number of bases by CV (the right panel in Figure 4) or fixing the number of bases at a large value contributes to good interpolation performance.

Scenario (D)

In this section, we examine the relationship between $K_{1}$ (number of basis functions for the predictor function $x$) and $N$ (sample size). The generating process for the functions $x$, $\beta$, and $y$ in Equation (8) is the same as in the previous section. The observation vectors $\{\bm{x}_{i}\}$ and $\{\bm{y}_{i}\}$ are both obtained by randomly selecting $75$ points from the functions $x_{i}$ and $y_{i}$, respectively, and adding centered Gaussian noise with unit variance. For $N=5$, $10$, and $20$, we generated $N$ observations as the training set and $150$ observations as the test set.

For each $N$, we generated $50$ datasets and analyzed each one using (10), varying $K_{1}$ from $4$ to $50$ and fixing $K_{2}=10$ natural spline basis functions. We investigated the relationship between the number of basis functions for the predictor function $K_{1}$ and the MSE for different training sample sizes $N$.

The simulation results are given in Figure 5 and Table 4. The left panel in Figure 5 illustrates the change in MSE with sample size $N$ for a representative dataset. Once again, the double-descent phenomenon is observed in this scenario. This result is essentially the same as in Scenario (A), as it involves the relationship between the sample size and the number of basis functions for the predictor function (although the number of observation points $M_{1}$ must be greater than $N$). The right panel in Figure 5 shows that CV tends to select excess bases; Table 4 confirms that, as before, using a larger number of basis functions results in better prediction accuracy.

These simulation results highlight the benefits of using excess basis functions in the FonF model for both the response function (Scenario C) and the predictor function (Scenario D). The double-descent phenomenon is evident in both scenarios, with the prediction accuracy improving as the number of basis functions increases beyond the sample size or the number of observation points for the response function. These findings underscore the practical importance of considering the interplay between the sample size, number of observation points, and number of bases.

5. Application to real datasets

This section provides examples of the double-descent phenomenon in functional regression, as evidenced by empirical data. We examine Scenario (A) across two commonly used datasets.

5.1. Gasoline Dataset

First, we consider the “gasoline” dataset from the R package “refund” (Goldsmith et al., 2024). This dataset comprises the octane numbers of 60 gasoline samples and their near-infrared reflectance spectra. The octane number is a scalar indicator of the combustion quality of the gasoline, and each near-infrared reflectance spectrum, measured at 401 wavelengths, reflects the molecular structure of the substance.

In this analysis, following Reiss and Ogden (2007) and Reiss and Ogden (2009), we treated the near-infrared reflectance spectra as a functional explanatory variable and the octane number as the response. Predictions were based on the minimum norm interpolator (7), with the number of natural spline basis functions varied. We randomly selected 10 observations as the training data and calculated the MSE of the predictions on the remaining 50 observations, which served as the test data.

The MSEs for varying numbers of basis functions are shown in Figure 6. The MSE peaks when the number of basis functions equals the size of the training sample ($10$) and then gradually decreases. Notably, when the number of basis functions exceeds 50, the MSE becomes smaller than when fewer basis functions are used. This outcome suggests that using a large number of basis functions can indeed enhance prediction accuracy for real data, as the double-descent curve shown here illustrates.
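The design of this experiment can be mimicked on synthetic data. The following Python sketch is not the analysis of the gasoline data: the curves, the coefficient function, the noise levels, and the cosine basis (used in place of natural splines) are all assumptions made for illustration, and the design matrix is formed by a simple Riemann sum, so the fit is only a simplified stand-in for the interpolator (7). It computes the minimum-norm least-squares solution from 10 training curves observed at 401 points and records the test MSE on 50 further curves as the number of bases `K` grows.

```python
import numpy as np

rng = np.random.default_rng(0)

M = 401        # grid points per curve (matches the number of wavelengths)
N_TRAIN = 10   # training sample size used in the text
N_TEST = 50    # test sample size used in the text
t = np.linspace(0.0, 1.0, M)

def cosine_basis(t, K):
    """K cosine basis functions evaluated at t (shape: len(t) x K)."""
    return np.cos(np.pi * np.outer(t, np.arange(K)))

def simulate(n):
    """Synthetic curves and scalar responses (hypothetical generating process)."""
    coef = rng.normal(size=(n, 30)) / (1.0 + np.arange(30))    # decaying coefficients
    X = coef @ cosine_basis(t, 30).T + 0.1 * rng.normal(size=(n, M))
    beta = np.sin(2.0 * np.pi * t)                             # assumed coefficient function
    y = X @ beta / M + 0.05 * rng.normal(size=n)               # Riemann sum of x(t) * beta(t)
    return X, y

def design(X, K):
    """Basis scores: Riemann-sum approximation of the integrals of x_i(t) * phi_k(t)."""
    return X @ cosine_basis(t, K) / M

def min_norm_fit(Z, y):
    """Minimum-norm least-squares coefficients via the Moore-Penrose pseudoinverse."""
    return np.linalg.pinv(Z) @ y

X_tr, y_tr = simulate(N_TRAIN)
X_te, y_te = simulate(N_TEST)

test_mse = {}
for K in range(2, 81):
    b = min_norm_fit(design(X_tr, K), y_tr)
    test_mse[K] = float(np.mean((design(X_te, K) @ b - y_te) ** 2))

for K in (5, 10, 20, 80):
    print(K, round(test_mse[K], 5))
```

Plotting `test_mse` against `K` gives a curve that can be compared with Figure 6; where the peak falls and how far the error drops afterwards depend on the assumed generating process.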

5.2. Diffusion Tensor Imaging Dataset

Next, we consider the diffusion tensor imaging (DTI) dataset, which is commonly used in functional data analysis and is stored as “DTI” in the R package “refund”. DTI is a modality based on magnetic resonance imaging (MRI) that allows the diffusion of water in the brain to be tracked. One hundred patients were scanned approximately once a year and took the PASAT (Paced Auditory Serial Addition Test), a neuropsychological test used to assess cognitive function.

Within this framework, following Goldsmith et al. (2011) and Goldsmith et al. (2012), we considered the fractional anisotropy tract profiles of the corpus callosum area (CCA) as a functional explanatory variable to predict the subject’s PASAT score as a response. Although patients may visit the clinic multiple times, each visit is treated as a distinct data point; data with missing values were removed. This resulted in a sample size $N=334$, with $93$ observation points (i.e., $M=93$) for the explanatory variable CCA. We performed predictions based on (7) with natural spline bases, varying the number of bases. A training sample of size 20 was used, with the remaining 314 observations serving as the test data.

As illustrated in Figure 7, double descent is evident. The MSE peaks when the number of basis functions is approximately equal to the training sample size and decreases smoothly thereafter. In this case, the MSE does not decrease as much as for the gasoline dataset, possibly because the functional form of the explanatory variable is simple and a few basis functions suffice to represent it. Nevertheless, the double-descent phenomenon clearly occurs, which indicates the risk of the conventional practice of searching only among numbers of basis functions smaller than the training sample size in order to prevent overfitting.

6. Discussion

This study questions the conventional notion that the number of basis functions should be smaller than the number of observation points and argues for the benefits of considering an excess number of basis functions in FDA. In Section 3, we argued that in functional regression, if the number of basis functions exceeds a certain threshold, the double-descent phenomenon can be observed and better prediction accuracy can potentially be achieved. We demonstrated this phenomenon through numerical experiments and found that optimal prediction accuracy can be attained to the right of the peak of the double-descent curve. Importantly, this phenomenon is not merely a subject of theoretical analysis or numerical experiments but can also be observed in real-world datasets. In both the gasoline and DTI datasets, a clear double descent was observed, with the gasoline dataset attaining its best prediction accuracy beyond the peak. These findings provide valuable guidance for the analysis of functional data, suggesting that when selecting the number of basis functions, one should consider a wider range of candidates and not be limited by the sample size or the number of observation points.

Future research should extend investigations of the practicality of this phenomenon to different types of datasets and models, including functional time series. Additionally, beyond the minimum norm interpolator, the advantage of excess basis functions may be further supported by ridge regression, although this would require tuning parameter selection. Moreover, the theoretical foundations of the double-descent phenomenon in functional data analysis should be more deeply explored. While this study provided empirical evidence and intuitive explanations, a rigorous mathematical analysis of the conditions under which the phenomenon occurs and its relationship to the properties of the functional data and the chosen basis functions would strengthen the understanding and applicability of our findings.

Computer Programs

The computer programs used in this manuscript to demonstrate the double-descent curve in Section 4 and the application presented in Section 5 have been developed for execution in the R statistical computing environment. These programs are publicly available at the GitHub repository: https://github.com/TomWaka/DD-FDR.

Acknowledgements

T. Wakayama was supported by JSPS KAKENHI (22J21090) and H. Matsui was supported by JSPS KAKENHI (23K11005).

References

Akaike [1973] Hirotugu Akaike. Information theory and an extension of the maximum likelihood principle. Second international symposium on information theory, 1:267–281, 1973.
Araki et al. [2009] Yuko Araki, Sadanori Konishi, Shuichi Kawano, and Hidetoshi Matsui. Functional regression modeling via regularized Gaussian basis expansions. Annals of the Institute of Statistical Mathematics, 61:811–833, 2009. doi: 10.1007/s10463-007-0161-1.
Banerjee and Roy [2014] Sudipto Banerjee and Anindya Roy. Linear Algebra and Matrix Analysis for Statistics. Chapman and Hall/CRC, 1st edition, 2014. doi: 10.1201/b17040.
Bartlett et al. [2020] Peter L Bartlett, Philip M Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020. doi: 10.1073/pnas.1907378117. URL https://doi.org/10.1073/pnas.1907378117.
Bedrick and Tsai [1994] Edward J. Bedrick and Chih-Ling Tsai. Model Selection for Multivariate Regression in Small Samples. Biometrics, 50(1):226–231, 1994. doi: 10.2307/2533213. URL https://doi.org/10.2307/2533213.
Belkin et al. [2018] Mikhail Belkin, Siyuan Ma, and Soumik Mandal. To Understand Deep Learning We Need to Understand Kernel Learning. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 541–549. PMLR, 2018. URL https://proceedings.mlr.press/v80/belkin18a.html.
Belkin et al. [2019] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019. doi: 10.1073/pnas.1903070116. URL https://doi.org/10.1073/pnas.1903070116.
Belkin et al. [2020] Mikhail Belkin, Daniel Hsu, and Ji Xu. Two Models of Double Descent for Weak Features. SIAM Journal on Mathematics of Data Science, 2(4):1167–1180, 2020. doi: 10.1137/20M1336072. URL https://doi.org/10.1137/20M1336072.
Fujii and Konishi [2006] Toru Fujii and Sadanori Konishi. Nonlinear regression modeling via regularized wavelets and smoothing parameter selection. Journal of Multivariate Analysis, 97(9):2023–2033, 2006. doi: 10.1016/j.jmva.2005.12.009. URL https://www.sciencedirect.com/science/article/pii/S0047259X06000856.
Goldsmith et al. [2011] Jeff Goldsmith, Jennifer Bobb, Ciprian M Crainiceanu, Brian Caffo, and Daniel Reich. Penalized Functional Regression. Journal of Computational and Graphical Statistics, 20(4):830–851, 2011. doi: 10.1198/jcgs.2010.10007. URL https://doi.org/10.1198/jcgs.2010.10007.
Goldsmith et al. [2012] Jeff Goldsmith, Ciprian M Crainiceanu, Brian Caffo, and Daniel Reich. Longitudinal penalized functional regression for cognitive outcomes on neuronal tract measurements. Journal of the Royal Statistical Society: Series C (Applied Statistics), 61(3):453–469, 2012. doi: 10.1111/j.1467-9876.2011.01031.x. URL https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9876.2011.01031.x.
Goldsmith et al. [2024] Jeff Goldsmith, Fabian Scheipl, Lei Huang, Julia Wrobel, Chongzhi Di, Jonathan Gellar, Jaroslaw Harezlak, Mathew W. McLean, Bruce Swihart, Luo Xiao, Ciprian Crainiceanu, Philip T. Reiss, and Erjia Cui. refund: Regression with Functional Data, 2024. URL https://CRAN.R-project.org/package=refund. R package version 0.1-35.
Green and Silverman [1993] Peter J. Green and Bernard W. Silverman. Nonparametric Regression and Generalized Linear Models: A roughness penalty approach. Chapman and Hall/CRC, 1st edition, 1993. doi: 10.1201/b15710. URL https://doi.org/10.1201/b15710.
Hastie and Mallows [1993] Trevor Hastie and Colin Mallows. [A Statistical View of Some Chemometrics Regression Tools]: Discussion. Technometrics, 35(2):140–143, 1993. doi: 10.2307/1269658.
Hastie et al. [2009] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, volume 2. Springer New York, 2009. ISBN 978-0-387-84858-7. doi: 10.1007/978-0-387-84858-7. URL https://doi.org/10.1007/978-0-387-84858-7.
Hastie et al. [2022] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. The Annals of Statistics, 50(2):949–986, 2022. doi: 10.1214/21-AOS2133. URL https://doi.org/10.1214/21-AOS2133.
Horváth and Kokoszka [2012] Lajos Horváth and Piotr Kokoszka. Inference for Functional Data with Applications. Springer New York, 2012. doi: 10.1007/978-1-4614-3655-3. URL https://doi.org/10.1007/978-1-4614-3655-3.
Hsing and Eubank [2015] Tailen Hsing and Randall Eubank. Theoretical Foundations of Functional Data Analysis, with an Introduction to Linear Operators. John Wiley & Sons, Ltd, 2015. doi: 10.1002/9781118762547. URL https://doi.org/10.1002/9781118762547.
James et al. [2021] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning: with Applications in R. Springer New York, 2021. doi: 10.1007/978-1-0716-1418-1. URL https://doi.org/10.1007/978-1-0716-1418-1.
Kokoszka and Reimherr [2017] Piotr Kokoszka and Matthew Reimherr. Introduction to Functional Data Analysis. Chapman and Hall/CRC, 2017. doi: 10.1201/9781315117416. URL https://doi.org/10.1201/9781315117416.
Konishi and Kitagawa [1996] Sadanori Konishi and Genshiro Kitagawa. Generalised information criteria in model selection. Biometrika, 83(4):875–890, 1996.
Matsui et al. [2009] Hidetoshi Matsui, Shuichi Kawano, and Sadanori Konishi. Regularized functional regression modeling for functional response and predictors. Journal of Math-for-Industry, 1:17–25, 2009.
Misiakiewicz and Montanari [2023] Theodor Misiakiewicz and Andrea Montanari. Six lectures on linearized neural networks. arXiv preprint arXiv:2308.13431, 2023.
Müller [2005] Hans-Georg Müller. Functional Modelling and Classification of Longitudinal Data. Scandinavian Journal of Statistics, 32(2):223–240, 2005. doi: 10.1111/j.1467-9469.2005.00429.x. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9469.2005.00429.x.
R Core Team [2024] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2024. URL https://www.R-project.org/.
Ramsay and Dalzell [1991] James O. Ramsay and Catherine J Dalzell. Some Tools for Functional Data Analysis. Journal of the Royal Statistical Society: Series B (Methodological), 53(3):539–561, 1991. doi: 10.1111/j.2517-6161.1991.tb01844.x. URL https://doi.org/10.1111/j.2517-6161.1991.tb01844.x.
Ramsay and Silverman [2005] James O. Ramsay and Bernard W. Silverman. Functional Data Analysis. Springer New York, 2nd edition, 2005. doi: 10.1007/b98888. URL https://doi.org/10.1007/b98888.
Rasmussen and Williams [2006] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.
Reiss and Ogden [2007] Philip T. Reiss and R. Todd Ogden. Functional Principal Component Regression and Functional Partial Least Squares. Journal of the American Statistical Association, 102(479):984–996, 2007. doi: 10.1198/016214507000000527.
Reiss and Ogden [2009] Philip T. Reiss and R. Todd Ogden. Smoothing parameter selection for a class of semiparametric linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2):505–523, 2009. doi: 10.1111/j.1467-9868.2008.00695.x. URL https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9868.2008.00695.x.
Schaeffer et al. [2024] Rylan Schaeffer, Zachary Robertson, Akhilan Boopathy, Mikail Khona, Kateryna Pistunova, Jason William Rocks, Ila R Fiete, Andrey Gromov, and Sanmi Koyejo. Double Descent Demystified: Identifying, Interpreting & Ablating the Sources of a Deep Learning Puzzle. In The Third Blogpost Track at ICLR 2024, 2024. URL https://openreview.net/forum?id=muC7uLvGHr.
Schwarz [1978] Gideon Schwarz. Estimating the Dimension of a Model. The Annals of Statistics, 6(2):461–464, 1978. doi: 10.1214/aos/1176344136.
Stone [1974] Mervyn Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society: Series B (Methodological), 36(2):111–133, 1974.
Sugiura [1978] Nariaki Sugiura. Further analysis of the data by Akaike’s information criterion and the finite corrections. Communications in Statistics - Theory and Methods, 7(1):13–26, 1978. doi: 10.1080/03610927808827599. URL https://doi.org/10.1080/03610927808827599.
Wakayama and Sugasawa [2024] Tomoya Wakayama and Shonosuke Sugasawa. Functional Horseshoe Smoothing for Functional Trend Estimation. Statistica Sinica, 34(3), 2024. doi: 10.5705/ss.202022.0297.
Wang et al. [2016] Jane-Ling Wang, Jeng-Min Chiou, and Hans-Georg Müller. Functional data analysis. Annual Review of Statistics and Its Application, 3:257–295, 2016. doi: 10.1146/annurev-statistics-041715-033624. URL https://www.annualreviews.org/content/journals/10.1146/annurev-statistics-041715-033624.
Wood [2017] Simon N. Wood. Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC, 2nd edition, 2017. doi: 10.1201/9781315370279. URL https://doi.org/10.1201/9781315370279.
Zhang et al. [2021] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021. doi: 10.1145/3446776. URL https://doi.org/10.1145/3446776.