Reconciling Functional Data Regression with Excess Bases

Tomoya Wakayama Graduate School of Economics, The University of Tokyo
7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan
 and  Hidetoshi Matsui Faculty of Data Science, Shiga University
1-1-1 Banba, Hikone, Shiga, Japan
(Date: July 7, 2024, Contact: [email protected])
Abstract.

As advances in measuring instruments and computers have accelerated the collection of massive amounts of data, functional data analysis (FDA) has attracted a surge of attention. FDA treats longitudinal data as a set of functions on which inference, including regression, is performed. Functionalizing data typically involves fitting the data with basis functions. Conventionally, the number of basis functions is chosen to be smaller than the sample size. This paper casts doubt on this convention. Recent statistical theory has revealed the so-called double-descent phenomenon, in which excess parameters overcome overfitting and lead to precise interpolation. Applying this idea to the choice of the number of bases used for functional data, we show that choosing an excess number of bases can lead to more accurate predictions. Specifically, we explore this phenomenon in a functional regression context and examine its validity through numerical experiments. In addition, we analyze two real-world datasets to demonstrate that the double-descent phenomenon goes beyond theoretical and numerical experiments, confirming its importance in practical applications.

Keywords. Basis expansion; Double-descent; Functional data regression; Minimum norm interpolator

1. Introduction

Functional data analysis (FDA) has emerged as a powerful tool for analyzing longitudinal data across diverse fields, including biology, medicine, economics, and the social sciences (Ramsay and Silverman, 2005; Horváth and Kokoszka, 2012; Kokoszka and Reimherr, 2017; Wang et al., 2016). The fundamental concept of FDA is to represent the longitudinally measured data for each individual as a smooth function and then analyze the collection of functions using various statistical techniques (Hsing and Eubank, 2015). This approach offers several advantages, such as reducing observational errors through smoothing and accommodating varying time points and numbers of observations for different subjects (e.g., Wakayama and Sugasawa, 2024).

In FDA, basis expansion is a widely used technique for transforming longitudinal data into functional data (Fujii and Konishi, 2006; Araki et al., 2009). Basis expansion is known for its ability to smooth noisy data and reveal the underlying structure (Green and Silverman, 1993; Hastie et al., 2009). In numerous FDA methodologies, such as functional regression and time series analysis, selecting the number of basis functions is a pivotal issue due to its substantial impact on prediction accuracy. The number of bases is selected from a range of values smaller than the number of observation points using information criteria (Akaike, 1973; Schwarz, 1978; Konishi and Kitagawa, 1996) or by employing cross-validation (Stone, 1974). This practice aims to avoid overfitting, i.e., it seeks to mitigate the explosion of interpolated values between observation points. However, recent developments in statistical theory suggest that this approach may need to be reconsidered to achieve better prediction performance.

Overfitting has long been a challenge in FDA; however, recent statistical theory has begun to reconcile this issue. Indeed, Zhang et al. (2021) empirically showed that deep neural network models with a large number of parameters that perfectly fit the training data can yield near-optimal accuracy on the test data. This is referred to as the double-descent phenomenon (Belkin et al., 2018, 2019), in which the interpolation error follows a conventional U-shaped curve up to a threshold, but decreases again after peaking at the threshold. In addition, Hastie et al. (2022) and Belkin et al. (2020) theoretically revealed that the double-descent phenomenon can occur for linear regression models in several situations and demonstrated it empirically. For more detailed explanations, see James et al. (2021), Schaeffer et al. (2024), Misiakiewicz and Montanari (2023), and references therein. Further, James et al. (2021) demonstrated the double-descent phenomenon through a simple spline fitting. Figure 1 illustrates the phenomenon by fitting curves to measurement points. The panels on the left depict 15 numerically generated data points and the spline curves fitted with the minimum norm interpolator (Hastie et al., 2022; Bartlett et al., 2020) for four different numbers of basis functions. A detailed description of the methodology is given in Section 2. The right panel displays the mean squared error as a function of the number of basis functions. When the number of bases equals the number of measurements, the spline curve becomes overly undulating, which causes the mean squared error to explode. However, as the number of bases increases further, the fitted curve becomes less undulating and the mean squared error decreases again. This suggests that using a large number of basis functions, especially a number larger than the sample size, may improve the accuracy of functional data analysis techniques.

Figure 1. Left: Curve fits when the number of bases is 4 (upper left), 20 (upper right), 40 (lower left), and 120 (lower right). Right: MSE for varying numbers of bases.

In this paper, we advocate the use of a large number of basis functions, in combination with the minimum norm interpolator, to transform observed longitudinal data into functional data. Additionally, we apply the minimum norm interpolator to estimate functional regression models, which represent relationships between predictors and responses, either or both of which are given as functional data. We discuss four representative functional data regression scenarios where double descent is particularly relevant. We examine the effectiveness of the proposed approach within the four scenarios through simulation studies and applications to real-world datasets.

The remainder of the paper is organized as follows. Section 2 introduces functionalization with an excess number of basis functions. In Section 3, we discuss regression methods for functional data and their relation to the double-descent phenomenon. We validate our approach through numerical experiments in Section 4. Section 5 demonstrates the practical relevance of our proposal through applications to real datasets. Finally, we summarize our main points and suggest future research directions in Section 6.

2. Functionalization

Functionalization is a crucial first step in functional data analysis. Without appropriate functionalization, extracting meaningful descriptive statistics or reaching accurate inferential conclusions becomes challenging in regression and classification. The process of functionalization involves transforming discrete, noise-corrupted observations into smooth functions that capture the underlying patterns and trends in the data (Ramsay and Silverman, 2005).

Suppose we have $N$ sets of time-course observations, where the $i$-th subject has $M_i$ observations $\{x_{i1},x_{i2},\ldots,x_{iM_i}\}$ at time points $\{t_{i1},t_{i2},\ldots,t_{iM_i}\}$ $(i=1,2,\ldots,N)$, respectively, and the $t_{ij}$ are elements of a domain $\mathcal{T}\subset\mathbb{R}$. We then consider transforming the time-course data into functions using basis expansions (Ramsay and Silverman, 2005; Wang et al., 2016). Let $\{\phi_k:\mathcal{T}\to\mathbb{R}\}_{k=1}^{K}$ be a set of $K$ basis functions. We assume that each observation $x_{ij}$ can be expressed by the following regression form:

$$x_{ij}=\sum_{k=1}^{K}w_{ik}\phi_{k}(t_{ij})+\varepsilon_{ij}=\bm{w}_{i}^{\top}\bm{\phi}(t_{ij})+\varepsilon_{ij}\quad(j=1,\ldots,M_{i}), \tag{1}$$

where $\bm{w}_i=(w_{i1},w_{i2},\ldots,w_{iK})^{\top}$ is a vector of coefficients, $\bm{\phi}(t)=(\phi_1(t),\phi_2(t),\ldots,\phi_K(t))^{\top}$ is a vector of basis functions, and $\varepsilon_{i1},\ldots,\varepsilon_{iM_i}$ are independent noise terms with mean $0$ and variance $\sigma_i^2$. Common choices for basis functions include the Fourier basis, spline basis, and wavelet basis (Ramsay and Silverman, 2005).

We then calculate the optimal coefficient vector $\bm{w}_i$. Using the notation $\bm{x}_i=(x_{i1},x_{i2},\ldots,x_{iM_i})^{\top}$, $\Phi=(\bm{\phi}(t_{i1}),\bm{\phi}(t_{i2}),\ldots,\bm{\phi}(t_{iM_i}))^{\top}$, and $\bm{\varepsilon}_i=(\varepsilon_{i1},\varepsilon_{i2},\ldots,\varepsilon_{iM_i})^{\top}$, the regression model (1) can be expressed as $\bm{x}_i=\Phi\bm{w}_i+\bm{\varepsilon}_i$. We estimate $\bm{w}_i$ using the minimum norm interpolator (Hastie et al., 2022; Bartlett et al., 2020):

$$\operatorname*{argmin}_{\bm{w}_i\in\mathbb{R}^{K}}\|\bm{w}_i\|\quad\mathrm{s.t.}\quad\bm{w}_i\ \ \mathrm{minimizes}\ \ \|\bm{x}_i-\Phi\bm{w}_i\|,$$

where $\|\cdot\|$ denotes the Euclidean norm. The solution to the above optimization problem is explicitly given by

$$\widehat{\bm{w}}_i=(\Phi^{\top}\Phi)^{\dagger}\Phi^{\top}\bm{x}_i, \tag{2}$$

where $(\Phi^{\top}\Phi)^{\dagger}$ denotes the Moore-Penrose pseudo-inverse (e.g., Banerjee and Roy, 2014) of $\Phi^{\top}\Phi$. Using the estimated coefficients $\widehat{\bm{w}}_i$, we express the functional representation of the $i$-th subject's data as $x_i(t)=\widehat{\bm{w}}_i^{\top}\bm{\phi}(t)$.
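As a minimal sketch of the estimator (2), the following Python snippet functionalizes one subject's observations with more bases than observation points. The Fourier basis on $[0,1]$, the synthetic data, and the helper names `fourier_basis` and `min_norm_coefficients` are illustrative assumptions rather than part of the method description above. The sketch relies on the identity $(\Phi^{\top}\Phi)^{\dagger}\Phi^{\top}=\Phi^{\dagger}$, so the pseudo-inverse is applied to $\Phi$ directly.

```python
import numpy as np


def fourier_basis(t, K):
    """Evaluate K Fourier basis functions on [0, 1] at the points t.

    Returns an array of shape (len(t), K): a constant column followed by
    sin/cos pairs of increasing frequency (hypothetical helper).
    """
    t = np.asarray(t, dtype=float)
    cols = [np.ones_like(t)]
    freq = 1
    while len(cols) < K:
        cols.append(np.sin(2 * np.pi * freq * t))
        if len(cols) < K:
            cols.append(np.cos(2 * np.pi * freq * t))
        freq += 1
    return np.column_stack(cols)


def min_norm_coefficients(Phi, x):
    """Minimum norm least-squares solution (Phi^T Phi)^+ Phi^T x = Phi^+ x."""
    return np.linalg.pinv(Phi) @ x


rng = np.random.default_rng(0)
M, K = 15, 40                              # K > M: an excess number of bases
t_obs = np.sort(rng.uniform(0.0, 1.0, M))  # irregular observation times
x_obs = np.sin(2 * np.pi * t_obs) + 0.1 * rng.standard_normal(M)

Phi = fourier_basis(t_obs, K)              # design matrix, shape (M, K)
w_hat = min_norm_coefficients(Phi, x_obs)  # coefficient vector, shape (K,)

# Functional representation x(t) = w_hat^T phi(t), evaluated on a fine grid.
t_grid = np.linspace(0.0, 1.0, 200)
x_curve = fourier_basis(t_grid, K) @ w_hat

# Largest absolute deviation of the fitted curve from the observations.
max_train_residual = np.max(np.abs(Phi @ w_hat - x_obs))
```

When $\Phi$ has full row rank, which is the typical situation for $K>M_i$ and distinct time points, the fitted curve passes through every observation, and among all interpolating coefficient vectors `w_hat` has the smallest Euclidean norm.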

Regarding the choice of the number of basis functions $K$, traditional approaches often select $K$ to be smaller than the number of observations $M_i$ to avoid overfitting (Ramsay and Silverman, 2005). However, recent theoretical evaluations by Hastie et al. (2022) suggest that using a larger number of parameters (bases, in this context) can be beneficial in cases where the noise level is low and the model is misspecified. In light of these insights, we propose using an excess number of basis functions, combined with the minimum norm interpolator, for functionalization in FDA. This approach has the potential to capture more complex patterns in the data and improve the accuracy of interpolations or subsequent analyses, especially in low-noise settings or when the true underlying function does not perfectly align with the chosen basis.

3. Functional Regression Model

In this section, we construct estimators through basis expansions for three standard models.

3.1. Scalar on Function Regression

Consider an independent and identically distributed dataset $\mathcal{D}:=\{x_i,y_i\}_{i=1}^{N}$, with explanatory function $x_i(\cdot)\in L_2(\mathcal{S})$ on domain $\mathcal{S}\subset\mathbb{R}$ and scalar response variable $y_i\in\mathbb{R}$. Suppose that we are interested in predicting the response $y$ when a new $x$ is observed. We employ the following scalar-on-function regression model (SonF, Hastie and Mallows, 1993; Müller, 2005; Araki et al., 2009):

$$y_i=\int_{\mathcal{S}}x_i(s)\beta(s)\,ds+\varepsilon_i, \tag{3}$$

where $\beta\in L_2(\mathcal{S})$ is a functional coefficient and $\varepsilon_i$ is an error term with mean zero and finite variance. This model assumes a linear relationship between the functional predictor $x_i$ and the scalar response $y_i$, mediated by the functional coefficient $\beta$.

We can represent $x_i(s)$ and $\beta(s)$ using basis expansions:

$$x_i(s)=\sum_{k=1}^{K}w_{ik}\phi_k(s),\quad\mathrm{and}\quad\beta(s)=\sum_{k=1}^{K}b_k\phi_k(s),$$

where $\phi_k$ are the basis functions, $w_{ik}$ and $b_k$ are the corresponding coefficients for $x_i$ and $\beta$, respectively, and $K$ is the number of basis functions. The coefficients $w_{ik}$ are obtained using the minimum norm interpolator (2); therefore, the $w_{ik}$ are known here. For notational simplicity, we write the above expansion in vector form as

$$x_i(s)=\bm{w}_i^{(K)\top}\bm{\phi}^{(K)}(s),\quad\mathrm{and}\quad\beta(s)=\bm{b}^{(K)\top}\bm{\phi}^{(K)}(s), \tag{4}$$

where $\bm{\phi}^{(K)}(s):=(\phi_1(s),\ldots,\phi_K(s))^{\top}$, $\bm{w}_i^{(K)}:=(w_{i1},\ldots,w_{iK})^{\top}$, and $\bm{b}^{(K)}:=(b_1,\ldots,b_K)^{\top}$. The superscripts on the vectors are added to make the number of bases explicit.

Using the above expansion, we can rewrite (3) as

$$\begin{aligned} y_i &= \bm{w}_i^{(K)\top}\Phi^{(K)}\bm{b}^{(K)}+\varepsilon_i \\ &= \bm{z}_i^{\top}\bm{b}^{(K)}+\varepsilon_i, \end{aligned} \tag{5}$$

where $\Phi^{(K)}$ denotes the $K\times K$ matrix whose $(i,j)$-th entry is $\int_{\mathcal{S}}\phi_i(s)\phi_j(s)\,ds$, and $\bm{z}_i=\Phi^{(K)}\bm{w}_i^{(K)}$. Then, the joint equation for all observations can be written as

$$\bm{y}=Z\bm{b}^{(K)}+\bm{\varepsilon}, \tag{6}$$

where $\bm{y}=(y_1,y_2,\ldots,y_N)^{\top}$, $Z=(\bm{z}_1,\bm{z}_2,\ldots,\bm{z}_N)^{\top}$ is the $N\times K$ matrix whose $i$-th row is $\bm{z}_i^{\top}$, and $\bm{\varepsilon}=(\varepsilon_1,\varepsilon_2,\ldots,\varepsilon_N)^{\top}$.
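For completeness, the step from (3) to (5) can be written out by substituting the expansions (4) into the integral. The display below is a worked restatement using only the symbols already defined; it adds no new assumptions.

```latex
\begin{align*}
\int_{\mathcal{S}} x_i(s)\,\beta(s)\,ds
  &= \int_{\mathcal{S}} \bm{w}_i^{(K)\top}\bm{\phi}^{(K)}(s)\,
     \bm{\phi}^{(K)\top}(s)\,\bm{b}^{(K)}\,ds \\
  &= \bm{w}_i^{(K)\top}
     \left(\int_{\mathcal{S}} \bm{\phi}^{(K)}(s)\,\bm{\phi}^{(K)\top}(s)\,ds\right)
     \bm{b}^{(K)}
   = \bm{w}_i^{(K)\top}\Phi^{(K)}\bm{b}^{(K)} .
\end{align*}
```

Because $\Phi^{(K)}$ is symmetric, $\bm{w}_i^{(K)\top}\Phi^{(K)}\bm{b}^{(K)}=(\Phi^{(K)}\bm{w}_i^{(K)})^{\top}\bm{b}^{(K)}=\bm{z}_i^{\top}\bm{b}^{(K)}$, which is the second line of (5).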

When $K<N$, the ordinary least squares estimator $(Z^{\top}Z)^{-1}Z^{\top}\bm{y}$ can be used to estimate $\bm{b}^{(K)}$. However, we are interested in the case where $K$ can be larger than $N$, in which case $Z^{\top}Z$ is not invertible. We therefore use the minimum norm interpolator:

$$\operatorname*{argmin}_{\bm{b}^{(K)}}\|\bm{b}^{(K)}\|\quad\mathrm{s.t.}\quad\bm{b}^{(K)}\ \ \mathrm{minimizes}\ \ \|\bm{y}-Z\bm{b}^{(K)}\|,$$

which is equivalent to

$$\widehat{\bm{b}}^{(K)}=(Z^{\top}Z)^{\dagger}Z^{\top}\bm{y}. \tag{7}$$

In other words, we adopt $Z\widehat{\bm{b}}^{(K)}$ as the predictor of the new observations.
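The pipeline from (2) to (7) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Fourier basis on $[0,1]$, the synthetic predictors and responses, the trapezoidal approximation of $\Phi^{(K)}$, and the helper names `fourier_basis` and `gram_matrix` are all hypothetical choices, not the setup used in the experiments.

```python
import numpy as np


def fourier_basis(s, K):
    """K Fourier basis functions on [0, 1] evaluated at s, shape (len(s), K)."""
    s = np.asarray(s, dtype=float)
    cols = [np.ones_like(s)]
    freq = 1
    while len(cols) < K:
        cols.append(np.sin(2 * np.pi * freq * s))
        if len(cols) < K:
            cols.append(np.cos(2 * np.pi * freq * s))
        freq += 1
    return np.column_stack(cols)


def gram_matrix(K, n_grid=2001):
    """Approximate the K x K matrix of integrals of phi_k * phi_l over [0, 1]
    with the trapezoidal rule (plays the role of Phi^(K))."""
    s = np.linspace(0.0, 1.0, n_grid)
    h = s[1] - s[0]
    weights = np.full(n_grid, h)
    weights[[0, -1]] = h / 2
    B = fourier_basis(s, K)
    return B.T @ (weights[:, None] * B)


rng = np.random.default_rng(0)
N, M, K = 20, 60, 45                     # N < K < M
s_obs = (np.arange(M) + 0.5) / M         # common observation points (assumed)

# Synthetic predictors: random mixtures of 5 low-frequency basis functions.
C = rng.standard_normal((N, 5))
X = C @ fourier_basis(s_obs, 5).T + 0.05 * rng.standard_normal((N, M))
# With beta(s) = sin(2*pi*s), the noiseless integral of x_i * beta is C[:, 1] / 2.
y = 0.5 * C[:, 1] + 0.05 * rng.standard_normal(N)

# Step 1: functionalize each curve as in (2); row i of W is w_hat_i^T.
Phi_obs = fourier_basis(s_obs, K)        # shape (M, K)
W = X @ np.linalg.pinv(Phi_obs).T        # shape (N, K)

# Step 2: design matrix Z with rows z_i^T = (Phi^(K) w_i)^T, as in (5)-(6).
G = gram_matrix(K)                       # symmetric K x K matrix
Z = W @ G                                # shape (N, K)

# Step 3: minimum norm interpolator (7), using (Z^T Z)^+ Z^T = Z^+.
b_hat = np.linalg.pinv(Z) @ y

# Prediction for a new curve observed at the same points.
x_new = fourier_basis(s_obs, 5) @ rng.standard_normal(5)
w_new = np.linalg.pinv(Phi_obs) @ x_new
y_pred = (G @ w_new) @ b_hat

# Largest absolute training residual of the fitted regression.
train_residual = np.max(np.abs(Z @ b_hat - y))
```

Since $K>N$ in this sketch and $Z$ has full row rank for the simulated data, the fitted values reproduce the training responses up to rounding error; how well `y_pred` generalizes depends on the basis, the noise level, and $K$.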

Since, in real measurements, data are observed at a finite number of discrete time points, we need to take that number into account. Here, for brevity, the number of observation points is assumed to be common across all individuals. Let $M$ be the number of observation points of $x$ (the following discussion extends in a straightforward way to the case in which the number of observations is heterogeneous). Since $M$ controls the information contained in the regression model, it will have a significant impact on prediction accuracy.

Now, for precise prediction, we explore how to select the number of bases, which is the only value that the analyst can control. To investigate the relationship between the number of basis functions $K$, the sample size $N$, and the number of observation points $M$, and their impact on the double-descent phenomenon, we consider two scenarios:

  • (A) $N<M$: If $1\leq K<M$, the model in (6) is a regression problem with sample size $N$ and number of parameters $K$. As $K$ gradually increases from 1, a double-descent phenomenon with a peak at $K=N$ will be observed. This can be understood by regarding the original regression as an over-parameterized linear regression.

  • (B) $M<N$: In this case, since $\mathrm{rank}\,Z~(\leq M)$ is less than $N$, the double descent with respect to $N$ does not occur. Since the expressive power of the model in (6) is limited to less than the number of observation points if $M$ is small, accuracy will reach a ceiling even when the number of bases is increased.
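The rank bound in scenario (B) can be checked numerically. The sketch below assumes that all subjects share the same $M$ observation points, so that a single matrix $\Phi$ is used in (2) for every subject, and it again uses a hypothetical Fourier basis and random data purely for illustration.

```python
import numpy as np


def fourier_basis(t, K):
    """K Fourier basis functions on [0, 1] evaluated at t, shape (len(t), K)."""
    t = np.asarray(t, dtype=float)
    cols = [np.ones_like(t)]
    freq = 1
    while len(cols) < K:
        cols.append(np.sin(2 * np.pi * freq * t))
        if len(cols) < K:
            cols.append(np.cos(2 * np.pi * freq * t))
        freq += 1
    return np.column_stack(cols)


rng = np.random.default_rng(0)
N, M, K = 30, 8, 50                      # scenario (B): M < N, with excess bases
t_obs = (np.arange(M) + 0.5) / M         # common observation points (assumed)
Phi_obs = fourier_basis(t_obs, K)        # shape (M, K)

X = rng.standard_normal((N, M))          # N curves observed at the M points
W = X @ np.linalg.pinv(Phi_obs).T        # row i is w_hat_i^T from (2)

# Gram matrix Phi^(K), approximated by the trapezoidal rule on [0, 1].
s = np.linspace(0.0, 1.0, 2001)
h = s[1] - s[0]
weights = np.full(s.size, h)
weights[[0, -1]] = h / 2
B = fourier_basis(s, K)
G = B.T @ (weights[:, None] * B)

Z = W @ G                                # design matrix of (6), shape (N, K)
rank_Z = np.linalg.matrix_rank(Z)        # numerical rank of Z
```

Under this assumption every $\widehat{\bm{w}}_i$ lies in the row space of the $M\times K$ matrix $\Phi$, so `rank_Z` cannot exceed $M$ however large $K$ is chosen, which is the ceiling described in scenario (B).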

The model considered here is a simple linear regression model, and the concern in such a case is model misspecification. In real data analysis, the true functional data (i.e., the data generating process) is unknown, and there are features that cannot be captured by a finite set of basis functions chosen arbitrarily by the analyst. For example, approximating a function with a few dozen spline bases may not describe periodicity or the variation of spikes. In a rough sense, Equation (5) is considered a misspecified model. However, as stated in Section 5 of Hastie et al. (2022), even if the model is misspecified, increasing the dimension of the parameters will contribute to improved prediction accuracy. This implies that increasing the number of basis functions is also robust to model misspecification, providing further motivation for the use of excess basis functions in functional regression.

3.2. Function on Function Regression

Consider an independent and identically distributed dataset $\mathcal{D}:=\{x_i,y_i\}_{i=1}^{N}$, where $x_i(\cdot)\in L_2(\mathcal{S})$ is an explanatory function on domain $\mathcal{S}\subset\mathbb{R}$, and $y_i(\cdot)$ is a response function on domain $\mathcal{T}\subset\mathbb{R}$. Our goal is to predict the response function $y$ when a new function $x$ is observed. We adopt the following function-on-function regression model (FonF, Ramsay and Dalzell, 1991; Matsui et al., 2009):

$$y_{i}(t)=\int_{\mathcal{S}}\beta(s,t)x_{i}(s)\,ds+\varepsilon_{i}(t), \tag{8}$$

where $\beta(s,t)$ is a bivariate functional coefficient, and $\varepsilon_{i}(t)$ is an error process with mean zero and constant variance $\sigma^{2}$. This model assumes a linear relationship between the functional predictor $x_{i}$ and the functional response $y_{i}$, mediated by the bivariate functional coefficient $\beta$.
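To make Equation (8) concrete, the integral can be approximated on a grid. The following Python sketch is illustrative only: the function names, the trapezoid rule, and the toy coefficient $\beta(s,t)=st$ are assumptions introduced here and are not taken from the paper's implementation.

```python
import numpy as np

def trapezoid_weights(grid):
    """Trapezoid-rule quadrature weights for a sorted 1-D grid."""
    w = np.zeros_like(grid, dtype=float)
    d = np.diff(grid)
    w[:-1] += d / 2.0
    w[1:] += d / 2.0
    return w

def fonf_mean(X, beta, s_grid):
    """Approximate int_S beta(s, t) x_i(s) ds for every curve i and every t.

    X    : (N, S) array, x_i evaluated on s_grid
    beta : (S, T) array, beta(s, t) evaluated on s_grid x t_grid
    """
    w = trapezoid_weights(s_grid)
    return (X * w) @ beta  # shape (N, T)

# Toy check (hypothetical): beta(s, t) = s * t and x(s) = 1 on [0, 1],
# so the noise-free response is int_0^1 s * t ds = t / 2.
s_grid = np.linspace(0.0, 1.0, 101)
t_grid = np.linspace(0.0, 1.0, 5)
X = np.ones((2, s_grid.size))
beta = np.outer(s_grid, t_grid)
mean = fonf_mean(X, beta, s_grid)

# Adding the error process gives simulated responses y_i(t) on t_grid.
rng = np.random.default_rng(0)
Y = mean + rng.standard_normal(mean.shape)
```

Because the integrand is linear in $s$ in this toy case, the trapezoid rule recovers $t/2$ up to rounding error.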

Using basis expansion, as in Equation (4), we can represent the functional predictor, the bivariate functional coefficient, and the functional response as

$$x_{i}(s)=\bm{w}_{i}^{(K_{1})\top}\bm{\phi}^{(K_{1})}(s),\qquad \beta(s,t)=\bm{\phi}^{(K_{1})\top}(s)B\bm{\psi}^{(K_{2})}(t),\qquad y_{i}(t)=\bm{v}_{i}^{(K_{2})\top}\bm{\psi}^{(K_{2})}(t),$$

where $\bm{v}_{i}^{(K_{2})}=(v_{i1},\ldots,v_{iK_{2}})^{\top}$ is the coefficient vector of the bases $\bm{\psi}^{(K_{2})}(t)=(\psi_{1}(t),\ldots,\psi_{K_{2}}(t))^{\top}$, and $B$ is the coefficient matrix of $\bm{\phi}^{(K_{1})}(s)$ and $\bm{\psi}^{(K_{2})}(t)$. Here the coefficients $w_{ik}$ $(k=1,2,\ldots,K_{1})$ and $v_{il}$ $(l=1,2,\ldots,K_{2})$ are obtained using the minimum norm interpolator, as described in Equation (2). Substituting the basis function expansions into Equation (8), we obtain

$$\bm{v}_{i}^{(K_{2})\top}\bm{\psi}^{(K_{2})}(t)=\bm{w}_{i}^{(K_{1})\top}\Phi^{(K_{1})}B\bm{\psi}^{(K_{2})}(t)+\varepsilon_{i}(t). \tag{9}$$
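The coefficient vectors entering Equation (9) come from the minimum norm interpolator. As a rough illustration, and assuming that Equation (2) amounts to applying the Moore-Penrose pseudoinverse of the basis matrix evaluated at the observation points, the computation might look as follows. The cosine basis and all function names are hypothetical stand-ins (the paper uses natural splines in R).

```python
import numpy as np

def cosine_basis(s, K):
    """Hypothetical basis: cos(k*pi*s), k = 0..K-1, evaluated at points s in [0, 1]."""
    k = np.arange(K)
    return np.cos(np.pi * np.outer(s, k))  # shape (len(s), K)

def min_norm_coefficients(x_obs, s_obs, K):
    """Minimum-norm least-squares coefficients of one curve observed at s_obs."""
    P = cosine_basis(s_obs, K)                      # (M, K) basis matrix
    return np.linalg.pinv(P, rcond=1e-10) @ x_obs   # (K,) coefficient vector

s_obs = np.linspace(0.05, 0.95, 8)      # M = 8 observation points
x_obs = np.sin(2 * np.pi * s_obs)       # toy observations of one curve
w = min_norm_coefficients(x_obs, s_obs, K=20)   # K > M: excess bases
fitted = cosine_basis(s_obs, 20) @ w            # reproduces x_obs at s_obs
```

With more bases than observation points, infinitely many coefficient vectors interpolate the data; the pseudoinverse picks the one with the smallest Euclidean norm.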

To estimate the coefficient matrix B𝐵Bitalic_B, we consider solving the following minimization problem:

$$\operatorname*{argmin}_{B\in\mathbb{R}^{K_{1}\times K_{2}}}\|\operatorname{vec}(B)\|\quad\text{s.t.}\quad B\ \text{minimizes}\ \|V\bm{\psi}^{(K_{2})}(t)-ZB\bm{\psi}^{(K_{2})}(t)\|_{L_{2}},$$

where $V=(\bm{v}_{1}^{(K_{2})},\bm{v}_{2}^{(K_{2})},\ldots,\bm{v}_{N}^{(K_{2})})^{\top}$, $\operatorname{vec}(\cdot)$ is the vectorization operator of a matrix, and $\|\cdot\|_{L_{2}}$ is the $L_{2}$ norm. Then, minimizing the least-squares error yields

$$\operatorname{vec}(\widehat{B})=(\Psi\otimes Z^{\top}Z)^{\dagger}\operatorname{vec}(Z^{\top}V\Psi), \tag{10}$$

where $\Psi$ is a $K_{2}\times K_{2}$ matrix whose $(i,j)$-th entry is $\int_{\mathcal{T}}\psi_{i}(t)\psi_{j}(t)\,dt$. We consider this to be an estimator for the FonF problem.
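Equation (10) can be read as the minimum-norm solution of the normal equations $Z^{\top}ZB\Psi=Z^{\top}V\Psi$, vectorized column by column using the symmetry of $\Psi$. The sketch below evaluates it with NumPy under stated assumptions: $Z$, $V$, and $\Psi$ are taken as given arrays (random placeholders here), the function name is hypothetical, and the explicit `rcond` cutoff is a numerical choice made for this example.

```python
import numpy as np

def fonf_min_norm_estimator(Z, V, Psi, rcond=1e-10):
    """vec(B_hat) = (Psi kron Z'Z)^+ vec(Z' V Psi), using column-major vec."""
    K1, K2 = Z.shape[1], V.shape[1]
    A = np.kron(Psi, Z.T @ Z)                    # (K1*K2, K1*K2)
    rhs = (Z.T @ V @ Psi).flatten(order="F")     # column-major vectorization
    b = np.linalg.pinv(A, rcond=rcond) @ rhs     # minimum-norm solution
    return b.reshape((K1, K2), order="F")        # undo the vectorization

# Placeholder inputs with more predictor bases than samples (K1 > N).
rng = np.random.default_rng(1)
N, K1, K2 = 5, 8, 4
Z = rng.normal(size=(N, K1))
V = rng.normal(size=(N, K2))
R = rng.normal(size=(K2, K2))
Psi = R @ R.T + np.eye(K2)        # symmetric positive definite Gram matrix
B_hat = fonf_min_norm_estimator(Z, V, Psi)
```

When $\Psi$ is invertible, as in this toy setup, the estimator reduces to $Z^{\dagger}V$, which gives a convenient check; forming the Kronecker product explicitly is only practical for small $K_{1}K_{2}$.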

In practice, the functional predictor and response are observed at a finite number of discrete time points. Let $M_{1}$ and $M_{2}$ be the numbers of time points for $x$ and $y$, respectively, assumed, for simplicity, to be the same across individuals. The dimensions of the observed data can affect the properties of the estimator. There are many possible combinations of the sample size $N$, the numbers of observation points $M_{1}$ and $M_{2}$, and the numbers of basis functions $K_{1}$ and $K_{2}$. However, two scenarios are particularly relevant to the double-descent phenomenon:

  • (C) $M_{2}$ and $K_{2}$: The parameter $K_{2}$ directly influences the prediction of the function $y$. Based on the idea that a function can be predicted with good accuracy if the unobserved parts are properly interpolated, increasing $K_{2}$ beyond $M_{2}$ may lead to the double-descent phenomenon in terms of prediction accuracy. In other words, the phenomenon can be attributed to the accuracy of the functionalization of the response.

  • (D) $N$ and $K_{1}$: Following the same principle as (A) in the previous section, by increasing the number of basis functions for $x$ beyond the sample size $N$, a double-descent phenomenon can be observed as long as $M_{1}>N$. This corresponds to interpolating unobserved parts of the functional predictor using excess basis functions.

The double-descent phenomenon in the FonF model can manifest in two ways: through the functionalization of the response (Scenario C) and through the interpolation of the functional predictor (Scenario D). By using excess basis functions in both the predictor and response expansions, we may be able to capture more complex patterns in the functional data and improve the accuracy of the functional regression model, even when the number of basis functions exceeds the number of observation points or the sample size. This further motivates the use of excess basis functions in functional regression settings.

4. Numerical Experiments

4.1. SonF Regression

As discussed at the end of Section 3.1, the accuracy of our predictions in SonF regression can be influenced by the various interrelationships among the sample size $N$, the number of observation points $M$, and the number of basis functions $K$. We investigated the prediction performances for scenarios (A) and (B) as described in Section 3.1. Table 1 summarizes the simulation settings. Although multiple criteria have been devised for basis selection, we conduct experiments with the number of bases selected through five-fold cross-validation (CV, Stone, 1974), selected by corrected AIC (cAIC, Sugiura, 1978; Bedrick and Tsai, 1994), and fixed at a value of $50$. Note that when cAIC is used, the error terms of the regression model are assumed to be independent Gaussian.
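For readers unfamiliar with the CV-based selection mentioned above, a generic five-fold procedure can be sketched as follows. This is not the authors' code: `cv_select`, `fit_predict`, and the toy model that uses the first $k$ columns of a design matrix are hypothetical names and choices made for illustration; in the paper's setting the candidates would be numbers of basis functions.

```python
import numpy as np

def cv_select(candidates, fit_predict, X, y, n_folds=5, seed=0):
    """Return the candidate with the smallest K-fold cross-validated MSE.

    fit_predict(k, X_train, y_train, X_valid) -> predictions for X_valid
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    scores = []
    for k in candidates:
        errs = []
        for held_out in folds:
            train = np.setdiff1d(np.arange(len(y)), held_out)
            pred = fit_predict(k, X[train], y[train], X[held_out])
            errs.append(np.mean((y[held_out] - pred) ** 2))
        scores.append(np.mean(errs))
    return candidates[int(np.argmin(scores))], np.array(scores)

def first_k_columns(k, X_tr, y_tr, X_va):
    """Toy model: minimum-norm least squares on the first k columns."""
    coef = np.linalg.pinv(X_tr[:, :k], rcond=1e-10) @ y_tr
    return X_va[:, :k] @ coef

# Toy data in which only the first three columns carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 12))
y = X[:, :3] @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=40)
best_k, scores = cv_select(list(range(1, 13)), first_k_columns, X, y)
```

Because CV scores only held-out prediction error, it places no restriction on the candidate being smaller than the sample size, which is why it can select a value beyond the interpolation threshold.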

Table 1. Summary of simulation settings and representations for scalar-on-function regression.

| Symbol | Description | Scenario (A) | Scenario (B) |
|---|---|---|---|
| $N$ | Size of training dataset | Variable | Fixed ($50$) |
| $N_{\text{test}}$ | Size of test dataset | Fixed ($150$) | Fixed ($150$) |
| $M$ | Number of measurements for $x$ | Fixed ($75$) | Variable |
| $K$ | Number of bases for $x$ | Variable | Variable |

Scenario (A)

Consider the situation where the number of observation points $M$ is larger than the sample size $N$, discussed in Section 3.1. First, we present the data-generating process. The functions $x_{i}(s)$ and $\beta(s)$ are produced by Gaussian processes (GPs) with the radial basis function kernel (RBF, Rasmussen and Williams, 2006) $k(x_{1},x_{2})=\theta^{2}\exp(-\|x_{1}-x_{2}\|^{2}/h^{2})$, whose hyperparameters are set to $(\theta,h)=(10,10)$ and $(15,10)$, respectively. The generated $x_{i}(s)$ are then centered to have a mean of $0$. We then generate $y_{i}$ by adding standard normal noise to the integral of the product of $x_{i}(s)$ and $\beta(s)$. The observation vectors $\{\bm{x}_{i}\}$ are derived by selecting $M=75$ random points from the functions and adding standard normal noise $N(\bm{0},I_{M})$. We set the training data size to $N=5$, $10$, and $20$.
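The data-generating process just described can be sketched in Python as follows. Several details are assumptions made only for this illustration because the text does not specify them: the domain $[0,100]$, the grid of 200 points, centering by subtracting the pointwise sample mean, and the use of observation points shared across curves. Function names are hypothetical.

```python
import numpy as np

def rbf_kernel(s, theta, h):
    """k(s1, s2) = theta^2 * exp(-(s1 - s2)^2 / h^2) on a 1-D grid."""
    d = s[:, None] - s[None, :]
    return theta**2 * np.exp(-(d**2) / h**2)

def sample_gp(s, theta, h, n, rng):
    """Draw n zero-mean GP paths on grid s via an eigendecomposition."""
    vals, vecs = np.linalg.eigh(rbf_kernel(s, theta, h))
    root = vecs * np.sqrt(np.clip(vals, 0.0, None))  # root @ root.T ~ kernel matrix
    return rng.standard_normal((n, s.size)) @ root.T

def simulate_sonf(N, M=75, n_grid=200, seed=0):
    rng = np.random.default_rng(seed)
    s = np.linspace(0.0, 100.0, n_grid)        # assumed domain, not stated in the text
    X = sample_gp(s, 10.0, 10.0, N, rng)       # (theta, h) = (10, 10)
    X -= X.mean(axis=0)                        # assumed: center at each grid point
    beta = sample_gp(s, 15.0, 10.0, 1, rng)[0] # (theta, h) = (15, 10)
    ds = s[1] - s[0]
    w = np.full(n_grid, ds)                    # trapezoid quadrature weights
    w[[0, -1]] = ds / 2.0
    y = (X * beta) @ w + rng.standard_normal(N)   # integral plus N(0, 1) noise
    idx = np.sort(rng.choice(n_grid, size=M, replace=False))  # assumed shared points
    X_obs = X[:, idx] + rng.standard_normal((N, M))
    return s[idx], X_obs, y

s_obs, X_obs, y = simulate_sonf(N=10)
```

The eigendecomposition is used instead of a Cholesky factor because RBF kernel matrices on fine grids are numerically close to singular.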

For each $N$, we used the above procedure to generate $50$ datasets, each with $N$ observations as a training set and $150$ data points as a test set, and then analyzed each dataset using natural splines (Wood, 2017; R Core Team, 2024) and (7). Specifically, for $N$ observations, we calculated (7), varying the number of bases $K$ from $4$ to $50$. To assess the performance of the model, we computed the mean squared error (MSE) of the predictions from the true signal for the $150$ test data and analyzed the changes in MSE as $K$ increased.
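The evaluation loop described above can be sketched as follows. This is a simplified stand-in rather than the paper's procedure: it swaps natural splines for a cosine basis, assumes that (7) takes the pseudoinverse form $Z^{\dagger}\bm{y}$ by analogy with Equation (10), and uses toy curves instead of GP draws. All names are hypothetical.

```python
import numpy as np

def cosine_basis(s, K):
    """Hypothetical basis: cos(k*pi*s), k = 0..K-1, evaluated at s in [0, 1]."""
    return np.cos(np.pi * np.outer(s, np.arange(K)))   # (len(s), K)

def cosine_gram(K):
    """Gram matrix of that basis on [0, 1]: diag(1, 1/2, ..., 1/2)."""
    g = np.full(K, 0.5)
    g[0] = 1.0
    return np.diag(g)

def design(X_obs, s_obs, K):
    """Minimum-norm basis coefficients of each curve, times the Gram matrix."""
    P_pinv = np.linalg.pinv(cosine_basis(s_obs, K), rcond=1e-10)  # (K, M)
    return X_obs @ P_pinv.T @ cosine_gram(K)                      # (N, K)

def mse_curve(X_tr, y_tr, X_te, signal_te, s_obs, Ks):
    """Test MSE against the noise-free signal for each number of bases K."""
    out = []
    for K in Ks:
        # Assumed form of (7): minimum-norm least squares on the design matrix.
        b = np.linalg.pinv(design(X_tr, s_obs, K), rcond=1e-10) @ y_tr
        out.append(np.mean((design(X_te, s_obs, K) @ b - signal_te) ** 2))
    return np.array(out)

# Toy data (not the paper's GP setting): curves in the span of 6 cosines.
rng = np.random.default_rng(0)
M, N, N_test = 75, 10, 150
s_obs = np.sort(rng.uniform(0.0, 1.0, M))
beta_coef = rng.normal(size=6)

def make(n):
    A = rng.normal(size=(n, 6))
    signal = A @ cosine_gram(6) @ beta_coef            # int x_i(s) beta(s) ds
    X_obs = A @ cosine_basis(s_obs, 6).T + rng.normal(size=(n, M))
    return X_obs, signal

X_tr, sig_tr = make(N)
y_tr = sig_tr + rng.normal(size=N)
X_te, sig_te = make(N_test)
Ks = list(range(4, 51))
mse = mse_curve(X_tr, y_tr, X_te, sig_te, s_obs, Ks)
```

Plotting `mse` against `Ks` shows how the test error behaves around $K=N$ for this toy configuration; the shape of that curve depends on the basis, the noise level, and the random seed.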

Table 2. MSEs of scalar-on-function regressions for different basis selection methods, averaged over $50$ simulated datasets.

| Method | (A) $N=5$ | (A) $N=10$ | (A) $N=20$ | (B) $M=5$ | (B) $M=10$ | (B) $M=20$ |
|---|---|---|---|---|---|---|
| CV | 21.992 | 9.650 | 8.654 | 27.549 | 18.852 | 4.833 |
| Fixed | 39.805 | 20.473 | 8.853 | 27.549 | 18.874 | 5.120 |
| cAIC | 54.452 | 22.387 | 8.950 | 28.608 | 18.882 | 5.490 |
Figure 2. Left: MSE for varying number of bases ($K$) and sample size ($N$) in Scenario (A). Right: Box plots showing the number of bases selected by AIC and 5-fold cross-validation (CV) in Scenario (A).

The left panel in Figure 2 illustrates, for one representative dataset, how the number of bases $K$ affects predictions when $M$ is large. Initially, the MSE increases rapidly as $K$ approaches the sample size $N$; however, it peaks and begins to decrease when $K$ becomes larger than $N$, exhibiting the double-descent phenomenon. Next, observe the quantitative evaluation in Table 2, whose entries represent the average values of the MSEs over $50$ datasets. Note that since cAIC assumes a situation where the degrees of freedom are smaller than $N$, the optimal number of basis functions selected is found before the peak. However, the prediction accuracy of the predictor with a fixed number of basis functions ($K=50$) is superior to the case in which the number of basis functions is selected using cAIC. For CV, which solely considers the goodness of fit of the predictions, the prediction accuracy after the peak is better than before the peak, as can be seen in the right panel of Figure 2. These findings indicate that choosing a number of basis functions that is larger than the sample size is preferable in this scenario.

Figure 3. Left: MSE for varying numbers of bases ($K$) and observation points ($M$) in Scenario (B). Right: Box plots showing the number of bases selected by AIC and 5-fold cross-validation (CV) in Scenario (B).

Scenario (B)

Next, we focus on the situation where the number of observation points $M$ is smaller than the sample size $N$. The functions $\bm{x}_{i}$ and $\beta$ and the response $y_{i}$ were generated in the same manner as in the previous scenario. In this setting, we generated $N=50$ data points for the training dataset and $150$ data points for the test dataset, with $M$ taking on the values $5$, $10$, and $20$.

We produced $50$ datasets through the above procedure and analyzed each. For each value of $M$, in each of the datasets, we trained the parameters using (7), with $K$ natural spline bases ($K$ varied from $4$ to $50$), on the training data and then calculated the MSE on the test data to examine how the MSE values changed as the number of basis functions $K$ increased.

The results are displayed in Figure 3 and Table 2, where the reported values are averaged over the $50$ datasets. In this scenario, the rank of the design matrix in (6) is low, which implies that the degrees of freedom of the model remain unchanged even as the number of basis functions increases. The right panel in Figure 3 shows that CV did not choose an excess number of bases. As a result, the observed MSEs ceased to decrease at around $K=M$, suggesting that increasing the number of basis functions beyond this point is not particularly advantageous. Hence, if the number of observation points restricts the expressive power of the regression model, the double-descent phenomenon does not occur.
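The rank argument can be checked numerically. The sketch below assumes that the design matrix is built from minimum-norm basis coefficients computed from $M$ observation points per curve, so that its rank cannot exceed $\min(N,M,K)$; the cosine basis, the random placeholder observations, and the variable names are hypothetical.

```python
import numpy as np

def cosine_basis(s, K):
    """Hypothetical basis: cos(k*pi*s), k = 0..K-1, evaluated at s in [0, 1]."""
    return np.cos(np.pi * np.outer(s, np.arange(K)))

rng = np.random.default_rng(0)
N, M = 50, 10                          # many curves, few observation points
s_obs = np.linspace(0.05, 0.95, M)
X_obs = rng.normal(size=(N, M))        # placeholder observations

ranks = {}
for K in (5, 10, 20, 40):
    P = cosine_basis(s_obs, K)                         # (M, K) basis matrix
    W = X_obs @ np.linalg.pinv(P, rcond=1e-10).T       # (N, K) coefficients
    ranks[K] = int(np.linalg.matrix_rank(W, tol=1e-8))
# The rank of W is bounded by min(N, M, K), so it stops growing once K >= M.
```

Since any design matrix formed by multiplying `W` with a fixed $K\times K$ matrix inherits this bound, adding bases beyond $M$ cannot add degrees of freedom in this setting.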

These simulation studies demonstrate the potential benefits of using excess basis functions in SonF regression when the number of observation points is sufficiently large (Scenario A). The double-descent phenomenon is clearly observed, with the prediction accuracy improving as the number of basis functions increases beyond the sample size. However, when the number of observation points is limited (Scenario B), increasing the number of basis functions beyond the number of observation points does not lead to further improvements in prediction accuracy, and the double-descent phenomenon is not observed. These findings highlight the importance of considering the interplay between the sample size, the number of observation points, and the number of basis functions when applying scalar-on-function regression in practice. The use of excess basis functions, combined with the minimum norm interpolator, can be a valuable approach for improving prediction accuracy in scenarios where the number of observation points is sufficiently large relative to the sample size.

4.2. FonF Regression

As discussed in Section 3.2, the prediction accuracy in FonF regression is influenced by the interplay between the sample size $N$, the number of observation points for the predictor and response functions ($M_{1}$ and $M_{2}$), and the number of basis functions for the predictor and response functions ($K_{1}$ and $K_{2}$). We now demonstrate Scenarios (C) and (D) through the following numerical experiments. The settings are summarized in Table 3.

Table 3. Summary of simulation settings and symbols for function-on-function regression.

| Symbol | Description | Scenario (C) | Scenario (D) |
|---|---|---|---|
| $N$ | Size of training dataset | Fixed ($50$) | Variable |
| $N_{\text{test}}$ | Size of test dataset | Fixed ($150$) | Fixed ($150$) |
| $M_{1}$ | Number of measurements for $x$ | Fixed ($75$) | Fixed ($75$) |
| $M_{2}$ | Number of measurements for $y$ | Variable | Fixed ($75$) |
| $K_{1}$ | Number of bases for $x$ | Fixed ($10$) | Variable |
| $K_{2}$ | Number of bases for $y$ | Variable | Fixed ($10$) |

Scenario (C)

Here, we investigate the relationship between $K_{2}$ (number of basis functions for the response function $y$) and $M_{2}$ (number of observation points for $y$). We consider the scenario where both the predictor $x$ and the response $y$ are functions. Specifically, we sampled $x$ from a GP whose kernel is an RBF having hyperparameters $(\theta,h)=(10,10)$ and centered it to be zero-mean. For every $t$, we sampled $\beta(\cdot,t)$ from a GP with an RBF kernel having hyperparameters $(\theta,h)=(15,10)$. The true response function was generated by integrating the product of $\beta(s,t)$ and $x_{i}(s)$ as in (8), and the observations $\{\bm{y}_{i}\}$ were given by adding standard normal noise to $M_{2}$ points extracted from the function. Moreover, the observation vectors $\{\bm{x}_{i}\}$ were derived by randomly selecting $M_{1}=75$ points from the functions and adding standard normal noise. For each of $M_{2}=5$, $10$, and $20$, we generated $N=50$ observations as a training set and $150$ observations as a test set.

For each value of $M_{2}$, we generated $50$ datasets using the above procedure and analyzed each dataset using natural splines and (10) on the training sample of size $N$, fixing $K_{1}$ at $10$ and varying $K_{2}$ from $4$ to $50$. We then examined the relationship between the number of basis functions $K_{2}$ of the response function and the MSE for the test data.

Table 4. MSEs of function-on-function regressions for different basis selection methods, averaged over $50$ simulated datasets.

| Method | (C) $M_{2}=5$ | (C) $M_{2}=10$ | (C) $M_{2}=20$ | (D) $N=5$ | (D) $N=10$ | (D) $N=20$ |
|---|---|---|---|---|---|---|
| CV | 9.263 | 8.652 | 10.744 | 8.021 | 5.257 | 3.500 |
| Fixed | 9.976 | 9.018 | 10.881 | 8.033 | 5.399 | 3.502 |
| cAIC | 337.620 | 99.037 | 11.761 | 18.746 | 9.602 | 8.214 |
Figure 4. Left: MSE for varying numbers of bases for the response ($K_{2}$) and varying numbers of measurements for the response ($M_{2}$) in Scenario (C). Right: Box plots showing the number of bases selected by AIC and 5-fold cross-validation (CV) in Scenario (C).

Figure 5. Left: MSE for varying numbers of bases for the predictor ($K_{1}$) and training sample sizes ($N$) in Scenario (D). Right: Box plots showing the number of bases selected by AIC and 5-fold cross-validation (CV) in Scenario (D).

The left panel in Figure 4 illustrates the change in the MSE values with the increasing number of bases for a representative dataset. The results show that the MSE value reaches its maximum when the number of bases of the response function equals the size of the training sample, after which the MSE decreases. As the individual prediction targets are functions, the prediction (interpolation of the predicted function) improves with an increase in the number of bases. Table 4 shows the results when $K_{2}$ is selected by CV and cAIC, respectively. As indicated, the basis selection via cAIC results in poor prediction performance. In particular, the cases $M_{2}=5$ and $M_{2}=10$ fail to predict the response function either because the number of bases is too small to represent the function or because of overfitting. This poor performance can be attributed to the fact that the basis of the response function itself is considered, suggesting that the choice of basis is particularly sensitive in this scenario. In contrast, both CV, which chooses a large number of bases (right panel of Figure 4), and fixing the number of bases at a large value yield good interpolation performance.

Scenario (D)

In this section, we examine the relationship between $K_{1}$ (number of basis functions for the predictor function $x$) and $N$ (sample size). The generating process for the functions $x$, $\beta$, and $y$ in Equation (8) is the same as in the previous section. The observation vectors $\bm{x}_{i}$ and $\bm{y}_{i}$ are both obtained by randomly selecting $75$ points from the functions $x_{i}$ and $y_{i}$, respectively, and adding centered Gaussian noise with unit variance. For $N=5$, $10$, and $20$, we generated $N$ observations as the training set and $150$ observations as the test set.

For each $N$, we generated $50$ datasets and analyzed each one using (10) with natural spline basis functions, varying $K_{1}$ from $4$ to $50$ and fixing $K_{2}=10$. We investigated the relationship between the number of basis functions for the predictor function $K_{1}$ and the MSE for different training sample sizes $N$.

The simulation results are given in Figure 5 and Table 4. The left panel in Figure 5 illustrates the change in MSE with sample size $N$ for a representative dataset. Once again, the double-descent phenomenon is observed in this scenario. This result is essentially the same as in Scenario (A), as it involves the relationship between the sample size and the number of basis functions for the predictor function (although the number of observation points $M_{1}$ must be greater than $N$). The right panel in Figure 5 shows that CV tends to select excess bases; Table 4 confirms that, as before, using a larger number of basis functions results in better prediction accuracy.

These simulation results highlight the benefits of using excess basis functions in the FonF model for both the response function (Scenario C) and the predictor function (Scenario D). The double-descent phenomenon is evident in both scenarios, with the prediction accuracy improving as the number of basis functions increases beyond the sample size or the number of observation points for the response function. These findings underscore the practical importance of considering the interplay between the sample size, number of observation points, and number of bases.

5. Application to real datasets

This section provides examples of the double-descent phenomenon in functional regression, as evidenced by empirical data. We examine Scenario (A) across two commonly used datasets.

5.1. Gasoline Dataset

First, we focused on the “gasoline” dataset, stored in the R language “refund” package (Goldsmith et al., 2024). This dataset comprises octane numbers for 60 gasoline samples and their near-infrared reflectance spectra. The octane number serves as a scalar indicator, quantifying the combustion quality of the gasoline, and the near-infrared reflectance spectra, each measured at 401 wavelengths, reflect the molecular structure of the substance.

In this analysis, following Reiss and Ogden (2007) and Reiss and Ogden (2009), we treated a set of near-infrared reflectance spectra as a functional explanatory variable and considered the problem of predicting the octane number, treated as a response, based on the minimum norm interpolator (7), varying the number of natural spline basis functions. We randomly selected 10 observations as the training data and calculated the MSE of the predictions on the remaining 50 observations, which served as the test data.

The MSEs with varying numbers of basis functions are shown in Figure 6. As can be seen in the figure, the MSE peaks when the number of basis functions equals the size of the training sample ($10$) and then gradually decreases. Notably, when the number of basis functions exceeds 50, the MSE becomes smaller than when fewer basis functions are used. This outcome suggests that leveraging a large number of basis functions can indeed enhance prediction accuracy for real data, as evidenced by the double-descent phenomenon shown here.

Figure 6. Relationship between the number of bases and MSE for gasoline dataset.

5.2. Diffusion Tensor Imaging Dataset

Next, we address the diffusion tensor imaging (DTI) dataset, which is commonly used in functional data analysis and is stored as “DTI” in the R language “refund” package. DTI is a modality based on magnetic resonance imaging (MRI) that allows the diffusion of water in the brain to be tracked. One hundred patients are scanned for DTI approximately once a year and undergo the PASAT (Paced Auditory Serial Addition Test), a neuropsychological test used to assess cognitive function.

Within this framework, following Goldsmith et al. (2011) and Goldsmith et al. (2012), we considered the fractional anisotropy tract profiles of the corpus callosum area (CCA) as a functional explanatory variable to predict the subject’s PASAT score as a response. Although patients may visit the clinic multiple times, each visit is treated as a distinct data point; data with missing values were removed. This resulted in a sample size $N = 334$, with 93 observation points (i.e., $M = 93$) for the explanatory variable CCA. We performed predictions based on (7) with natural spline bases, varying the number of bases. A training sample of size 20 was used, with the remaining 314 observations serving as the test data.
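The data preparation described above (one row per visit, removal of visits with missing values, and a random split into training and test sets) can be sketched as follows. This is a minimal illustration on a toy array, not the authors' preprocessing of the DTI data; the function names `prepare_visits` and `split_indices` and the toy values are hypothetical.

```python
import numpy as np

def prepare_visits(profiles, scores):
    """Keep only visits with a complete tract profile and an observed score."""
    profiles = np.asarray(profiles, dtype=float)
    scores = np.asarray(scores, dtype=float)
    complete = ~np.isnan(profiles).any(axis=1) & ~np.isnan(scores)
    return profiles[complete], scores[complete]

def split_indices(n, n_train, seed=0):
    """Randomly split the indices 0..n-1 into training and test sets."""
    perm = np.random.default_rng(seed).permutation(n)
    return perm[:n_train], perm[n_train:]

# Toy example: 6 visits observed on 4 grid points, with two incomplete visits.
profiles = np.arange(24, dtype=float).reshape(6, 4)
profiles[1, 2] = np.nan                      # visit with a missing profile value
scores = np.array([50.0, 42.0, 38.0, 55.0, np.nan, 47.0])  # visit 4 lacks a score

X, y = prepare_visits(profiles, scores)
train, test = split_indices(len(y), n_train=2, seed=0)
print(X.shape, len(train), len(test))
```

Treating each visit as a separate observation ignores the within-patient dependence; the sketch mirrors that simplification and does nothing to model it.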

As illustrated in Figure 7, double descent is evident: the MSE peaks when the number of basis functions is approximately equal to the training sample size and decreases smoothly thereafter. In this case, the MSE does not decrease as much as in the gasoline dataset, possibly because the functional form of the explanatory variable is simple, so that a few basis functions suffice to represent it. Nevertheless, the double-descent phenomenon clearly occurs, which indicates the risk of the conventional practice of restricting the search to fewer basis functions than the training sample size in order to prevent overfitting.

Figure 7. Relationship between the number of bases and MSE for the DTI dataset.

6. Discussion

This study questions the conventional notion that the number of basis functions should be smaller than the number of observation points and asserts the benefits of considering an excess number of basis functions in FDA. In Section 3, we argued that in functional regression, if one uses a number of basis functions above a certain threshold, the double-descent phenomenon can be observed and better prediction accuracy can potentially be achieved. We demonstrated this phenomenon through numerical experiments and found that optimal prediction accuracy can be realized to the right of the peak of the double-descent curve. Importantly, this phenomenon is not merely the subject of theoretical analysis or numerical experiments but can also be observed in real-world datasets. In both the gasoline and DTI datasets, a clear double descent was observed, with the gasoline dataset producing optimal prediction accuracy beyond the peak. These findings provide valuable guidance for the analysis of functional data, strongly suggesting that when selecting the number of basis functions, one should consider a wider range of possibilities and not be limited by the sample size or the number of observation points.

Future research should extend investigations of the practicality of this phenomenon to different types of datasets and models, including functional time series. Additionally, beyond the minimum norm interpolator, the advantage of excess basis functions may be further supported by ridge regression, although this would require tuning parameter selection. Moreover, the theoretical foundations of the double-descent phenomenon in functional data analysis should be more deeply explored. While this study provided empirical evidence and intuitive explanations, a rigorous mathematical analysis of the conditions under which the phenomenon occurs and its relationship to the properties of the functional data and the chosen basis functions would strengthen the understanding and applicability of our findings.
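The connection between ridge regression and the minimum norm interpolator mentioned above can be illustrated numerically: as the ridge penalty tends to zero, the ridge coefficients approach the minimum-norm solution. The following is a minimal sketch on a random design matrix with more columns than rows; it is not tied to the functional models of this paper, and the function name `ridge_coef` and the matrix sizes are hypothetical.

```python
import numpy as np

def ridge_coef(Z, y, lam):
    """Ridge solution in dual form, Z^T (Z Z^T + lam I)^{-1} y, for lam > 0."""
    n = Z.shape[0]
    return Z.T @ np.linalg.solve(Z @ Z.T + lam * np.eye(n), y)

rng = np.random.default_rng(0)
Z = rng.standard_normal((10, 40))   # 10 observations, 40 basis coefficients
y = rng.standard_normal(10)

b_min = np.linalg.pinv(Z) @ y       # minimum-norm interpolator
for lam in (1.0, 1e-2, 1e-4, 1e-6):
    gap = np.linalg.norm(ridge_coef(Z, y, lam) - b_min)
    print(f"lambda = {lam:g}  distance to minimum-norm solution = {gap:.2e}")
```

The dual form only requires solving a system whose size is the number of observations, which is convenient when the number of basis functions is much larger than the sample size. In practice the penalty would still have to be chosen, for example by cross-validation.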

Computer Programs

The computer programs used in this manuscript to demonstrate the double-descent curve in Section 4 and the application presented in Section 5 have been developed for execution in the R statistical computing environment. These programs are publicly available at the GitHub repository: https://github.com/TomWaka/DD-FDR.

Acknowledgements

T. Wakayama was supported by JSPS KAKENHI (22J21090) and H. Matsui was supported by JSPS KAKENHI (23K11005).

References

  • Akaike [1973] Hirotugu Akaike. Information theory and an extension of the maximum likelihood principle. Second international symposium on information theory, 1:267–281, 1973.
  • Araki et al. [2009] Yuko Araki, Sadanori Konishi, Shuichi Kawano, and Hidetoshi Matsui. Functional regression modeling via regularized Gaussian basis expansions. Annals of the Institute of Statistical Mathematics, 61:811–833, 2009. doi: 10.1007/s10463-007-0161-1.
  • Banerjee and Roy [2014] Sudipto Banerjee and Anindya Roy. Linear Algebra and Matrix Analysis for Statistics. Chapman and Hall/CRC, 1st edition, 2014. doi: 10.1201/b17040.
  • Bartlett et al. [2020] Peter L Bartlett, Philip M Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020. doi: 10.1073/pnas.1907378117. URL https://doi.org/10.1073/pnas.1907378117.
  • Bedrick and Tsai [1994] Edward J. Bedrick and Chih-Ling Tsai. Model Selection for Multivariate Regression in Small Samples. Biometrics, 50(1):226–231, 1994. doi: 10.2307/2533213. URL https://doi.org/10.2307/2533213.
  • Belkin et al. [2018] Mikhail Belkin, Siyuan Ma, and Soumik Mandal. To Understand Deep Learning We Need to Understand Kernel Learning. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 541–549. PMLR, 2018. URL https://proceedings.mlr.press/v80/belkin18a.html.
  • Belkin et al. [2019] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019. doi: 10.1073/pnas.1903070116. URL https://doi.org/10.1073/pnas.1903070116.
  • Belkin et al. [2020] Mikhail Belkin, Daniel Hsu, and Ji Xu. Two Models of Double Descent for Weak Features. SIAM Journal on Mathematics of Data Science, 2(4):1167–1180, 2020. doi: 10.1137/20M1336072. URL https://doi.org/10.1137/20M1336072.
  • Fujii and Konishi [2006] Toru Fujii and Sadanori Konishi. Nonlinear regression modeling via regularized wavelets and smoothing parameter selection. Journal of Multivariate Analysis, 97(9):2023–2033, 2006. doi: 10.1016/j.jmva.2005.12.009. URL https://www.sciencedirect.com/science/article/pii/S0047259X06000856.
  • Goldsmith et al. [2011] Jeff Goldsmith, Jennifer Bobb, Ciprian M Crainiceanu, Brian Caffo, and Daniel Reich. Penalized Functional Regression. Journal of Computational and Graphical Statistics, 20(4):830–851, 2011. doi: 10.1198/jcgs.2010.10007. URL https://doi.org/10.1198/jcgs.2010.10007.
  • Goldsmith et al. [2012] Jeff Goldsmith, Ciprian M Crainiceanu, Brian Caffo, and Daniel Reich. Longitudinal penalized functional regression for cognitive outcomes on neuronal tract measurements. Journal of the Royal Statistical Society: Series C (Applied Statistics), 61(3):453–469, 2012. doi: 10.1111/j.1467-9876.2011.01031.x. URL https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9876.2011.01031.x.
  • Goldsmith et al. [2024] Jeff Goldsmith, Fabian Scheipl, Lei Huang, Julia Wrobel, Chongzhi Di, Jonathan Gellar, Jaroslaw Harezlak, Mathew W. McLean, Bruce Swihart, Luo Xiao, Ciprian Crainiceanu, Philip T. Reiss, and Erjia Cui. refund: Regression with Functional Data, 2024. URL https://CRAN.R-project.org/package=refund. R package version 0.1-35.
  • Green and Silverman [1993] Peter J. Green and Bernard W. Silverman. Nonparametric Regression and Generalized Linear Models: A roughness penalty approach. Chapman and Hall/CRC, 1st edition, 1993. doi: 10.1201/b15710. URL https://doi.org/10.1201/b15710.
  • Hastie and Mallows [1993] Trevor Hastie and Colin Mallows. [A Statistical View of Some Chemometrics Regression Tools]: Discussion. Technometrics, 35(2):140–143, 1993. doi: 10.2307/1269658.
  • Hastie et al. [2009] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, volume 2. Springer New York, 2009. ISBN 978-0-387-84858-7. doi: 10.1007/978-0-387-84858-7. URL https://doi.org/10.1007/978-0-387-84858-7.
  • Hastie et al. [2022] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. The Annals of Statistics, 50(2):949–986, 2022. doi: 10.1214/21-AOS2133. URL https://doi.org/10.1214/21-AOS2133.
  • Horváth and Kokoszka [2012] Lajos Horváth and Piotr Kokoszka. Inference for Functional Data with Applications. Springer New York, 2012. doi: 10.1007/978-1-4614-3655-3. URL https://doi.org/10.1007/978-1-4614-3655-3.
  • Hsing and Eubank [2015] Tailen Hsing and Randall Eubank. Theoretical Foundations of Functional Data Analysis, with an Introduction to Linear Operators. John Wiley & Sons, Ltd, 2015. doi: 10.1002/9781118762547. URL https://doi.org/10.1002/9781118762547.
  • James et al. [2021] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning: with Applications in R. Springer New York, 2021. doi: 10.1007/978-1-0716-1418-1. URL https://doi.org/10.1007/978-1-0716-1418-1.
  • Kokoszka and Reimherr [2017] Piotr Kokoszka and Matthew Reimherr. Introduction to Functional Data Analysis. Chapman and Hall/CRC, 2017. doi: 10.1201/9781315117416. URL https://doi.org/10.1201/9781315117416.
  • Konishi and Kitagawa [1996] Sadanori Konishi and Genshiro Kitagawa. Generalised information criteria in model selection. Biometrika, 83(4):875–890, 1996.
  • Matsui et al. [2009] Hidetoshi Matsui, Shuichi Kawano, and Sadanori Konishi. Regularized functional regression modeling for functional response and predictors. Journal of Math-for-Industry, 1:17–25, 2009.
  • Misiakiewicz and Montanari [2023] Theodor Misiakiewicz and Andrea Montanari. Six lectures on linearized neural networks. arXiv preprint arXiv:2308.13431, 2023.
  • Müller [2005] Hans-Georg Müller. Functional Modelling and Classification of Longitudinal Data. Scandinavian Journal of Statistics, 32(2):223–240, 2005. doi: 10.1111/j.1467-9469.2005.00429.x. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9469.2005.00429.x.
  • R Core Team [2024] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2024. URL https://www.R-project.org/.
  • Ramsay and Dalzell [1991] James O. Ramsay and Catherine J Dalzell. Some Tools for Functional Data Analysis. Journal of the Royal Statistical Society: Series B (Methodological), 53(3):539–561, 1991. doi: 10.1111/j.2517-6161.1991.tb01844.x. URL https://doi.org/10.1111/j.2517-6161.1991.tb01844.x.
  • Ramsay and Silverman [2005] James O. Ramsay and Bernard W. Silverman. Functional Data Analysis. Springer New York, 2nd edition, 2005. doi: 10.1007/b98888. URL https://doi.org/10.1007/b98888.
  • Rasmussen and Williams [2006] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.
  • Reiss and Ogden [2007] Philip T. Reiss and R. Todd Ogden. Functional Principal Component Regression and Functional Partial Least Squares. Journal of the American Statistical Association, 102(479):984–996, 2007. doi: 10.1198/016214507000000527.
  • Reiss and Ogden [2009] Philip T. Reiss and R. Todd Ogden. Smoothing parameter selection for a class of semiparametric linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2):505–523, 2009. doi: 10.1111/j.1467-9868.2008.00695.x. URL https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9868.2008.00695.x.
  • Schaeffer et al. [2024] Rylan Schaeffer, Zachary Robertson, Akhilan Boopathy, Mikail Khona, Kateryna Pistunova, Jason William Rocks, Ila R Fiete, Andrey Gromov, and Sanmi Koyejo. Double Descent Demystified: Identifying, Interpreting & Ablating the Sources of a Deep Learning Puzzle. In The Third Blogpost Track at ICLR 2024, 2024. URL https://openreview.net/forum?id=muC7uLvGHr.
  • Schwarz [1978] Gideon Schwarz. Estimating the Dimension of a Model. The Annals of Statistics, 6(2):461–464, 1978. doi: 10.1214/aos/1176344136.
  • Stone [1974] Mervyn Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society: Series B (Methodological), 36(2):111–133, 1974.
  • Sugiura [1978] Nariaki Sugiura. Further analysis of the data by Akaike’s information criterion and the finite corrections. Communications in Statistics - Theory and Methods, 7(1):13–26, 1978. doi: 10.1080/03610927808827599. URL https://doi.org/10.1080/03610927808827599.
  • Wakayama and Sugasawa [2024] Tomoya Wakayama and Shonosuke Sugasawa. Functional Horseshoe Smoothing for Functional Trend Estimation. Statistica Sinica, 34(3), 2024. doi: 10.5705/ss.202022.0297.
  • Wang et al. [2016] Jane-Ling Wang, Jeng-Min Chiou, and Hans-Georg Müller. Functional data analysis. Annual Review of Statistics and Its Application, 3:257–295, 2016. doi: 10.1146/annurev-statistics-041715-033624. URL https://www.annualreviews.org/content/journals/10.1146/annurev-statistics-041715-033624.
  • Wood [2017] Simon N. Wood. Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC, 2nd edition, 2017. doi: 10.1201/9781315370279. URL https://doi.org/10.1201/9781315370279.
  • Zhang et al. [2021] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021. doi: 10.1145/3446776. URL https://doi.org/10.1145/3446776.