1 Introduction

Evaluations of prospective public health interventions can be strengthened by collecting outcomes from representative samples of the study population. It would not be surprising for there to be spatial patterns of association among individuals sampled from clusters of households in a neighborhood where an intervention was implemented, nor would it be surprising for there to be missing data on key study covariates. This manuscript outlines a strategy for evaluating community-based interventions that draws on frameworks for inference from finite-population sampling, Bayesian analysis of spatial process models, and methods for handling missing data.

Our primary application is provided by a study of a “corner store conversion” intervention implemented in under-resourced areas of Los Angeles (Ortega et al., 2015). In the realm of food purchasing, disparities by income exist in fruit and vegetable (FV) consumption (Grimm et al., 2012), nutrition (Casey et al., 2001), and overall food insecurity (Ribar & Hamrick, 2003; Rose, 1999). These problems are observed in “food swamps”, communities where unhealthy establishments that serve fast food or sell junk food outnumber stores with healthy food options (Rose et al., 2009). Corner store interventions are one public health strategy to change the food environment in the hope of improving eating behaviors at the individual and community level (Langellier et al., 2013). To facilitate improved FV sales, such interventions commonly increase the amount of fresh FVs sold in a store (Langellier et al., 2013), and may provide refrigeration units (Paek et al., 2014), store remodeling (Langellier et al., 2013), cooking demonstrations (Ortega et al., 2015), increased signage (Lawman et al., 2015), and business consulting (Ortega et al., 2015). Among these studies, findings regarding availability and sales of fruits, vegetables, and other healthy foods have been mixed (Albert et al., 2017; Lawman et al., 2015; Paek et al., 2014; Song et al., 2009; Thorndike et al., 2017).

The focus of analysis in these interventions has been the patrons of the corner stores, while few studies have examined the effect of these interventions at the community level. Notably, in such an intervention in two low-income, predominantly Latino communities in California, East Los Angeles and Boyle Heights, Ortega et al. (2016) reported no significant improvements in FV purchasing or consumption. However, one variable of interest, the percentage of annual reported income spent on fruits and vegetables (PIFV), was not fully investigated in earlier reports because of complexities associated with the high rate of missing data on reported income. As many of the intervention components sought to influence the community context, it is important to assess the extent to which any intervention effect was discernible, with attention to the potential for available data to be incomplete.

Non-response of household income is a common occurrence in survey research (Schenker et al., 2006; Watson & Starick, 2011; Yan et al., 2010), but any method for handling missing data must address two key challenges. First, there is evidence that reported income is spatially associated in neighborhoods (Breau et al., 2018; Chakravorty, 1996). One approach to account for this is to employ spatial process modeling (Banerjee et al., 2014; Cressie & Wikle, 2011; Ripley, 2004), embedded within a Bayesian inference framework, where inferences flow from averaging over (i.e., carrying out iterative-simulation-based numerical integration applicable to) joint distributions of observable values and unobserved parameters that encode conditional-independence assumptions in a generic framework such as

$$\begin{aligned}{}[\text{ data, } \text{ process, } \text{ parameters}] = [\text{ data }\mid \text{ process}] \times [\text{ process }\mid \text{ parameters}] \times [\text{ parameters}]\; . \end{aligned}$$
(1)

Here, the data are assumed to be a partial realization of a Gaussian stochastic process, where the covariance between elements is defined by \(C(d_{ab})\), a function of the distance \(d_{ab}\) between any two locations \(\varvec{\ell }_a\) and \(\varvec{\ell }_b\). While there are many valid ways to represent such structure, a flexible choice is the Matérn family (Rasmussen & Williams, 2006) of functions, defined as \(C(d_{ab}) = \sigma ^2 + \delta ^2\) if \(d_{ab}= 0\) and \(C(d_{ab}) = \frac{\delta ^2}{2^{\nu -1}\Gamma (\nu )}\big (\sqrt{2\nu }d_{ab}\phi \big )^\nu K_\nu \big (\sqrt{2\nu }d_{ab}\phi \big )\) if \(d_{ab} > 0\), where \(K_{\nu }(\cdot )\) is the modified Bessel function of the second kind. Here \(\nu \) is a smoothness parameter, \(\sigma ^2\) describes the variation due to measurement error, \(\delta ^2\) measures the spatial variance, and \(\phi \) is a decay parameter that determines the rate of decline in spatial association. The exponential function, \(C(d_{ab}) = \delta ^2 \mathrm{{exp}}(-\phi d_{ab})\) if \(d_{ab} > 0\), is the special case obtained when \(\nu = 1/2\). Unlike the small area estimation literature (Clayton & Kaldor, 1987; Ghosh & Rao, 1994; Rao, 2003), where the sampling units are regions such as counties, states or census-tracts, spatial process models consider quantities that, at least conceptually, exist in a continuum over the entire domain.
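
As a minimal sketch of the exponential special case (\(\nu = 1/2\)), the following Python function builds the covariance matrix for a set of distinct locations. The function name, argument names, and the use of Euclidean distance are illustrative assumptions, not details taken from the study.

```python
import math

def exponential_cov(coords, sigma2, delta2, phi):
    """Exponential covariance matrix for distinct locations (hypothetical helper).

    Off-diagonal entries are delta2 * exp(-phi * d_ab); entries with zero
    distance are sigma2 + delta2, following the definition of C(d_ab).
    Euclidean distance between coordinate tuples is an assumption.
    """
    n = len(coords)
    C = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            d = math.dist(coords[a], coords[b])
            if d == 0.0:
                C[a][b] = sigma2 + delta2
            else:
                C[a][b] = delta2 * math.exp(-phi * d)
    return C

# Two locations 5 units apart, with illustrative parameter values.
C = exponential_cov([(0.0, 0.0), (3.0, 4.0)], sigma2=0.5, delta2=2.0, phi=0.1)
```

A larger \(\phi \) makes the off-diagonal entries decay faster with distance, which is the sense in which \(\phi \) governs the rate of decline in spatial association.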

A recent application of Bayesian spatial techniques to high-dimensional survey data is given by Bradley et al. (2015), who employed a multivariate spatio-temporal mixed effects model to examine differences in monthly income by gender, finding that men have larger average incomes in multiple industries, with the largest differences in the finance and insurance fields. Such models achieve dimension reduction by applying Moran’s I basis functions to the spatio-temporal setting and outperform univariate spatial models in mean-squared prediction error. Bradley et al. (2016) extended this model by developing a hierarchical Bayesian approach to survey fusion, assuming a latent process shared by each survey dataset. Using data from the American Community Survey and Local Area Unemployment Statistics, they demonstrated higher precision in estimates of unemployment compared to analyzing each dataset separately. Bradley et al. (2016) also developed a Bayesian technique to account for a spatial change of support in count data and incorporate known survey variances.

A second challenge is that non-response to income questions might depend on underlying income values and associated demographic characteristics. Greenlees et al. (1982), David et al. (1986) and Riphahn and Serfling (2005) all noted evidence from population-based surveys that individuals with higher incomes were less likely to respond, although in surveys of lower income communities, it is plausible that the direction of the association between income and non-response would be reversed. As we suspect our outcome is spatially associated, however, we turn to the recent literature regarding preferential sampling to better understand this problem.

First described in Diggle et al. (2010), preferential sampling refers to sampling in which the probability of selection on a spatial domain increases as a function of the intensity of the measurement. Diggle and colleagues present a joint model in which the selection sites and the measured values arise from the same spatial process. Pati et al. (2011) present a model for preferential sampling in a fully Bayesian framework by including a function of intensity as a predictor of the outcome to account for informative sampling. In addition, preferential sampling has been shown to lead to biased predictions (Gelfand et al., 2012; Lee et al., 2015) and biased parameter estimates (Antonelli et al., 2016). In our corner-store scenario, we consider “preferential” response, in which the probability of a spatially associated variable being reported is dependent on the underlying value of that variable.

Finally, as we are interested in estimating the average percentage of income spent on fruits and vegetables for all individuals in a community, we examine the problem from a finite-population perspective, considering those who reported income to be the sampled or observed cases. Finite-population survey sampling (Cochran, 1977; Hartley & Sielken, 1975; Horvitz & Thompson, 1952; Royall, 1970) considers sampling designs in the statistical modeling and inference on finite populations. Bayesian models (Ericson, 1969; Gelman, 2007; Ghosh & Meeden, 1997) can incorporate aspects of study design and often perform better with small datasets while yielding results similar to design-based results in large datasets (Little, 2004). Estimation of finite-population quantities within spatial process settings has not received much attention in the literature. Recent work includes a method for block kriging which connects geostatistical models and classical design-based sampling (Hoef, 2002), a spline-based estimator of the mean for samples drawn from a spatially-correlated population (Cicchitelli & Montanari, 2012), and the use of a linear spatial interpolator to create a design-based predictor of values at unobserved locations (Bruno et al., 2013). Chan-Golston et al. (2020) demonstrated that accounting for both design and spatial association in a two-stage sampling context led to better model fit and better coverage of the finite-population parameters.

The rest of the paper is organized as follows: Sect. 2 elaborates on data collected during the corner-store intervention described in Ortega et al. (2016) and provides an in-depth explanation of the income non-response by community, Sect. 3 presents a Bayesian framework that accommodates preferential non-response, and Sect. 4 examines a simulation study of the proposed framework. Section 5 presents a data analysis to assess the extent of any intervention effect on the percentage of income spent on fruits and vegetables utilizing model-based finite-population estimates of the outcome of interest both pre-intervention and post-intervention. The paper concludes with a discussion in Sect. 6.

2 Motivating application

Supported by NIH center-grant funding focused on reducing population-based health disparities, and with input from a consultant who had experience with previous corner-store conversions in Northern California, researchers identified 4 pairs of corner stores in the East Los Angeles and Boyle Heights communities of Los Angeles to compare the active intervention with a control intervention. The active intervention of “corner-store conversion” included a reorganization of store items to promote healthy food purchasing, an external transformation of the store, a social marketing campaign and cooking demonstrations put on by local youth, connections to local wholesale markets, and refrigeration units (Ortega et al., 2016). Both the active and control interventions provided training to improve bookkeeping and accounting. A more detailed review of the study design and implementation is provided by Ortega et al. (2015). To assess the potential effects of this intervention, a survey was given to residents within a specified radius of each of the eight corner stores. This community survey sought to extensively catalog the food purchasing of residents, including where they shopped, what types of food they bought, and who was being supported by their food purchases. As such, the survey was directed to adults who were identified as the main food purchaser of the family. Many other items were also collected, including demographic characteristics, health problems, family history of residency, and government food program participation (such as the Supplemental Nutrition Assistance Program and the Special Supplemental Program for Women, Infants, and Children). This survey was conducted in each of the eight communities surrounding the store (generally a 2–3 block radius) before the conversion and then again roughly one year after the conversion. 
There were 1035 observations collected at baseline and 1052 observations collected at follow-up, with approximately 60% of the individuals surveyed at baseline surveyed again at follow-up.

While there is a strong interest in describing PIFV in each community, the sample had high levels of missingness in income (one-third) at both baseline and follow-up, which are presented in Table 1. Notably, PIFV is highest on average at baseline in Communities 1 and 7 (26.0% and 46.5%, respectively), which also had lower levels of response and income than the overall averages. With high levels of non-response, it is important to know whether this value is being inflated by the missing values of income. In addition, while the number of sampled units ranged from 114 to 143, the percentage of missingness ranged widely from 4.9% to 66.6%. For this paper, we consider the sampled data to be the finite population of eight communities in East Los Angeles and Boyle Heights. This is a reasonable assumption, as the response rates of 80% and 71% at baseline and follow-up suggest that a majority of individuals in these communities are represented in this dataset. Amount spent on fruits and vegetables was reported on a weekly, bi-weekly, or monthly scale. These values were multiplied by 52, 26, and 12, respectively, to reflect the annual amount spent on fruits and vegetables in a household. Reported yearly income is continuous and ranged from $0 to $300,000. Twenty-four individuals reported a higher amount spent on fruits and vegetables than their income, so their income was set to the amount spent on FV, so that PIFV was no more than 100% and no annual income was equal to 0. Both annual income and annual amount spent on FV were log-transformed to produce a more normal distribution of the outcome. The analysis was restricted to cases with no missing covariates and with a recorded amount spent on FV. This resulted in a final dataset with 982 observations at baseline and 1033 at follow-up.
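
The preprocessing described above can be sketched as follows. This is an illustration under stated assumptions: the function names, the string labels for the reporting scale, and the use of `None` for unreported income are hypothetical and do not come from the study's code.

```python
# Multipliers stated in the text: weekly x52, bi-weekly x26, monthly x12.
PERIODS_PER_YEAR = {"weekly": 52, "biweekly": 26, "monthly": 12}

def annual_fv(amount, scale):
    """Convert a reported FV amount to an annual amount."""
    return amount * PERIODS_PER_YEAR[scale]

def pifv(annual_fv_spend, annual_income):
    """Percent of annual income spent on FV, or None if income is unreported.

    When reported income is below FV spending, income is set to the FV
    amount, so that PIFV never exceeds 100 and income is never 0.
    """
    if annual_fv_spend <= 0:
        raise ValueError("a positive recorded FV amount is assumed")
    if annual_income is None:
        return None
    income = max(annual_income, annual_fv_spend)
    return 100.0 * annual_fv_spend / income

# A household spending $50 per week on FV with $26,000 annual income.
spend = annual_fv(50, "weekly")
share = pifv(spend, 26000)
```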

Table 1 Annual income and FV expenditures by site and time-point

Other individual-level variables that were hypothesized to affect PIFV were age at time of interview, gender, household size, marital status (collapsed into a binary classification distinguishing other possibilities from being in a marriage or marriage-like relationship), and education level (collapsed into a binary classification of at least a high-school education or less than a high-school education). Due to the homogeneity of ethnicity in the sample, Latino ethnicity was not considered in the analyses. Summary statistics of these covariates by time-point are presented in Table 2.

Table 2 Description of the sample data by time-point

Individual locations (addresses) were provided and geographic coordinates were assigned to each address. As there were multiple apartment complexes in these communities, individuals living in different units of the same complex were assigned the same geographic coordinates. Thus, among the 8 communities, there were 635 identified locations. At baseline, 518 of these locations were observed, 366 of these locations had at least one individual who reported their income, and on average 1.90 individuals shared the same location. At follow-up, 562 of these locations were observed, 472 of these locations had at least one individual who reported their income, and on average 2.38 individuals shared the same location. Considering both time-points, 555 locations had at least one reported income.

Fig. 1 Variograms of income (log-scale), amount spent on FV, and PIFV

Variograms of the PIFV outcome, amount spent on FV, and log-income were constructed. All variograms suggested evidence of spatial association, as shown in Fig. 1. To explore our primary outcome and determine if there is any evidence of preferential response in income, a linear model predicting the log-percent of income spent on fruits and vegetables was first fit using the previously described covariates, as well as indicators for time-point, intervention status, and the interaction of these two indicators. For individuals who did not report income, predictions of this log-percent were made using the results of the linear model. Dividing the reported amount spent on fruits and vegetables by this predicted percent yields a prediction of income for the non-respondents. Then, a logistic regression model predicting the response of income was fit with an intercept term and income (either the reported value for those that responded or the predicted value from the linear model for those that did not respond). This model found income to be significantly associated with the probability of response. An observed coefficient estimate of 0.12 (SE = 0.05) suggests that individuals with higher values of income are more likely to report income and, conversely, that lower incomes in these communities are more likely to go unreported. Further, a logistic regression model with random intercepts for location was fit and the standard deviation corresponding to the random intercept was 0.67.

Fig. 2 Linear interpolation plots from full simulated data and 3 scenarios

3 Representing spatial structure in finite-population inference

3.1 A general framework

Formally, define a spatial domain \(\pmb {\mathscr {L}} \subseteq \pmb {{\mathfrak {R}}}^2\), where a finite population of size T is located in N locations, \(\pmb {\mathscr {L}}_{FP} = \{\varvec{\ell }_1, \dots , \varvec{\ell }_N\}\), \(T \ge N\). Suppose there are \(M_i\) units at the \(i^{th}\) location, hence \(T = \sum _{i=1}^N M_i\). Further, suppose that t, \(t \le T\), units are sampled from the finite population and thus n, \(n \le N\), locations are represented in this sample. Taking the first n locations to be sampled, define the sampled and nonsampled location sets as \(\pmb {\mathscr {L}}_s = \{\varvec{\ell }_1, \dots , \varvec{\ell }_n\}\) and \(\pmb {\mathscr {L}}_{ns} = \{\varvec{\ell }_{n+1}, \dots , \varvec{\ell }_N\}\), respectively. In addition, denoting by \(m_i\) the number of sampled units at the \(i^{th}\) location, \(i = 1, \dots , N\), we have that \(t = \sum _{i=1}^N m_i = \sum _{i=1}^n m_i\), as \(m_i = 0\) for \(i = n+1, \dots , N\). In the context of our data, we have that \(T=2015\), \(t=1294\), \(N=635\), and \(n=555\). We are interested in measuring annual reported income on the natural log-scale, \({\mathbf {y}}\), which is a vector of sampled and nonsampled measurements, i.e., \({\mathbf {y}} = [{\mathbf {y}}_s^{\top }, {\mathbf {y}}_{ns}^{\top }]^{\top }\). Denoting \(y_{j}(\varvec{\ell }_i)\) as the annual income on the natural log scale of the \(j^{th}\) individual at the \(i^{th}\) location, let \({\mathbf {y}}_s = [y_1(\varvec{\ell }_1), \dots , y_{m_1}(\varvec{\ell }_1), \dots , y_1(\varvec{\ell }_n), \dots , y_{m_n}(\varvec{\ell }_n)]^{\top }\) and \({\mathbf {y}}_{ns} = [y_{m_1 + 1}(\varvec{\ell }_1), \dots , y_{M_1}(\varvec{\ell }_1), \dots , y_{m_N + 1}(\varvec{\ell }_N), \dots , y_{M_N}(\varvec{\ell }_N)]^{\top }\). In addition, let \(z_j(\varvec{\ell }_i)\) be the reported amount spent on fruits and vegetables on the natural log-scale corresponding to \(y_j(\varvec{\ell }_i)\).
This is measured for all members of the finite population and, therefore, vectors \({\varvec{z}}_s\) and \({\varvec{z}}_{ns}\), defined in the same manner as \({\varvec{y}}_s\) and \({\varvec{y}}_{ns}\), denote reported values of FV expenditures corresponding to individuals who reported and did not report income, respectively. We examine the log-percent of income spent on fruits and vegetables, which can be written as \({\mathbf {z}} - {\mathbf {y}}\), by modeling \({\mathbf {y}}\) with an offset term of \({\mathbf {z}}\). Assume that there is a Gaussian spatial process, \(\omega (\cdot )\), defined on \(\pmb {\mathscr {L}}\) with covariance function \(K_{\omega }(d)\), and that \({\mathbf {y}}\) is a partial realization of this process. Finally, define the inclusion mechanism as a spatial process on \(\pmb {\mathscr {L}}\), which is dependent on \({\mathbf {y}}\) and another Gaussian spatial process, \(\upsilon (\cdot )\), defined on the same domain with covariance function \(K_{\upsilon }(d)\). A joint model defined in the form of our generic spatial paradigm (1) is

$$\begin{aligned}{}[y(\cdot ) \mid \omega (\cdot )] \times [I(\cdot ) \mid y(\cdot ), \upsilon (\cdot )] \times [\omega (\cdot )] \times [\upsilon (\cdot )] \end{aligned}$$
(2)

The first component of (2) is the conditional distribution of \({\mathbf {y}}\), \([y(\cdot ) \mid \omega (\cdot )]\). Assuming \({\mathbf {y}}\) is a \(T \times 1\) vector, this conditional distribution can be written as

$$\begin{aligned} y_j(\varvec{\ell }_i) = z_j(\varvec{\ell }_i) - {\mathbf {x}}_j(\varvec{\ell }_i)^{\top }\varvec{\beta } + \omega (\varvec{\ell }_i) + \epsilon _j(\varvec{\ell }_i) \, ; \quad \epsilon _j(\varvec{\ell }_i) {\mathop {\sim }\limits ^{iid}} N(0,\sigma ^2)\,. \end{aligned}$$
(3)

Here \(i = 1, \dots , 635\), \(j = 1, \dots , M_i\), and \(\varvec{\epsilon }\sim N({\mathbf {0}},\varvec{\Sigma }_\epsilon )\), where \(\varvec{\Sigma }_\epsilon = \sigma ^2{\mathbf {I}}\). Each \(\epsilon _j(\varvec{\ell }_i)\) corresponds to \(y_j(\varvec{\ell }_i)\) and \(\varvec{\epsilon }\) is defined in the same manner as \({\mathbf {y}}\). Similarly, define the covariates corresponding to the jth unit at the ith location as \({\mathbf {x}}_j(\varvec{\ell }_i)\). Here each \(10 \times 1\) vector \({\mathbf {x}}_j(\varvec{\ell }_i)\) corresponds to the outcome \(y_j(\varvec{\ell }_i)\). This covariate vector is paired with the \(10 \times 1\) coefficient vector \(\varvec{\beta }\), \(\varvec{\beta } \sim N({\mathbf {0}}, \varvec{\Sigma }_\beta )\), and includes an intercept term, gender, household size, relationship status, age, age\(^2\), time-point, intervention status, and an interaction between intervention status and time-point. Following the notational convention of \({\mathbf {y}}_s\) and \({\mathbf {y}}_{ns}\), define the \(2015 \times 10\) matrix \({\mathbf {X}} = [{\mathbf {X}}_s^{\top }, {\mathbf {X}}_{ns}^{\top }]^{\top }\) as the collection of covariates from sampled and nonsampled individuals, where \({\mathbf {X}}_s = [{\mathbf {x}}_1(\varvec{\ell }_1), \dots , {\mathbf {x}}_{m_1}(\varvec{\ell }_1), \dots , {\mathbf {x}}_1(\varvec{\ell }_n), \dots , {\mathbf {x}}_{m_n}(\varvec{\ell }_n)]^{\top }\) and \({\mathbf {X}}_{ns}= [{\mathbf {x}}_{m_1+1}(\varvec{\ell }_1), \dots , {\mathbf {x}}_{M_1}(\varvec{\ell }_1), \dots , {\mathbf {x}}_{m_N+1}(\varvec{\ell }_N), \dots , {\mathbf {x}}_{M_N}(\varvec{\ell }_N)]^{\top }\). In addition, note that because the offset \(z_j(\varvec{\ell }_i)\) is placed on the right-hand side of this equation, we subtract the \({\mathbf {x}}_j(\varvec{\ell }_i)^{\top }\varvec{\beta }\) term to improve interpretation.
In this way, a positive component in \(\varvec{\beta }\) corresponds to a positive increase in \({\mathbf {z}} - {\mathbf {y}}\), our outcome of interest.

Spatial variation is accounted for with the \(635 \times 1\) vector \(\varvec{\omega } \sim N({\mathbf {0}}, \mathbf {\Sigma }_\omega )\), where \(\mathbf {\Sigma }_\omega \) is a \(635 \times 635\) matrix defined by the covariance function \(K_\omega (d)\). Finally, construct a \(2015 \times 635\) site indicator matrix \({\mathbf {A}} = [{\mathbf {A}}_s^{\top },{\mathbf {A}}_{ns}^{\top }]^{\top }\), where \({\mathbf {A}}_s = [\oplus _{i=1}^{555} {\mathbf {1}}_{m_i} : {\mathbf {0}}]\) and \({\mathbf {A}}_{ns} = [\oplus _{i=1}^{635} {\mathbf {1}}_{M_i - m_i}]\) (\(\oplus \) denotes the matrix direct sum, which arranges its arguments block-diagonally). Thus the row in \({\mathbf {A}}\) corresponding to measurement \(y_j(\varvec{\ell }_i)\) has value 1 in the \(i^{th}\) column and 0 elsewhere. We then have that \({\mathbf {y}} \sim N({\mathbf {z}} - {\mathbf {X}}\varvec{\beta } + {\mathbf {A}}\varvec{\omega }, \varvec{\Sigma }_\epsilon )\).
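
To make the construction of \({\mathbf {A}}\) concrete, the sketch below builds a site indicator matrix from per-location unit counts. The helper name `site_indicator` and the toy counts are hypothetical; in the study the counts would be \(m_i\) for \({\mathbf {A}}_s\) and \(M_i - m_i\) for \({\mathbf {A}}_{ns}\).

```python
def site_indicator(counts):
    """Return a 0/1 matrix with one row per unit and one column per location.

    counts[i] is the number of units at location i; a location with a count
    of 0 contributes an all-zero column, which plays the role of the zero
    padding in A_s.
    """
    n_loc = len(counts)
    rows = []
    for i, c in enumerate(counts):
        for _ in range(c):
            row = [0] * n_loc
            row[i] = 1
            rows.append(row)
    return rows

# Toy population: N = 3 locations, M = (3, 1, 2) units, m = (2, 1, 0) sampled.
M = [3, 1, 2]
m = [2, 1, 0]
A_s = site_indicator(m)
A_ns = site_indicator([M_i - m_i for M_i, m_i in zip(M, m)])
A = A_s + A_ns  # stacked as [A_s; A_ns], with T = 6 rows
```

Multiplying \({\mathbf {A}}\) by \(\varvec{\omega }\) then simply repeats each location's spatial effect once for every unit at that location.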

The second component of (2), \([I(\cdot ) \mid y(\cdot ), \upsilon (\cdot )]\), describes the response mechanism. Here the \(T \times 1\) vector \({\mathbf {I}}\) has element \(I_j(\varvec{\ell }_i) = 1\) if the corresponding jth individual in the ith location reported their income, i.e., \(y_j(\varvec{\ell }_i)\) is observed, and \(I_j(\varvec{\ell }_i) = 0\) if they did not report their income. This can be expressed as

$$\begin{aligned} I_j(\varvec{\ell }_i) \sim \mathrm{{Ber}}(\pi _j(\varvec{\ell }_i)) \, ; \quad \mathrm{{logit}}(\pi _j(\varvec{\ell }_i)) = y_j(\varvec{\ell }_i)\eta _{y} + {\mathbf {q}}_j(\varvec{\ell }_i)^{\top } \varvec{\eta } + \upsilon (\varvec{\ell }_i). \end{aligned}$$
(4)

The probability of response for each individual in the finite population is permitted to vary by its corresponding value of y, which is captured in the regression coefficient \(\eta _y\), \(\eta _y \sim N(0, \sigma ^2_{\eta _y})\). Similar to our modeling of the outcome, \({\mathbf {q}}_j(\varvec{\ell }_i)\) is a \(2 \times 1\) vector composed of an intercept term and age, which corresponds to a \(2 \times 1\) vector of coefficients \(\varvec{\eta }\), \(\varvec{\eta } \sim N({\mathbf {0}}, \varvec{\Sigma }_\eta )\). Additional spatial variability in the probability of inclusion is accounted for with \(\varvec{\upsilon }\), \(\varvec{\upsilon } \sim N({\mathbf {0}}, \mathbf {\Sigma }_\upsilon )\), where \(\mathbf {\Sigma }_\upsilon \) is a \(635 \times 635\) matrix and is defined by covariance function \(K_\upsilon (d)\).
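
The response model (4) can be illustrated with a short sketch that evaluates the response probability and the Bernoulli log-likelihood contribution of one individual. The function names and all numeric values below are hypothetical and chosen only for illustration.

```python
import math

def response_prob(y, q, eta_y, eta, upsilon):
    """Inverse-logit of y * eta_y + q' eta + upsilon, as in (4)."""
    lin = y * eta_y + sum(q_k * e_k for q_k, e_k in zip(q, eta)) + upsilon
    return 1.0 / (1.0 + math.exp(-lin))

def bernoulli_loglik(indicator, pi):
    """Log-likelihood contribution of one response indicator (0 or 1)."""
    return math.log(pi) if indicator == 1 else math.log(1.0 - pi)

# With eta_y > 0, a higher log-income implies a higher response probability.
q = [1.0, 40.0]       # intercept and age (illustrative values)
eta = [-1.0, 0.0]     # illustrative coefficients
p_low = response_prob(9.0, q, eta_y=0.12, eta=eta, upsilon=0.0)
p_high = response_prob(11.0, q, eta_y=0.12, eta=eta, upsilon=0.0)
```

Setting \(\eta _y = 0\) removes the dependence on \(y\), which corresponds to a response mechanism that is not preferential.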

In addition, we take the two processes, \(\omega \) and \(\upsilon \), to be independent. Collecting additional variance parameters in \(\varvec{\theta }\), the joint posterior distribution of (2) is proportional to

$$\begin{aligned} \begin{aligned} p(\varvec{\omega }, \varvec{\upsilon }, \varvec{\theta },&\varvec{\beta }, \varvec{\eta }, \eta _y, {\mathbf {y}}_{ns} \vert {\mathbf {y}}_s, {\mathbf {I}}) \\&\propto p(\varvec{\theta }) \times N(\varvec{\omega } \vert {\varvec{0}}, \varvec{\Sigma }_\omega ) \times N(\varvec{\upsilon } \vert {\varvec{0}}, \varvec{\Sigma }_\upsilon ) \times N(\varvec{\beta } \vert {\varvec{0}}, \varvec{\Sigma }_\beta ) \\&\quad \times N(\varvec{\eta } \vert {\varvec{0}}, \varvec{\Sigma }_\eta ) \times N(\eta _y \vert 0, \sigma _{\eta _y}^2) \times \prod _{i=1}^N \prod _{j=1}^{M_i} \mathrm{{Ber}}(I_j(\varvec{\ell }_i) \vert \pi _j(\varvec{\ell }_i)) \\&\quad \times \prod _{i=1}^N \prod _{j=1}^{M_i} N(y_j(\varvec{\ell }_i) \vert z_j(\varvec{\ell }_i) - {\mathbf {x}}_j(\varvec{\ell }_i)^{\top }\varvec{\beta } + \omega (\varvec{\ell }_i), \sigma ^2) \; , \end{aligned}\nonumber \\ \end{aligned}$$
(5)

where

$$\begin{aligned} \begin{aligned} \mathrm{{Ber}}(I_j(\varvec{\ell }_i) \vert \pi _j(\varvec{\ell }_i))&= \left( \frac{\mathrm{{exp}}[ y_j(\varvec{\ell }_i)\eta _{y} + {\mathbf {q}}_j(\varvec{\ell }_i)^{\top } \varvec{\eta } + \upsilon (\varvec{\ell }_i)]}{1 + \mathrm{{exp}}[ y_j(\varvec{\ell }_i)\eta _{y} + {\mathbf {q}}_j(\varvec{\ell }_i)^{\top } \varvec{\eta } + \upsilon (\varvec{\ell }_i)]}\right) ^{I_j(\varvec{\ell _i})} \\&\quad \times \left( \frac{1}{1+\mathrm{{exp}}[ y_j(\varvec{\ell }_i)\eta _{y} + {\mathbf {q}}_j(\varvec{\ell }_i)^{\top } \varvec{\eta } + \upsilon (\varvec{\ell }_i)]}\right) ^{1 - I_j(\varvec{\ell _i})}. \end{aligned} \end{aligned}$$

3.2 MCMC estimation strategy

Markov chain Monte Carlo (MCMC) is used to sample from (5). Gibbs updates can be employed for \(\varvec{\beta }\) and \(\varvec{\omega }\), whose full conditional distributions are

$$\begin{aligned} \varvec{\beta } \vert \cdot&\sim N\left( \left( \varvec{\Sigma }_\beta ^{-1} + {\mathbf {X}}_s^{\top } \varvec{\Sigma }_\epsilon ^{-1}{\mathbf {X}}_s\right) ^{-1}{\mathbf {X}}_s^{\top } \varvec{\Sigma }_\epsilon ^{-1}({\mathbf {z}}_s - {\mathbf {y}}_s + {\mathbf {A}}_s \varvec{\omega }), \left( \varvec{\Sigma }_\beta ^{-1} + {\mathbf {X}}_s^{\top } \varvec{\Sigma }_\epsilon ^{-1}{\mathbf {X}}_s\right) ^{-1} \right) ; \\ \varvec{\omega }\vert \cdot&\sim N\left( \left( \varvec{\Sigma }_\omega ^{-1} + {\mathbf {A}}_s^{\top }\varvec{\Sigma }_\epsilon ^{-1}{\mathbf {A}}_s\right) ^{-1}{\mathbf {A}}_s^{\top }\varvec{\Sigma }_\epsilon ^{-1}({\mathbf {y}}_s - {\mathbf {z}}_s + {\mathbf {X}}_s \varvec{\beta }), \left( \varvec{\Sigma }_\omega ^{-1} + {\mathbf {A}}_s^{\top }\varvec{\Sigma }_\epsilon ^{-1}{\mathbf {A}}_s\right) ^{-1} \right) , \end{aligned}$$

respectively. The conditional distributions for the remaining parameters are not available in closed form but can be sampled using a Metropolis–Hastings step. Specifically, we have

$$\begin{aligned}&{\mathbf {y}}_{ns} \vert {\mathbf {y}}_{s}, \varvec{\omega }, \varvec{\theta }, \varvec{\beta } \; \propto p(\varvec{\theta }) \times N(\varvec{\omega } \vert {\varvec{0}}, \varvec{\Sigma }_\omega ) \times N(\varvec{\beta } \vert {\varvec{0}}, \varvec{\Sigma }_\beta ) \times \prod _{i=1}^N \prod _{j=1}^{M_i} \mathrm{{Ber}}(I_j(\varvec{\ell }_i) \vert \pi _j(\varvec{\ell }_i)) \\&\times \prod _{i=1}^N \prod _{j=1}^{M_i} N(y_j(\varvec{\ell }_i) \vert z_j(\varvec{\ell }_i) - {\mathbf {x}}_j(\varvec{\ell }_i)^{\top }\varvec{\beta } + \omega (\varvec{\ell }_i), \sigma ^2) \;,\\&\varvec{\eta } \vert {\mathbf {y}}, \varvec{\theta } \propto p(\varvec{\theta }) \times N(\varvec{\eta } \vert {\mathbf {0}}, \varvec{\Sigma }_\eta ) \times \prod _{i=1}^N \prod _{j=1}^{M_i} \mathrm{{Ber}}(I_j(\varvec{\ell }_i) \vert \pi _j(\varvec{\ell }_i)) \; ,\\&\eta _y \vert {\mathbf {y}}_{s}, \varvec{\theta } \propto p(\varvec{\theta }) \times N(\eta _y \vert 0, \sigma _{\eta _y}^2) \times \prod _{i=1}^N \prod _{j=1}^{M_i} \mathrm{{Ber}}(I_j(\varvec{\ell }_i) \vert \pi _j(\varvec{\ell }_i)) \; \text {, and}\\&\varvec{\upsilon } \vert {\mathbf {y}}_{s}, \varvec{\theta } \propto p(\varvec{\theta }) \times N(\varvec{\upsilon } \vert {\varvec{0}}, \varvec{\Sigma }_\upsilon ) \times \prod _{i=1}^N \prod _{j=1}^{M_i} \mathrm{{Ber}}(I_j(\varvec{\ell }_i) \vert \pi _j(\varvec{\ell }_i))\; . \end{aligned}$$
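
As a rough illustration of the conjugate Gibbs step for \(\varvec{\beta }\), the sketch below draws from its Gaussian full conditional using NumPy. It assumes \(\varvec{\Sigma }_\epsilon = \sigma ^2{\mathbf {I}}\) and works directly from model (3), in which \({\mathbf {X}}\varvec{\beta }\) enters with a negative sign, so the working residual is \({\mathbf {z}}_s - {\mathbf {y}}_s + {\mathbf {A}}_s\varvec{\omega }\). The function name, argument names, and simulated inputs are hypothetical.

```python
import numpy as np

def gibbs_beta(y_s, z_s, X_s, A_s, omega, sigma2, Sigma_beta, rng):
    """Draw beta from its Gaussian full conditional; also return mean and cov."""
    # From (3): z - y + A omega = X beta - eps, so this residual regresses on X.
    resid = z_s - y_s + A_s @ omega
    precision = np.linalg.inv(Sigma_beta) + X_s.T @ X_s / sigma2
    cov = np.linalg.inv(precision)
    cov = (cov + cov.T) / 2.0  # guard against round-off asymmetry
    mean = cov @ (X_s.T @ resid) / sigma2
    return rng.multivariate_normal(mean, cov), mean, cov

# Simulated, noise-free check: with a diffuse prior the mean recovers beta.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
A = np.zeros((50, 5))
A[np.arange(50), np.arange(50) % 5] = 1.0  # 10 units at each of 5 locations
omega = rng.normal(size=5)
z = rng.normal(size=50)
beta_true = np.array([0.5, -1.0])
y = z - X @ beta_true + A @ omega
draw, mean, cov = gibbs_beta(y, z, X, A, omega, 1.0, 1e6 * np.eye(2), rng)
```

The update for \(\varvec{\omega }\) has the same structure, with \({\mathbf {A}}_s\) in place of \({\mathbf {X}}_s\), \(\varvec{\Sigma }_\omega \) in place of \(\varvec{\Sigma }_\beta \), and residual \({\mathbf {y}}_s - {\mathbf {z}}_s + {\mathbf {X}}_s\varvec{\beta }\).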

The posterior samples of \({\mathbf {y}}_{ns}\) are then used to obtain posterior finite-population estimates. Specifically, we are interested in the mean income of the finite population, \(\frac{1}{T}\sum _{i=1}^N \sum _{j=1}^{M_i} \mathrm{{exp}}[y_j(\varvec{\ell }_i)]\), and the mean PIFV, \(\frac{1}{T}\sum _{i=1}^N \sum _{j=1}^{M_i} \mathrm{{exp}}[z_j(\varvec{\ell }_i) - y_j(\varvec{\ell }_i)]\). These values are calculated overall, by site, and by time-point.
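
A minimal sketch of how these finite-population quantities could be computed from posterior draws is given below. The function name and the data layout (observed values as flat lists, one list of imputed log-incomes per posterior draw) are assumptions made for illustration.

```python
import math

def fp_means(y_obs, z_obs, y_mis_draws, z_mis):
    """Per-draw finite-population mean income and mean PIFV (as a proportion).

    y_obs, z_obs: log-income and log-FV spending for respondents.
    z_mis: log-FV spending for non-respondents.
    y_mis_draws: posterior draws, each a list of imputed log-incomes
    aligned with z_mis.
    """
    T = len(y_obs) + len(z_mis)
    income_obs = sum(math.exp(y) for y in y_obs)
    pifv_obs = sum(math.exp(z - y) for z, y in zip(z_obs, y_obs))
    results = []
    for draw in y_mis_draws:
        income = income_obs + sum(math.exp(y) for y in draw)
        share = pifv_obs + sum(math.exp(z - y) for z, y in zip(z_mis, draw))
        results.append((income / T, share / T))
    return results

# One respondent (income 100, FV 10) and one non-respondent (FV 20),
# with two posterior draws of the missing income (200 and 400).
est = fp_means([math.log(100)], [math.log(10)],
               [[math.log(200)], [math.log(400)]], [math.log(20)])
```

Summarizing the resulting per-draw values (for example with their mean and quantiles) gives posterior point and interval estimates; restricting the inputs to one site or time-point gives the corresponding subgroup estimates.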

3.3 Alternative models encompassing response mechanism and the extent of spatial variation in data

Four models are considered in the form of (2) and are described below. For these models, regression parameters are considered independent, i.e., \(\varvec{\Sigma }_\beta = \sigma _\beta ^2 {\mathbf {I}}_{10}\) and \(\varvec{\Sigma }_\eta = \sigma _\eta ^2 {\mathbf {I}}_2\), and their associated variance parameters, \(\sigma _\beta ^2\) and \(\sigma _\eta ^2\), are fixed in both the simulation and data analysis. Similarly, \(\sigma _{\eta _y}\) is fixed. The spatial covariance functions are taken to be exponential, as described in Sect. 1.

Model 1. Non-spatial in outcome with ignorable response. This model is a standard linear regression model and, therefore, spatial effects (\(\varvec{\omega }\) and \(\varvec{\upsilon }\)) are fixed at 0. Inclusion parameters are also fixed at 0 so that the probability of inclusion is a fixed number. We take \(\varvec{\theta } = \sigma ^2\) and \(p(\varvec{\theta }) = IG(\sigma ^2 \vert a, b)\).

Model 2. Non-spatial association in outcome with preferential response. Preferential response is now accounted for through \(\eta _y\) but spatial effects are again fixed at 0. Similar to Model 1, \(\varvec{\theta } = \sigma ^2\) and \(p(\varvec{\theta }) = IG(\sigma ^2 \vert a, b)\).

Model 3. Spatial association in outcome with preferential response. This model accounts for spatial association in the outcome but fixes \(\varvec{\upsilon } = {\mathbf {0}}\). Therefore, \(\varvec{\theta } = [\sigma ^2, \delta _\omega ^2, \phi _\omega ]^{\top }\) and \(p(\varvec{\theta }) = IG(\sigma ^2 \vert a, b) \times IG(\delta _\omega ^2 \vert a_\omega , b_\omega ) \times Unif(\phi _\omega \vert c _\omega , d_\omega )\).

Model 4. Spatial association in outcome and probability of inclusion with preferential response. This model expands upon Model 3 by permitting spatial association in the probability of response. We take \(\varvec{\theta } = [\sigma ^2, \delta _\omega ^2, \phi _\omega , \delta _\upsilon ^2, \phi _\upsilon ]^{\top }\) and \(p(\varvec{\theta }) = IG(\sigma ^2 \vert a, b) \times IG(\delta _\omega ^2 \vert a_\omega , b_\omega ) \times Unif(\phi _\omega \vert c _\omega , d_\omega ) \times IG(\delta _\upsilon ^2 \vert a_\upsilon , b_\upsilon ) \times Unif(\phi _\upsilon \vert c _\upsilon , d_\upsilon )\).

3.4 Model comparison and assessment strategy

Model fit was evaluated in two ways. In general, consider a sample of size t drawn from a population of size T with outcome \({\mathbf {y}} = [{\mathbf {y}}_s ^{\top }, {\mathbf {y}}_{ns}^{\top }]^{\top }\). Without loss of generality, say \(y_h \in {\mathbf {y}}_s\) if \(h = 1, \dots , t\) and \(y_h \in {\mathbf {y}}_{ns}\) if \(h = t + 1, \dots , T\). Replicated datasets, \({\mathbf {y}}_\mathrm{{rep}}^{(l)} = [y_{\mathrm{{rep}},1}^{(l)} \dots y_{\mathrm{{rep}},t}^{(l)}]^{\top }\), can be generated from the pointwise posterior predictive distribution at each iteration l. These are used to formulate the predictive model choice criteria:

$$\begin{aligned} D = \sum _{h=1}^t (y_h -\text {E}[y_{\mathrm{{rep}},h} \mid {\mathbf {y}}_s])^2 + \sum _{h=1}^t \mathrm{{var}}(y_{\mathrm{{rep}},h}\mid {\mathbf {y}}_s) \end{aligned}$$

described in Gelfand and Ghosh (1998), and the Gneiting–Raftery Score (Gneiting & Raftery, 2007),

$$\begin{aligned} \mathrm{{GRS}} = -\sum _{h=1}^t \frac{(y_h -\text {E}[y_{\mathrm{{rep}},h} \mid {\mathbf {y}}_s])^2}{\mathrm{{var}}(y_{\mathrm{{rep}},h}\mid {\mathbf {y}}_s)} - \sum _{h=1}^t \text {log}\;\mathrm{{var}}(y_{\mathrm{{rep}},h}\mid {\mathbf {y}}_s) . \end{aligned}$$

In this formulation, lower values of D and higher values of GRS are indicative of better model fit. For L iterations, we approximate \(\text {E}[y_{\mathrm{{rep}},h} \mid {\mathbf {y}}_s] \approx \frac{1}{L}\sum _{l=1}^L y_{\mathrm{{rep}},h}^{(l)}\) and \(\mathrm{{var}}(y_{\mathrm{{rep}},h}\mid {\mathbf {y}}_s) \approx \frac{1}{L-1} \sum _{l=1}^L (y_{\mathrm{{rep}},h}^{(l)} - \frac{1}{L}\sum _{l=1}^L y_{\mathrm{{rep}},h}^{(l)})^2\). For simulated datasets, where \({\mathbf {y}}_{ns}\) is known, these measures can be extended to all observations, i.e., by summing to T instead of t in each score.
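As an illustration of how the two criteria can be computed from posterior predictive draws, the following is a minimal sketch in Python (not the R/JAGS code used for the analyses). The function name `predictive_scores` and the toy inputs are assumptions introduced here; the replicated draws are simply passed in as a list of L lists of length t.

```python
import math
import random
from statistics import mean, variance


def predictive_scores(y_obs, y_rep):
    """Return (D, GRS) from observed values and replicated draws.

    y_obs: list of t observed outcomes.
    y_rep: list of L replicated datasets, each a list of length t.
    """
    d_score = 0.0
    grs = 0.0
    for h, y_h in enumerate(y_obs):
        draws = [rep[h] for rep in y_rep]
        m = mean(draws)        # Monte Carlo estimate of E[y_rep,h | y_s]
        v = variance(draws)    # sample variance with divisor L - 1
        sq_err = (y_h - m) ** 2
        d_score += sq_err + v
        grs += -sq_err / v - math.log(v)
    return d_score, grs


# Toy usage with made-up numbers (hypothetical, not the study data).
rng = random.Random(0)
y_obs = [rng.gauss(10, 1) for _ in range(5)]
y_rep = [[rng.gauss(10, 1) for _ in range(5)] for _ in range(200)]
d_score, grs = predictive_scores(y_obs, y_rep)
print(f"D = {d_score:.2f}, GRS = {grs:.2f}")
```

For simulated data, the same function can be applied to all T units by passing the full outcome vector and the corresponding replicated draws.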

4 Simulation

To examine the ability of the proposed models to capture various sampling schemes, a simplified dataset was simulated and three response scenarios were implemented. For simplicity, in this simulation study, we predict income (on the log-scale) with only the covariates gender and household size for a finite population of size 2000, i.e., \(z_j(\varvec{\ell }_i)\) is fixed at 0 and \( -{\mathbf {x}}_j(\varvec{\ell }_i)\) is replaced by \( {\mathbf {x}}_j(\varvec{\ell }_i)\) for all i and j in (3). For each unit of the population, gender was drawn from a Bernoulli distribution with the probability of female set to 0.8 and household size was drawn from a Poisson distribution with a mean of 4. To induce spatial correlation, a 5 \(\times \) 5 square was created, 500 locations were randomly assigned within the square, and a distance matrix was constructed from these locations. The spatial process parameters were fixed at \(\sigma ^2 = 1\), \(\delta _\omega ^2 = 1\), and \(\phi _\omega = 0.5\). Each unit of the population was randomly assigned to a location, with the requirement that at least one unit was located at each location. Regression parameters were fixed at \(\varvec{\beta } = [\beta _0, \beta _\mathrm{{fem}},\beta _{hhs}]^{\top } = [10, -0.2, 0.1]^{\top }\), to reflect an average income of \(\mathrm{{exp}}(10) \approx \) $22,000 in the reference group, a small average reduction in income for females, and a small average increase in income for larger household sizes. Log-income values were generated from (3).
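The data-generating steps above can be sketched as follows. This is a minimal Python illustration rather than the R code used in the study, and it rests on stated assumptions: log-income is taken to be the linear predictor plus a location-level Gaussian process effect plus independent noise, the exponential covariance is parameterized as \(\delta _\omega ^2 \exp (-\phi _\omega d)\), and all function names are hypothetical. Default sizes are kept far smaller than the 2000 units and 500 locations described in the text so that the pure-Python Cholesky factorization runs quickly.

```python
import math
import random


def poisson(rng, lam):
    """Draw from a Poisson distribution (Knuth's method; fine for small means)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def simulate_population(n_units=200, n_loc=50, side=5.0,
                        beta=(10.0, -0.2, 0.1), sigma2=1.0,
                        delta2=1.0, phi=0.5, p_female=0.8,
                        mean_hhs=4.0, seed=1):
    rng = random.Random(seed)
    # Random locations in a side x side square.
    locs = [(rng.uniform(0, side), rng.uniform(0, side)) for _ in range(n_loc)]
    # Exponential covariance (assumed form): delta2 * exp(-phi * distance).
    cov = [[delta2 * math.exp(-phi * math.dist(a, b)) for b in locs]
           for a in locs]
    for i in range(n_loc):
        cov[i][i] += 1e-10  # small jitter for numerical stability
    low = cholesky(cov)
    z = [rng.gauss(0, 1) for _ in range(n_loc)]
    omega = [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(n_loc)]
    # Every location receives at least one unit; the rest are assigned at random.
    loc_of = list(range(n_loc)) + [rng.randrange(n_loc)
                                   for _ in range(n_units - n_loc)]
    rng.shuffle(loc_of)
    units = []
    for loc in loc_of:
        fem = 1 if rng.random() < p_female else 0
        hhs = poisson(rng, mean_hhs)
        y = (beta[0] + beta[1] * fem + beta[2] * hhs
             + omega[loc] + rng.gauss(0, math.sqrt(sigma2)))
        units.append({"loc": loc, "female": fem, "hhs": hhs, "log_income": y})
    return locs, units


locs, units = simulate_population()
print(len(locs), "locations,", len(units), "units")
```

Passing `n_units=2000` and `n_loc=500` would mirror the sizes described above, at a noticeably higher computational cost in pure Python.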

Three scenarios were considered to reflect possible response scenarios in which there is spatial association in the outcome. In Scenario 1, income is from a spatial process but there is no preferential response. This arises from Model 3, fixing \(\eta _y = 0\) and \({\mathbf {q}} = [1, \dots , 1]^{\top }\). The probability of inclusion was set at 0.5, which is equivalent to fixing \(\eta = 0\). This resulted in a selection of 54% of the simulated data. In the second scenario, income is from a spatial process which is reported preferentially, as described in Model 3. Here, \(\eta _y\) was set to 0.5 and \(\varvec{\eta } = [\eta _0,\eta _\mathrm{{fem}}] = [-4, -1]^{\top }\), to reflect higher odds of response for larger values of income and lower odds of response for women. The choice of these coefficients resulted in 54.15% of the simulated data having income responses. The third scenario considers income as coming from a spatial process whose response is preferential and whose inclusion probability is dependent on another spatial process, as described in Model 4. To reflect this, we set \(\phi _\upsilon = 1.5\) and \(\delta _\upsilon ^2 = 1\); this resulted in responses in 48.1% of the simulated data. All data generation and analyses were performed using R version 3.6.1 (R Core Team, 2018).
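To make the Scenario 2 response mechanism concrete, the sketch below evaluates the response probability under an assumed logistic form, \(\text {logit}(p) = \eta _0 + \eta _\mathrm{{fem}} \times \text {female} + \eta _y \times y\), using the coefficient values given above. The function names are hypothetical and the code is a Python illustration, not the study's R code. At a log-income of exactly 10, this form gives a response probability of 0.5 for women and about 0.73 for men. With 80% of the population female, that is broadly in line with the roughly 54% response rate reported, although the realized rate depends on the simulated incomes.

```python
import math
import random


def response_probability(log_income, female,
                         eta0=-4.0, eta_fem=-1.0, eta_y=0.5):
    """Probability of reporting income under the assumed logistic model."""
    logit = eta0 + eta_fem * female + eta_y * log_income
    return 1.0 / (1.0 + math.exp(-logit))


def draw_responses(log_incomes, females, rng, **eta):
    """Draw 0/1 response indicators, one per unit."""
    return [1 if rng.random() < response_probability(y, f, **eta) else 0
            for y, f in zip(log_incomes, females)]


print(response_probability(10.0, female=1))  # women at log-income 10
print(response_probability(10.0, female=0))  # men at log-income 10

# Scenario 1 analogue: all coefficients zero gives probability 0.5 for everyone.
rng = random.Random(0)
flags = draw_responses([9.5, 10.2, 11.0], [1, 0, 1], rng,
                       eta0=0.0, eta_fem=0.0, eta_y=0.0)
print(flags)
```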

Linear interpolation plots from the full simulated data and the subset data from the three scenarios are shown in Fig. 2. As expected, Scenario 1 (a simple random sample) is the most similar to the full dataset. In the cases of preferential response (Scenarios 2 and 3), the interpolated plots have larger regions of high income than the true dataset. This is most apparent in the western region of the graph, where values below 8 are rare in this instance. Comparing Scenarios 2 and 3, there appears to be some smoothing, with fewer pockets of low income in the west and northeast of the graph, which is due to the spatial association induced on the probability of response in Scenario 3.

Models were run for 10,000 iterations with 1000 burn-in, as examination of individual trace plots suggested sufficient mixing and convergence of the non-spatial parameters. At each iteration g, estimates of the nonsampled units were drawn and estimates for the population mean, \({\bar{y}}^{(g)} = \frac{1}{T}\Big (\sum _{i = 1}^n\sum _{j=1}^{m_i} \mathrm{{exp}}[y_j(\varvec{\ell }_i)] + \sum _{i = 1}^N\sum _{j=m_i + 1}^{M_i} \mathrm{{exp}}[y_j(\varvec{\ell }_i)^{(g)}]\Big )\) were calculated. The variance parameter \(\sigma _\beta ^2\) was fixed at 1000 to reflect an uninformative prior, while the \(\sigma _\eta ^2\) and \(\sigma _{\eta _y}\) terms were fixed at 10 as a weakly informative prior restricting the range of the logistic regression coefficients. The non-spatial, \(\sigma ^2\), and spatial, \(\delta _\omega ^2\) and \(\delta _\upsilon ^2\), variance components were assigned prior distributions of IG(2,10), to reflect a small point mass centered at 10. The spatial range parameters, \(\phi _\omega \) and \(\phi _\upsilon \), were assigned prior distributions of Unif(0.1, 2), to reflect a spatial range of 1.5 (3/2) to 30 (3/0.1). MCMC sampling was performed using the computer program JAGS (Plummer, 2017) in R.
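The per-iteration finite-population mean combines the observed incomes, which stay fixed across iterations, with the values imputed for the nonsampled units at iteration g. A minimal Python sketch of this bookkeeping is given below. The function names, the toy inputs, and the simple empirical-quantile interval are assumptions made for illustration and are not taken from the study's JAGS/R implementation.

```python
import math
from statistics import mean


def finite_population_means(y_sampled, y_nonsampled_draws):
    """Finite-population mean income at each MCMC iteration.

    y_sampled: observed log-incomes (fixed across iterations).
    y_nonsampled_draws: one list of imputed log-incomes per iteration.
    """
    total_units = len(y_sampled) + len(y_nonsampled_draws[0])
    observed_total = sum(math.exp(y) for y in y_sampled)
    return [(observed_total + sum(math.exp(y) for y in draw)) / total_units
            for draw in y_nonsampled_draws]


def empirical_interval(draws, level=0.95):
    """Simple equal-tailed interval from sorted posterior draws."""
    s = sorted(draws)
    last = len(s) - 1
    lower = s[math.floor((1 - level) / 2 * last)]
    upper = s[math.ceil((1 + level) / 2 * last)]
    return lower, upper


# Toy usage: two observed units, one nonsampled unit, three iterations.
means = finite_population_means([10.0, 10.5], [[9.8], [10.1], [10.4]])
print(mean(means), empirical_interval(means))
```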

The results of Scenario 1 are presented in Table 3. While the credible intervals for each model contain the true value of regression coefficients for female and household size, as well as the true finite-population mean, the non-spatial models fail to contain the true intercept and the non-spatial variance values in their credible intervals. As expected, both spatial models were able to correctly capture the spatial parameters, \(\phi _\omega \), and \(\delta _\omega ^2\), for the outcome. In addition, the coefficients \(\eta _{0}\) and \(\eta _{y}\) are small and have credible intervals containing 0 for Models 2–4, which suggests that these models correctly demonstrate no evidence of preferential response. The response-level spatial parameters in Model 4 also suggest no evidence of spatial variability, as the credible interval of \(\phi _\upsilon \) is nearly the same range as the prior distribution given and the spatial variance, \(\delta _\upsilon ^2\), is very close to 0. In addition, the fit of Model 4 is negligibly poorer than Model 3, as there is no spatial association in the probability of response.

Table 3 Simulation results of scenario 1: spatial outcome, random response

The results of Scenario 2, which examines a preferential response of a spatially associated outcome, are given in Table 4. Importantly, unlike Scenario 1, the two non-spatial models fail to capture the true finite-population mean of 9.94 within their 95% credible intervals. This is also true of the intercept term, \(\beta _0\), and non-spatial variance, \(\sigma ^2\), although we expect \(\sigma ^2\) to be larger, as it is absorbing the variability in the outcome attributed to spatial association. Model 1 also incorrectly provides a positive estimate of \(\beta _\mathrm{{fem}}\) whose credible interval does not contain the true value of \(-0.2\). Moreover, while Models 2–4 provide similar estimates of \(\eta _\mathrm{{fem}}\), Model 2 fails to capture the true values of \(\eta _{0}\) and \(\eta _{y}\) in its credible intervals, unlike the two spatial models. Possibly due to the poor modeling of income, Model 2 spuriously concludes that there is no evidence of preferential sampling. Finally, as in Scenario 1, both spatial models have similar estimates and correctly capture the spatial parameters \(\phi _\omega \) and \(\delta _\omega ^2\). In Model 4, even though \(\phi _\upsilon \) varies, it estimates very small values of \(\delta _\upsilon ^2\), which correctly suggests little evidence of spatial association in the probability of response. The model fit statistics both slightly favor Model 3 over Model 4, due to the lack of response-level spatial association, and prefer the spatial to the non-spatial models.

Table 4 Simulation results of scenario 2: spatial outcome, preferential sampling

When incorporating spatial association into the probability of income response, seen in Table 5, Model 4 outperforms the other three models in terms of model fit by correctly accounting for this additional association in the logistic regression component of the model. As before, non-spatial models have poorer model fit and larger estimates of the non-spatial variance term. Unlike Models 2–4, Model 1 fails to include the true finite-population mean in its credible interval, which may be attributable to a disregard for the preferential response. As in Scenario 2, Model 1 incorrectly provides a positive estimate of \(\beta _\mathrm{{fem}}\), and all models except Model 2 contain the true intercept in their credible intervals. Models 2–4 each correctly capture the logistic regression coefficients, \(\eta _0\), \(\eta _\mathrm{{fem}}\), and \(\eta _y\). In addition, the spatial models provide reasonable estimates of \(\phi _\omega \) and \(\delta _\omega ^2\), and in the case of Model 4, \(\phi _\upsilon \) and \(\delta _\upsilon ^2\).

Table 5 Simulation results of scenario 3: spatial outcome, non-ignorable sampling, spatial inclusion

5 Data analysis

5.1 Implementation

As before, Models 1–4 were implemented using JAGS (Plummer, 2017) in R and run for 10,000 iterations with 1000 burn-in, as examination of individual trace plots suggested sufficient mixing and convergence of the non-spatial parameters. At each iteration g, the finite-population mean income, \({\bar{y}}^{(g)} = \mathrm{{exp}}\left[ \frac{1}{T}\Big (\sum _{i = 1}^n\sum _{j=1}^{m_i} y_j(\varvec{\ell }_i) + \sum _{i = 1}^N\sum _{j=m_i + 1}^{M_i} y_j(\varvec{\ell }_i)^{(g)}\Big )\right] \), and the finite-population mean percentage of income spent on fruits and vegetables, \({\bar{y}}^{(g)} = \frac{1}{T}\Big (\sum _{i = 1}^n\sum _{j=1}^{m_i} \mathrm{{exp}}\left[ z_j(\varvec{\ell }_i) - y_j(\varvec{\ell }_i)\right] + \sum _{i = 1}^N\sum _{j=m_i + 1}^{M_i} \mathrm{{exp}}\left[ z_j(\varvec{\ell }_i) - y_j(\varvec{\ell }_i)^{(g)}\right] \Big )\), were calculated using estimates of the nonsampled units drawn at that iteration. The variance parameter \(\sigma _\beta ^2\) was fixed at 1000 to reflect a diffuse prior, while the \(\sigma _\eta ^2\) and \(\sigma _{\eta _y}\) terms were fixed at 0.68 as a weakly informative prior restricting the range of the exponentiated logistic regression coefficients to between \(\frac{1}{5}\) and 5. The non-spatial, \(\sigma ^2\), and spatial, \(\delta _\omega ^2\), variance components were assigned prior distributions of IG(2,10) and IG(2,2), respectively, to reflect small point masses centered at 10 and 2. The prior for \(\delta _\upsilon \) was taken to be a uniform distribution ranging from 0 to 0.75, so that the standard deviation reported in Sect. 2 is included in this range. This tight prior was found to improve convergence in the other logistic regression parameters. The spatial range parameters, \(\phi _\omega \) and \(\phi _\upsilon \), were assigned prior distributions of Unif(0.1, 2), to reflect a spatial range of 1.5 (3/2) to 30 (3/0.1).
Computation times on a 2018 MacBook Pro laptop were negligible for Models 1 and 2 but were approximately 15 h for Model 3 and 24 h for Model 4.
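The prior settings in Sect. 5.1 can be checked with a few lines of arithmetic, sketched here in Python. Two readings are assumed and are not stated explicitly in the text: that 0.68 is a prior variance for a zero-mean normal coefficient, and that the stated range of 1/5 to 5 refers to a central 95% prior interval on the odds-ratio scale. Under those assumptions the interval is approximately (0.20, 5.03). The effective range of 3/\(\phi \) follows from the exponential correlation \(\exp (-\phi d)\) dropping to about 0.05 at that distance. The function names are hypothetical.

```python
import math


def prior_odds_ratio_interval(prior_variance, z=1.96):
    """Central 95% prior interval for exp(coefficient), coefficient ~ N(0, variance)."""
    sd = math.sqrt(prior_variance)
    return math.exp(-z * sd), math.exp(z * sd)


def effective_range(phi):
    """Distance at which exp(-phi * d) falls to exp(-3), roughly 0.05."""
    return 3.0 / phi


low, high = prior_odds_ratio_interval(0.68)
print(f"odds-ratio interval: ({low:.2f}, {high:.2f})")
print(effective_range(2.0), effective_range(0.1))
print(math.exp(-3.0))  # correlation remaining at the effective range
```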

Table 6 Results of regression models predicting percentage of income spent on fruits and vegetables (log-scale)

5.2 Results

The results of this analysis are presented in Table 6. Notably, there is no evidence of an intervention effect on PIFV in any of the models, as the coefficient \(\beta _\mathrm{{treat*follow}}\) is small and all credible intervals contain 0. A beneficial intervention effect would have appeared as a larger positive coefficient. This finding supports previous findings of no community-level changes as reported in Ortega et al. (2016). The four models yield comparable estimates of all \(\beta \) regression coefficients, so the following interpretations are based on Model 4. All else equal, spending on fruits and vegetables by men was estimated to be 58% (exp(\(-\) 0.54)) of the corresponding spending by women. Larger reported households were associated with higher amounts of household income spent on fruits and vegetables, with PIFV multiplicatively increasing by 15% for every additional household member. Spending on fruits and vegetables by food purchasers who reported having less than a high-school education was estimated to be 1.8 times the spending of those with a high-school diploma or more education. There was a small negative linear effect of age on the outcome, as well as a small positive quadratic term. PIFV was also lower at follow-up, which is consistent with the raw percentages presented in Table 1. No differences were detected for partner status.
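Because the outcome is modeled on the log scale, each coefficient is read as a multiplicative effect after exponentiation. The short Python sketch below reproduces the conversion for the coefficient of \(-0.54\) quoted above. The other two log-scale values are back-calculated here from the reported multipliers (15% and 1.8 times) purely for illustration and are not taken from Table 6; the function names are hypothetical.

```python
import math


def multiplier(coef):
    """Multiplicative effect on the original scale for a log-scale coefficient."""
    return math.exp(coef)


def percent_change(coef):
    """Percentage change on the original scale for a one-unit increase."""
    return 100.0 * (math.exp(coef) - 1.0)


print(round(multiplier(-0.54), 2))      # men relative to women
print(round(math.log(1.15), 3))         # log-scale value implied by +15%
print(round(math.log(1.8), 3))          # log-scale value implied by 1.8 times
print(round(percent_change(math.log(1.15)), 1))
```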

Confirming the preliminary analyses discussed in Sect. 2, all three models that account for preferential response conclude that individuals with larger incomes are more likely to report their income. Models 2–4 agree that age is not associated with the probability of response. Accounting for spatial association in the probability of response also appears to best fit the data, as evidenced by the lowest value of D and highest GRS value. Interestingly, the fit of Model 2 is the poorest (on the GRS scale), suggesting that accounting for preferential sampling while not accounting for spatial association (either at the outcome or response levels) leads to poorer fit. In addition, Model 3 fits more poorly than Model 1 (and Model 2 on the D scale), which suggests that spatial association at the outcome level may have been accounted for with the inclusion of additional covariates.

However, our estimation of the finite-population mean of the percent of income spent on fruits and vegetables is highly model-specific. Most importantly, it is evident that in ignoring the presence of preferential sampling, Model 1 spuriously underestimates this percentage. The reason for this is clearly explained by examining each corresponding model’s finite-population estimate of income. As Model 1 does not account for the fact that individuals with lower incomes are less likely to report their income, there is much less variability in the average income of the community. This leads to a spurious estimate almost $10,000 and 30% larger than the next closest estimate of $29,364.66, given by Model 2. It is important to note that Model 1’s estimates are also much larger than the averages presented in Table 1, while Models 2–4 present credible intervals that contain these values. While it is true that the additional variability from accounting for preferential sampling leads to larger posterior credible intervals, we note that no part of Model 1’s credible interval is contained in the credible interval of any other model. Despite this apparent disagreement, Model 4’s incorporation of spatial association in the response mechanism results in a compromise between Models 1 and 3. This trend is also observed in the finite-population mean fraction, where higher estimated incomes in Model 1 correspond to much lower estimated fractions than in the other models. Based on model fit statistics, we conclude that Model 4 provides the best estimate of the finite-population fraction mean, which is 26%.

In addition, as posterior samples are drawn for all individuals with non-response, finite-population estimates can be constructed for each community at both time-points, which are presented for each model in Table 7. Bolded estimates represent instances where the 95% credible intervals do not include the raw average reported in Table 1. These results underscore the importance of imputation. Models 2–4 show remarkable similarity in these estimates and conclude that the raw data underestimate the percentage of income spent on fruits and vegetables at 6 of the 8 sites at baseline and at all but 1 of the 8 at follow-up. Model 3 additionally identifies site 1 at baseline, but this is not supported by the rest of the models. Even in the case of Model 1, at baseline 3 sites were found to underestimate the percentage and 1 suggested overestimation, and at follow-up, 2 communities were found to underestimate as well. Encouragingly, in all but one case (site 7 at baseline) of the disagreements with the raw data that Model 1 identified, Models 2–4 also identified these cases. In addition, Models 2–4 suggest that the baseline total underestimates the true average and all models agree that the follow-up total is underestimated. Interpolated maps corresponding to these finite-population estimates from Model 4 are presented in Fig. 3.

Fig. 3 Linear interpolation plots of estimated PIFV by site and time-point from Model 4

Table 7 Finite-population estimates and 95% CI of percentage of income spent on fruits and vegetables by community and timepoint

6 Conclusion

This paper presents a new framework to account for data whose outcome is spatially associated and where the probability of response is assumed to be associated with the value of the outcome. We examine the implications of such data for finite-population quantities and demonstrate how to perform Bayesian estimation of these quantities. This work builds on an existing literature in spatial statistics, Bayesian finite-population estimation, and missing data and has a wide range of applications in health, economics, and environmental work.

Specifically, in our data analysis, we find that accounting for spatial association at both the outcome and probability levels provides the best model fit. By accounting for such associations and preferential responses in income, we are more confident in concluding that there was no effect on the percent of income spent on fruits and vegetables at the community level attributable to the corner-store intervention. We were, however, able to more accurately describe the individual communities by estimating finite-population means at each site level. In fact, the finite-population estimates of income that stem from the model ignoring both spatial association and preferential response are substantially larger than those from the other models and are less believable, given the community. This directly contributed to lower estimates of the percent of income spent on fruits and vegetables in these communities, compared to the other models. In future projects in these regions, interventions that focus on FV access and knowledge could target areas with low estimated percentages. In addition, future work can examine ways in which income information can be solicited from lower income neighborhoods and what factors may be driving this non-response (besides the level of income). This work can also assist in more accurate needs assessments of local communities and, therefore, improve the allocation of health resources. Further, as there is interest in estimating intervention effects, new approaches described in the causal inference literature which can account for spatial association (Akbari et al., 2021) may be appropriate.

The literature of Bayesian finite-population estimation in the presence of spatial association is limited and future extensions to the work presented in this paper are numerous. While this model draws on the preferential sampling framework described by Diggle et al. (2010), we examined a missing data case that had similar evidence of preferential response. However, a data analysis implementing this technique on a dataset with preferential sampling from a finite population would be a strong addition to the literature. The authors view the framework discussed in Sect. 3.1 to be flexible enough to allow for other, more complicated sampling schemes as well, although more simulation work would be needed to fully understand the implications of these on finite-population quantities, especially if spatial association is assumed. Further, while a linear relationship between the log-odds of response and income was assumed, other relationships may be considered in future works.

In addition, while the sample size presented in the data analysis of this paper was small, this framework can be extended to account for massive sample sizes. The problem of spatial modeling for big data stems from the inversion of dense covariance matrices, but modern work in covariance approximation has made this feasible. Such techniques include low-rank models, sparsity-inducing processes, and map reducing approaches (see, e.g., Banerjee, 2017; Heaton et al., 2018; Guhaniyogi & Banerjee, 2018; Banerjee, 2020, and references therein).

Further, while the authors have only considered a Gaussian process to describe the outcome variable, this framework could be extended to other processes, such as mixtures of Gaussian processes (Neelon et al., 2014), a generalized Gaussian process (Chan & Dong, 2011), or a spatial Dirichlet process (Gelfand et al., 2005). Extensions to multivariate responses and spatio-temporal data may also prove useful, particularly when examining health outcomes. Finally, learning about spatial difference boundaries (Gao et al., 2022) from finite-population estimates for regionally aggregated health outcomes is attracting growing interest among public health researchers and will be the subject of future investigations.