1. Introduction
Phylogenetic comparative methods (PCMs) have a well-established history of illuminating trait evolution by leveraging the information contained in phylogenetic trees [1]. They have traditionally been employed in the analysis of quantitative trait evolution, a practice deeply ingrained in the literature [2,3,4,5,6]. Increasingly, however, studies in this area concern discrete traits, which are recorded in categorical or count form and span diverse species. The broad applicability of these methods across biological scenarios underscores their importance, not only for specialists but also for a wider community of researchers.
Count data have been successfully used to elucidate a range of biological phenomena, for example, the number of toxicological (functional) activities in snake venom [7]. In that study, Poisson regression models identified diet diversity as a significant predictor of the venom’s functional activity, demonstrating the value of such models in making ecological predictions. Similarly, in a study on Amazonian forest birds, count data played a crucial role in unraveling the relationship between body mass, flight efficiency, diet, and road-crossing frequency [8]. Here, binomial regression models provided valuable insights into the predictors of road-crossing, which serves as a proxy for a bird’s ability to cross habitat gaps, an essential survival skill in the rapidly changing Amazonian landscape. Furthermore, count data in the form of gene copy numbers in yeast species have been used to investigate the relationship between metabolic gene copy number and growth rate. A comparative analysis using generalized estimating equations (GEE) [9] revealed a clear correlation, providing significant insights into yeast ecology. However, the challenge in such studies often lies in the appropriate analysis of count data. Traditional linear regression is ill-suited for such data because it assumes normally distributed residuals, an assumption that count values typically violate and that can therefore lead to misleading results.
Hence, alternative models are needed that adequately account for the specific nature of count data. This is the realm of generalized linear models (GLMs), which includes the phylogenetic Poisson regression [9] and the phylogenetic negative binomial regression developed in this study. Both models respect the count nature of the data but differ in their assumptions. While the Poisson regression model assumes equal mean and variance, the negative binomial regression model can handle overdispersion, where the variance exceeds the mean. Although both models find use in different scenarios, practitioners should be aware that the Poisson regression model can produce inaccurate results if the assumption of equal mean and variance is violated [10]. In such scenarios, the phylogenetic negative binomial regression model is a better alternative, since it offers an extra parameter that lets the variance be adjusted independently of the mean. This flexibility can improve model fit and provide more accurate results.
Furthermore, while the application of the Poisson regression framework is well detailed in previous studies [
9], our work focuses on the novel application of the negative binomial regression model in the context of phylogenetic regressions. The remainder of this study, therefore, seeks to introduce this novel phylogenetic negative binomial regression model, test it rigorously, and demonstrate its utility in analyzing count-dependent variables. We believe that the insights gained from this endeavor will provide a fresh perspective to researchers in trait evolution and related fields, enabling a more comprehensive and nuanced understanding of evolutionary dynamics. We will demonstrate the model’s efficacy through two distinct empirical assessments: an analysis of lizard egg count as it relates to body mass, and an exploration of mammalian litter size influenced by factors such as the number of teats, longevity, body mass, etc. Through these applications, we hope to underscore the model’s utility and contribute to improved methodologies in the study of related species.
The paper is structured as follows. Section 2 outlines our methodology: Section 2.1 discusses regression under a GLM framework, specifically independent Poisson and negative binomial regression, and Section 2.4 elaborates on regression under GEE for phylogenetically dependent data, emphasizing GEE for phylogenetic Poisson and negative binomial regression. Section 3 presents the results of our work, including the simulation outcomes and, in Section 3.2, our empirical studies on lizard egg-laying and mammalian litter sizes. This is followed by the discussion in Section 4 and the conclusion in Section 5. Scripts and relevant files developed for this project can be accessed at https://www.tonyjhwueng.info/phypoinb2reg (accessed on 27 July 2023).
2. Materials and Methods
We present the regression models used for analyzing count variables. Traditional linear regression methods are often inadequate for count data, primarily because they assume normally distributed residuals. When the species are treated as independent (that is, when evolution is ignored), regression with a count response variable and other covariates is carried out using the GLM described in Section 2.1, where the Poisson regression is described in Section 2.1.1 and the negative binomial regression in Section 2.1.2. Whereas [11] used a single predictor to model the count variable under a negative binomial regression model in a couple of empirical data analyses, our study proposes a general framework for multiple covariates and provides detailed inference. When evolution is treated as a dependent process described by a phylogenetic tree relating the species, regression with a count response variable and other covariates is carried out using generalized estimating equations (GEE) in Section 2.4, where the phylogenetic Poisson regression is described in Section 2.4.1 and the phylogenetic negative binomial regression in Section 2.4.2.
2.1. Applying GLM in Regression Analysis
GLMs are fundamental tools for regression analysis across various scientific fields, including biology. They offer a flexible statistical framework to analyze different types of response variables, making them an invaluable tool in the biological researcher’s toolkit. In the following subsections, we delve into two specific applications of GLMs in the context of biological research: independent Poisson regression and independent negative binomial regression.
Section 2.1.1 elaborates on independent Poisson regression, a method particularly suited to count data, which are frequently encountered in biological studies. Section 2.1.2 then turns to negative binomial regression, a model for count data exhibiting overdispersion, a common phenomenon in biological data. Details of the models’ mathematical formulation can be found in Appendix A.3, specifically in Appendix A.3.1 and Appendix A.3.2.
2.1.1. Independent Poisson Regression in Biology
Biological research often calls for the analysis of count data, be it bacterial colonies in a dish [12], the number of times a gene gets expressed [13], or species enumerated in an ecological survey [14]. A method conducive to such an analysis is Poisson regression, an efficient instrument for evaluating count data [15]. This technique assumes that the response variable y follows a Poisson distribution, which is suitable for count variables, with mean occurrence rate $\lambda$. The probability mass function of the Poisson random variable y is
$$P(Y = y) = \frac{\lambda^{y} e^{-\lambda}}{y!}, \qquad y = 0, 1, 2, \ldots \tag{1}$$
Poisson regression applies a log link function, making it suitable for count data analysis and potentially providing more reliable statistical outcomes [16]. To determine the parameters in a Poisson regression model, one can use maximum likelihood estimation (MLE), employing numerical strategies such as Newton’s method to derive the MLE [17] (see Appendix A.3.1).
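The Newton iteration mentioned above can be sketched in a few lines. The following is a minimal illustration for a single covariate with a log link, not the implementation used in this study; the function name `fit_poisson_newton`, the starting values, and the toy data are assumptions introduced here for illustration.

```python
import math


def fit_poisson_newton(x, y, max_iter=50, tol=1e-10):
    """Fit log(mu_i) = b0 + b1 * x_i by Newton's method (illustrative sketch)."""
    # Assumed starting point: intercept-only fit with zero slope.
    b0 = math.log(sum(y) / len(y))
    b1 = 0.0
    for _ in range(max_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector: gradient of the Poisson log-likelihood.
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for yi, mi, xi in zip(y, mu, x))
        # Information matrix (negative Hessian) under the log link.
        h00 = sum(mu)
        h01 = sum(mi * xi for mi, xi in zip(mu, x))
        h11 = sum(mi * xi * xi for mi, xi in zip(mu, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * delta = g.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


# Toy data (made up for illustration): counts at five covariate values.
x_toy = [0.0, 0.5, 1.0, 1.5, 2.0]
y_toy = [2, 2, 4, 5, 8]
b0_hat, b1_hat = fit_poisson_newton(x_toy, y_toy)
```

At convergence the score equations are satisfied, so the fitted means sum to the observed total count. With more covariates the 2x2 solve would be replaced by a general linear solver.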
2.1.2. Exploiting Negative Binomial Regression for Overdispersion
In biological studies, researchers frequently confront count-based response variables whose variance exceeds the mean. This phenomenon, called overdispersion, suggests an inherent data structure that requires careful modeling. In these instances, the negative binomial regression model becomes an instrumental tool for analysis in various biological fields, with applications such as molecular count data from scRNA-seq experiments [18], weekly dengue haemorrhagic fever cases [19], and the number of fledglings from a nest or inflorescences on a plant [20].
The negative binomial distribution adds a parameter (often denoted as r) that models the overdispersion relative to the Poisson distribution (where the mean equals the variance). This is particularly useful for count data, where the variance is often greater than the mean. The negative binomial model assumes that the response variable follows a negative binomial distribution. The probability mass function of the negative binomial random variable y is
$$P(Y = y) = \binom{y + r - 1}{y}\, p^{r} (1 - p)^{y}, \qquad y = 0, 1, 2, \ldots \tag{2}$$
where $p$ is the probability of success. The model relates the mean response to its predictors through a logarithmic link function, creating a linear relationship with the parameters [17]. Under such a link, a unit change in a predictor multiplies the expected response by a constant factor. Further details on this can be found in Appendix A.3.2.
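To make the overdispersion property concrete, the sketch below evaluates the negative binomial probability mass function and its first two moments. It assumes the common parameterization in which y counts failures before the rth success with success probability p, so that the mean is r(1 - p)/p and the variance is mean + mean²/r; the function names and the values of r and p are illustrative choices, not quantities from this study.

```python
import math


def nb_pmf(y, r, p):
    """P(Y = y) for y failures before the r-th success (r may be non-integer)."""
    log_coef = math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
    return math.exp(log_coef + r * math.log(p) + y * math.log(1.0 - p))


def nb_moments(r, p):
    """Mean and variance implied by (r, p) under the assumed parameterization."""
    mean = r * (1.0 - p) / p
    variance = mean + mean ** 2 / r  # algebraically equal to r * (1 - p) / p**2
    return mean, variance


# Illustrative values only.
r, p = 2.0, 0.25
mean, variance = nb_moments(r, p)
# Numerical check: the pmf sums to one and reproduces the mean.
total = sum(nb_pmf(y, r, p) for y in range(2000))
numeric_mean = sum(y * nb_pmf(y, r, p) for y in range(2000))
```

Because the variance equals the mean plus a positive term that shrinks as r grows, the Poisson case is approached in the limit of large r.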
In
Section 2.2 we provide a preliminary analysis for two empirical datasets using the two count regression models of independent types.
2.2. A Preliminary Analysis
A quick analysis of two empirical datasets using the two GLMs is reported in
Table 1 where two fitted regression models (GLM: Poisson regression model vs. the negative binomial model) for the lizard dataset and the mammal dataset are presented. The response variable for the lizard dataset [
21] is the egg number per year (EPY), with the covariate egg mass (EM) in grams. The response variable for the mammal dataset [22] is the litter number per year (LY), with four covariates: litter body mass (LS), offspring value as per equation (OV), longevity in years (LG), and whether at least one established alien population has successfully spread (Spread).
For the mammalian dataset, the variance () slightly surpasses the mean (), favoring the Poisson regression model, as evidenced by a lower AICc value and a higher weight compared to the negative binomial regression model. In contrast, for the lizard dataset, the variance () significantly exceeds the mean () in egg count per year. This discrepancy favors the negative binomial regression model, which has a lower AICc value and a higher weight compared to the Poisson regression model. This preference for the negative binomial model may be attributed to its unique ability to handle overdispersion, a feature where the phylogenetic negative binomial model particularly excels.
The AICc [23], defined in Equation (3), provides a measure for comparing the quality of different statistical models for a dataset.
$$\mathrm{AICc} = \mathrm{AIC} + \frac{2k(k+1)}{n-k-1} \tag{3}$$
Here, AIC is the Akaike Information Criterion ($\mathrm{AIC} = 2k - 2\log L$), k is the number of parameters, $L$ is the likelihood value computed using the parameter estimates, and n is the taxa size. The Akaike weights $w_i$ for the ith model, which measure the importance of each model in the set of candidate models, are calculated using Equation (4):
$$w_i = \frac{\exp(-\Delta_i/2)}{\sum_{j=1}^{m} \exp(-\Delta_j/2)} \tag{4}$$
where $\Delta_j$ [24] represents the difference in AICc values between model j and the model with the smallest AICc value (the best model among m models) and provides a measure of how much worse model j is compared to the best model. Here,
where
for Poisson regression and
for negative binomial regression.
In this equation, $\Delta_i$ is the difference in AICc values between the ith model and the minimal AICc model. The comparison of fit using the modified Akaike information criterion (AICc) [25] is shown in Table 1, where the two empirical datasets each show a preference for a different model. For the mammal dataset, the response trait (litter number) has a mean
and a variance of
. The Poisson regression model provides a slightly better fit to this dataset. For the lizard dataset, the response trait (egg count per year) has a variance of
and a mean of
. In addition, the regression analysis using covariates: size at maturity, average size, age at maturity, egg mass, clutch size, and clutch mass favors the negative binomial regression model over the Poisson regression model.
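The model comparison described in this subsection can be sketched as follows, assuming the standard small-sample correction AICc = AIC + 2k(k + 1)/(n - k - 1) with AIC = 2k - 2 log L, and Akaike weights proportional to exp(-Δ/2). The log-likelihood values below are hypothetical and are not the values reported in Table 1.

```python
import math


def aicc(log_lik, k, n):
    """Small-sample corrected AIC for k parameters and n taxa."""
    aic = 2 * k - 2 * log_lik
    return aic + 2 * k * (k + 1) / (n - k - 1)


def akaike_weights(aicc_values):
    """Relative support for each candidate model; the weights sum to one."""
    best = min(aicc_values)
    rel = [math.exp(-(value - best) / 2) for value in aicc_values]
    total = sum(rel)
    return [value / total for value in rel]


# Hypothetical fits: a 2-parameter Poisson model and a 3-parameter
# negative binomial model on the same 50 taxa.
n_taxa = 50
scores = [aicc(-120.0, 2, n_taxa), aicc(-117.5, 3, n_taxa)]
weights = akaike_weights(scores)
```

In this made-up case the extra dispersion parameter improves the log-likelihood by enough to outweigh its penalty, so the second model receives the larger weight.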
In
Section 2.3, we introduce phylogenetic trait evolution for both continuous and discrete traits in connection with the count regression models.
2.3. Phylogenetic Trait Evolution
It is widely accepted that, owing to speciation and other evolutionary phenomena, species evolve in a dependent manner along a phylogenetic tree. Regression analysis may therefore be more robust when the tree is incorporated. For instance, a phylogenetic tree containing five taxa
, and
x is presented in
Figure 1.
For the continuous trait evolution shown in the lower right panel of
Figure 1, trajectories are simulated using the tree traversal algorithm under a continuous random process [
26] where five speciation events have occurred in subsequent order, starting at the root (
) and continuing immediately afterward. The observed trait values (comparative data) for these five species, represented by
, and
, are captured at
.
The evolution of these traits can be described using the Brownian motion (BM) model [27]. For example, the trait variable for species v observed at time t is expressed as
. Here,
denotes the ancestral state of species
v,
represents a positive constant parameter, which is the rate of evolution, and
is a Wiener process, a mathematical construct used in the modeling of stochastic processes. Each species is assumed to have the same rate
, for
and possess independent identical Wiener processes
, for
.
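As a hedged sketch of the Brownian motion model just described, the display below uses notation assumed here rather than taken from the original equations: $y_v(t)$ is the trait of species $v$ at time $t$, $\theta$ the ancestral state, $\sigma$ the rate, $W_v(t)$ the Wiener process, and $c_{uv}$ the branch length shared by species $u$ and $v$. Under the usual tree-structured formulation, lineages evolve independently after they split, so covariance between tips arises only from their shared history.

```latex
\begin{aligned}
y_v(t) &= \theta + \sigma\, W_v(t), \\
\mathrm{E}[y_v(t)] &= \theta, \qquad \mathrm{Var}[y_v(t)] = \sigma^{2} t, \\
\mathrm{Cov}[y_u(t),\, y_v(t)] &= \sigma^{2} c_{uv}.
\end{aligned}
```

The last line is what links the tree to the covariance structure used in phylogenetic regression: two tips that share a longer path from the root are more strongly correlated.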
For the count trait evolution shown in the lower-left panel of
Figure 1, the tip values at
denoted as
are assumed to have values
. Alternatively, the sample can be generated through a tree traversal [28]: starting from the root node with a given value, each successive internal node (the circled points in the figure) is simulated as the state of the starting node plus or minus a Poisson random variable whose rate equals the branch length multiplied by the state of that node, where the sign is determined by a Bernoulli trial taking the value 1 or $-1$ with probability drawn from a uniform distribution.
It is well known that the tree can be incorporated into regression analysis for quantitative traits, and many packages have been developed for this purpose [29,30,31,32]. However, since negative binomial regression may be as useful as Poisson regression for analyzing count data in a phylogenetic setting, this work describes the two phylogenetic regression models for count dependent variables more comprehensively, using simulation and empirical analysis. In particular, the
matrix will be used to model the dependence among species in the phylogenetic regression with a count response variable, since the tree can be equivalently transformed into a square matrix
where each element of
measures the shared branch length between the two tips [
33,
34]. For example, the
for the tree in the upper-left panel of
Figure 1 can be represented as in Equation (
5).
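The conversion from a tree to a matrix of shared branch lengths can be sketched as follows. The tree encoding (a dictionary mapping each non-root node to its parent and branch length), the function name `shared_branch_matrix`, and the three-tip example tree are assumptions made for illustration; the example is not the tree shown in Figure 1.

```python
def shared_branch_matrix(parent, tips):
    """Return a matrix whose (i, j) entry is the branch length shared by tips i and j.

    `parent` maps each non-root node to a (parent_node, branch_length) pair.
    """
    def edges_to_root(node):
        # Identify each edge by its child endpoint while walking up to the root.
        edges = []
        while node in parent:
            edges.append(node)
            node = parent[node][0]
        return edges

    paths = {tip: edges_to_root(tip) for tip in tips}
    matrix = []
    for a in tips:
        row = []
        for b in tips:
            # Edges common to both root-to-tip paths form the shared history.
            common = set(paths[a]) & set(paths[b])
            row.append(sum(parent[edge][1] for edge in common))
        matrix.append(row)
    return matrix


# Hypothetical tree ((t1:0.6, t2:0.6):0.4, t3:1.0) with unit height.
toy_parent = {
    "anc": ("root", 0.4),
    "t1": ("anc", 0.6),
    "t2": ("anc", 0.6),
    "t3": ("root", 1.0),
}
C = shared_branch_matrix(toy_parent, ["t1", "t2", "t3"])
```

The diagonal entries equal the root-to-tip distance of each species, and an off-diagonal entry is zero when two tips share only the root.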
The conceptual regression curves shown in the upper-right panel of
Figure 1 using two types of trees and a toy dataset with trait values
for dependent count variable, and
for quantitative covariate trait variable are shown in
Figure 2.
2.4. Leveraging GEEs for Regression Analysis of Phylogenetically Dependent Data
Trait evolution research [
35], a crucial element in evolutionary biology, requires careful consideration of phylogenetic dependencies embedded within count data. A proven technique to handle these dependencies involves embedding a matrix
, extracted from the phylogenetic tree, in the regression model. This crucial integration accommodates species interrelationships, thereby facilitating precise interpretations. Our analysis primarily focuses on two types of regression models, namely Poisson and negative binomial regression, both members of the exponential family whose probability density function can be expressed in Equation (
6) [
36].
GEE have emerged as invaluable tools for applying these models. GEE prescribe a parameterization for
, the distribution parameter of the exponential family, using a link function
that associates the mean function
and the variance function
V of the response variable to the model’s linear predictors. Subsequently, the first two moments of
y (
and
V), are represented through a series of functional relationships that encompass the parameters
,
,
, and
where
where
is a design matrix of
consisting of
(the vector of 1s) and the covariates
[
9]. The final estimation equation for the regression parameter
is obtained by setting the derivative of the
estimating equations shown in Equation (
7) to zero.
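For orientation, the standard GEE formulation on which phylogenetic approaches build can be sketched as below. The notation is assumed here and may differ from that of Equation (7): $\boldsymbol{\mu}$ is the vector of mean responses, $\mathbf{A}$ is the diagonal matrix of variance-function values, $\mathbf{R}$ is the working correlation matrix (in the phylogenetic setting, derived from the tree), and $\phi$ is a dispersion parameter.

```latex
\mathbf{U}(\boldsymbol{\beta})
  = \mathbf{D}^{\top} \mathbf{V}^{-1} (\mathbf{y} - \boldsymbol{\mu}) = \mathbf{0},
\qquad
\mathbf{D} = \frac{\partial \boldsymbol{\mu}}{\partial \boldsymbol{\beta}^{\top}},
\qquad
\mathbf{V} = \phi\, \mathbf{A}^{1/2}\, \mathbf{R}\, \mathbf{A}^{1/2}.
```

When $\mathbf{R}$ is the identity matrix, these equations reduce to the score equations of an ordinary GLM with independent observations.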
In the ensuing subsections, we delve deeper into the application of GEE in the domain of phylogenetic trait evolution analysis. We study it in two contexts: the widely acknowledged phylogenetic Poisson regression model and an emerging model, the phylogenetic negative binomial regression model. Given that these regression models are not extensively examined in the current literature, our efforts aim to illuminate their usage and implications, thereby contributing to a broader understanding of phylogenetic trait evolution. Of particular note is the incorporation of the
matrix into the GEE when solving to obtain the estimators (see Equation (
9) for the Poisson regression case and Equation (11) for the negative binomial case). This integration is key to our models, which exploit the phylogenetic correlation and dependence among species; its advantages are discussed in Appendix A, where we lay out the more intricate mathematical details. The detailed mathematical formulations of these models are provided in Appendix A.4, specifically in Appendix A.4.1 and Appendix A.4.2.
2.4.1. Utilizing GEE in Phylogenetic Poisson Regression
Within the domain of evolutionary biology, GEE have become an indispensable tool for scrutinizing count data with inherent correlation structures. This correlation could either be explicitly defined or need estimation. GEE can work with various correlation structures, including independence, exchangeable, autoregressive order 1, and unstructured, as discussed in [
15].
A pioneering application of GEE in comparative biology was presented by [
9], where the correlation structure is derived from a phylogenetic tree, thereby accounting for the evolutionary interrelations between species. This framework significantly broadens the ability to analyze comparative data, particularly within the Poisson regression model context.
Suppose we are given a group of
n species associated with a trait vector
. Consider a count response variable
for the
ith observation with an associated mean rate
. The density function for this variable follows a Poisson distribution and can be represented in an exponential form through a simple logarithmic transformation (
). Within the GEE framework, the first and second moments,
and
, can be derived directly from the link function’s derivatives and its inverse, as given in Equation (8).
This approach enables a robust calculation of both the expected value and variance of the response variable, taking into account the phylogenetically structured correlation in the data.
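The step from the Poisson density to its first two moments can be written out as a short worked derivation. The symbols $\lambda_i$ for the mean rate, $\theta_i$ for the natural parameter, and $b(\cdot)$ for the cumulant function are notation assumed here, following the usual exponential-family convention.

```latex
\begin{aligned}
f(y_i; \lambda_i) &= \frac{\lambda_i^{y_i} e^{-\lambda_i}}{y_i!}
  = \exp\{\, y_i \theta_i - e^{\theta_i} - \log y_i! \,\},
  \qquad \theta_i = \log \lambda_i, \\
b(\theta_i) &= e^{\theta_i}, \\
\mathrm{E}[y_i] &= b'(\theta_i) = e^{\theta_i} = \lambda_i, \\
\mathrm{Var}[y_i] &= b''(\theta_i) = e^{\theta_i} = \lambda_i .
\end{aligned}
```

The equality of the last two lines is the equidispersion property that distinguishes the Poisson model from the negative binomial model.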
GEE is used to estimate regression parameters in
, employing the chain rule to compute the derivative of the negative log-likelihood function. This process yields an expression involving the
ith regression parameter’s partial derivative, which can be cast into matrix form, offering a comprehensive perspective on the regression estimates across all observations and parameters. The variance-covariance matrix was further refined [
9] for use in phylogenetic comparative analyses, proposing as a combination of the phylogenetic correlation matrix
. The general estimating equation in Equation (
7) can be written in matrix form shown in Equation (
9).
where
and
).
Given a set of response variables
and design matrix
, the regression parameters
can be estimated by solving this nonlinear equation system, providing an exhaustive characterization of trait data within their phylogenetic context (see
Appendix A.3.1).
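Solving such a nonlinear system can be sketched with a Fisher-scoring iteration. The code below is a minimal illustration, not the authors’ implementation: it assumes a single covariate, a log link, the Poisson variance function, and a working covariance of the form A^{1/2} R A^{1/2}, where R is a correlation matrix obtained from the tree. The function names, starting values, and toy inputs are hypothetical.

```python
import math


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_phylo_poisson_gee(x, y, corr, max_iter=100, tol=1e-10):
    """Solve D^T V^{-1} (y - mu) = 0 for (b0, b1) by Fisher scoring."""
    n = len(y)
    beta = [math.log(sum(y) / n), 0.0]  # assumed starting values
    design = [[1.0, xi] for xi in x]
    for _ in range(max_iter):
        mu = [math.exp(beta[0] + beta[1] * xi) for xi in x]
        # Working covariance: Poisson variances combined with tree correlation.
        v = [[math.sqrt(mu[i] * mu[j]) * corr[i][j] for j in range(n)]
             for i in range(n)]
        # Columns of D = d mu / d beta; for the log link, mu_i * design[i][k].
        d_cols = [[mu[i] * design[i][k] for i in range(n)] for k in range(2)]
        resid = [yi - mi for yi, mi in zip(y, mu)]
        v_inv_resid = solve(v, resid)
        v_inv_d = [solve(v, col) for col in d_cols]
        score = [sum(d_cols[k][i] * v_inv_resid[i] for i in range(n))
                 for k in range(2)]
        info = [[sum(d_cols[k][i] * v_inv_d[l][i] for i in range(n))
                 for l in range(2)] for k in range(2)]
        step = solve(info, score)
        beta = [beta[0] + step[0], beta[1] + step[1]]
        if abs(step[0]) + abs(step[1]) < tol:
            break
    return beta


# Hypothetical correlation matrix for five tips and made-up toy data.
R_toy = [
    [1.0, 0.4, 0.0, 0.0, 0.0],
    [0.4, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.5, 0.0],
    [0.0, 0.0, 0.5, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]
x_toy = [0.0, 0.5, 1.0, 1.5, 2.0]
y_toy = [2, 2, 4, 5, 8]
beta_hat = fit_phylo_poisson_gee(x_toy, y_toy, R_toy)
```

Replacing `R_toy` with the identity matrix recovers the independent Poisson regression, which offers a simple way to see how much the tree changes the estimates.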
2.4.2. Applying GEE in Negative Binomial Regression
In biological research, there is a need to use the GEE method for negative binomial regression, primarily because of its ability to accommodate the overdispersion commonly observed in biological data. It also facilitates adjustments for non-independence resulting from repeated measures, phylogenetic structure, or spatial and temporal autocorrelation, offering significant benefits for applications in evolutionary ecology, population biology, and comparative phylogenetics [
37].
In this section, we explore the application of the GEE in negative binomial regression, emphasizing its use in phylogenetic comparative methods. The negative binomial distribution is characterized by parameters r and p, which correspond to the number of successes and the success probability in each trial, respectively.
To conduct a negative binomial regression using the GEE, we employ the canonical log-link function, linking the mean response to the linear predictors. This log-link function, in the context of negative binomial regression, is expressed in terms of
r and the mean response
(i.e.,
). Implementing the GEE necessitates specifying the mean, link, and variance functions. In a negative binomial regression context, the mean function
and the variance function can be written as in Equation (
10)
To determine the regression estimates for
, we express the link function and the variance function in terms of the observed variables and
. Subsequently, we compute the partial derivative of
with respect to
, which is crucial for solving the GEE in Equation (
7).
From the foundational assumptions, we can derive estimating equations for the regression parameters
. These equations, also referred to as GEE and seen in Equation (
7), yield consistent estimators of
. Their expression in a matrix form, depicted in Equation (
11), greatly facilitates solving the nonlinear system for
. In the development of the phylogenetic negative binomial regression, the GEE is transformed into a matrix form to encapsulate the phylogenetic correlation matrix,
. This matrix encodes the phylogenetic relationships among species. The process of integrating
into deriving the phylogenetic negative binomial regression can be represented by the matrix equation in Equation (
11).
where
and
. This matrix-based expression of the GEE facilitates solving the nonlinear system for
(see
Appendix A.3.2).
The GEE offers a flexible and robust approach to modeling phylogenetic comparative data using negative binomial regression, especially in the presence of overdispersion. Effectively incorporating this into phylogenetic comparative methods can significantly advance our understanding of evolutionary patterns and processes. To test for the significance of the effect, we use the bootstrap technique [
38] to generate samples and re-estimate the parameters, from which confidence intervals are constructed for the empirical analysis. The bootstrap means and standard errors of the regression parameters are reported.
3. Results
To assess the efficacy of our proposed method, we conducted a simulation focused on evaluating the parameter estimation of both regression models. Details regarding the simulation process can be found in
Section 3.1. Furthermore, the outcomes specific to the phylogenetic Poisson regression model and the phylogenetic negative binomial regression model are presented in
Section 3.1.1 and
Section 3.1.2, respectively.
3.1. Simulation
To evaluate the method, we performed a simulation assessing the two regression models with respect to parameter estimation. The simulation uses four taxa sizes (n = 16, 32, 64, 128) and four types of trees: coalescent tree, balanced tree, left tree, and star tree. One covariate is used for the assessment of the model, and the true parameter for
is set to
. Subsequently, the parameters for simulating responses are computed using the mean function and variance function for the Poisson distribution (as shown in Equation (
8)), and the Negative Binomial distribution (as shown in Equation (
10)), respectively. The simulation uses 1000 replicates.
Simulate discrete trait: The ordsamplep.poi function we created initiates the generation of simulated data for a phylogenetic Poisson regression model. It produces values from a multivariate normal distribution with zero mean and covariance matrix derived from the phylogenetic tree. These values are then transformed into Poisson-distributed variables using the qpois function, aligning with a Poisson distribution for a particular mean function parameter. Consequently, the simulated data mimics count traits with phylogenetic correlation, well-suited for phylogenetic Poisson regression analysis.
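The two-step idea described above (correlated normal draws followed by a Poisson quantile transform) can be sketched as follows. This is an illustration of the general technique rather than a reproduction of the ordsamplep.poi function: the helper names are hypothetical, the correlation matrix is assumed to have a unit diagonal so that the normal draws map to uniform variates, and `poisson_quantile` is a plain re-implementation of the quantity that R’s qpois returns.

```python
import math
import random


def cholesky(a):
    """Lower-triangular factor L with L @ L^T = a (a must be positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def poisson_quantile(u, lam):
    """Smallest k with P(Y <= k) >= u for Y ~ Poisson(lam)."""
    u = min(u, 1.0 - 1e-12)  # guard against u == 1, which has no finite quantile
    k = 0
    term = math.exp(-lam)
    cum = term
    while cum < u:
        k += 1
        term *= lam / k
        cum += term
    return k


def simulate_phylo_poisson(corr, means, rng):
    """Draw one vector of correlated counts with the given Poisson means."""
    n = len(means)
    low = cholesky(corr)
    z_indep = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Correlated standard normals: z = L @ z_indep has covariance `corr`.
    z = [sum(low[i][k] * z_indep[k] for k in range(i + 1)) for i in range(n)]
    return [poisson_quantile(normal_cdf(zi), mi) for zi, mi in zip(z, means)]


# Hypothetical three-tip correlation matrix and covariate values.
R_toy = [[1.0, 0.4, 0.0], [0.4, 1.0, 0.0], [0.0, 0.0, 1.0]]
x_toy = [0.2, 0.5, 1.0]
means = [math.exp(0.5 + 0.8 * xi) for xi in x_toy]
counts = simulate_phylo_poisson(R_toy, means, random.Random(1))
```

Swapping `poisson_quantile` for a negative binomial quantile function would give the analogous generator for overdispersed counts.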
Similarly, the
ordsamplep.nb2 function we created, backed by the
MASS library [
39], generates simulated data for the phylogenetic negative binomial regression model. It begins by creating random multivariate normal distribution values, consistent with the variance-covariance matrix
of the phylogenetic tree. These values are then transformed into negative binomially distributed variables using the
qnbinom function with a negative binomial distribution for a particular mean function
parameter. As a result, the simulated data manifests count traits with phylogenetic dependencies, providing an ideal testing ground for the phylogenetic negative binomial regression model.
When scaling the tree, each branch is assigned a length of less than 1. This can result in zero counts being generated due to the short branch lengths when using count random generators such as a Poisson or negative binomial. Hence, it is imperative to give careful consideration to tree lengths, especially when assessing discrete character changes. Trees of shorter lengths tend to show minimal variation, often exhibiting just 0, 1, or 2 changes from their root to their tip. Hence, expanding these trees by adding more tips might not yield much additional information. Conversely, for elongated trees that average around 15 changes, the varied branches could be more informative, potentially leading to more refined estimates. Instead of merely normalizing tree height, there is merit in exploring the dynamics of taller trees.
Simulate quantitative covariate trait: the predictive trait can be assumed to follow a Brownian motion with root value
estimated from the Brownian motion model [
40] with rate parameter
. This can be directly applied to the multivariate normal distribution
as the joint distribution for each Brownian motion random variable is again a normal distribution [
33,
41]. For a non-normally distributed trait, one can consider simulating the covariate
from the exponential distribution with a known rate parameter.
3.1.1. Phylogenetic Poisson Regression
The response data
are simulated using the quantile function of the Poisson distribution with the specified mean
and the covariate
simulated by the multivariate normal distribution with mean 0 and covariance
. Then, the phylogenetic Poisson regression model is fitted to the samples. For each combination of taxa size and tree type, 1000 samples are simulated, and the mean estimates and standard deviations of the regression parameters are reported in
Table 2.
In
Table 2, parameter estimates for a phylogenetic Poisson regression model under four types of tree (coalescent, balanced, left, star) and four taxa sizes (
) are presented. Specifically, it reports the mean and standard deviation (in parentheses) of the estimates for the parameters
and
. Furthermore, the means of the parameter estimates seem to be fairly consistent across the various taxa sizes for each tree type. This indicates the robustness of these estimates to the size of taxa considered in the model.
One important observation from the table is the trend of the standard deviations across different taxa sizes, as also shown in
Figure 3. For each tree type and parameter (
and
), the standard deviation appears to decrease as the taxa size increases from 16 to 128. This suggests that the precision of the parameter estimates improves with increasing taxa size, which is consistent with the idea that larger sample sizes generally provide more precise estimates in statistical analyses. In other words, the estimates for
and
become more reliable and less variable with the increase in taxa size.
3.1.2. Negative Binomial Regression
Given the covariate samples
, true parameters
and
r which is set to
, the response data
are simulated from the negative binomial distribution with specified mean
with dispersion parameter
. Then, the phylogenetic negative binomial regression model is fitted to the samples. For each combination of taxa size and tree type, 1000 samples are simulated, and the mean estimates and standard deviations of the regression parameters are reported in
Table 3 and
Figure 4.
The parameter estimation results as shown in
Table 3 and
Figure 4 give valuable insights into the behavior of phylogenetic negative binomial regression across different tree types and taxon sizes.
From
Table 3, it becomes clear that as the taxa size increases, the mean estimates for the intercept (
) tend to converge more closely to their true values. Meanwhile, the mean estimates for the slope (
) are close to the true value, albeit with a relatively larger standard deviation. This observation reinforces that the phylogenetic negative binomial regression model is performing within expectations, demonstrating its capability to furnish relatively precise parameter estimates across varied conditions. Yet, a deeper exploration into the nuances of parameter estimation within this model reveals challenges in identifying a consistent overarching trend. Some taxa sizes exhibit pronounced variability, marked by significant standard deviations, complicating any straightforward trend interpretation. The quest for consistency across different tree types also proves elusive. This deviation is in sharp contrast to the more discernible patterns typically observed in the phylogenetic Poisson regression model. Such disparities underscore the intricate challenges associated with the phylogenetic negative binomial regression, especially when juxtaposed against other regression frameworks.
One explanation for these larger variations can be found in the nature of the estimation process itself. As mentioned in the text, the estimation of these parameters includes the solving of nonlinear equations (see Equation (
11)). Such equations, especially when applied to complex biological data such as phylogenetic trees, can lead to a wide range of solutions. This might explain the relatively large standard deviations observed in these results. It is also worth mentioning that while some variability in the estimates is expected and indeed necessary for the model to adapt to different data structures, overly large variances might compromise the precision of the model. Therefore, this is a point that might warrant further investigation and potential refinements to the model or the estimation process.
As shown in
Table 3 and
Figure 4, the high variances could reduce the precision of the model. They likely stem from the difficulty of solving the nonlinear estimating equations on phylogenetically structured data. Possible remedies include more robust solution algorithms, as will be discussed later, and a lower (stricter) convergence tolerance, which could reduce the divergence among results.
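The role of the convergence tolerance can be illustrated with a minimal sketch. It is not Equation (11) of this paper: it solves the ordinary, non-phylogenetic Poisson score equations for a single covariate by Fisher scoring. The function name `fit_poisson`, the toy data, and the tolerance values are assumptions made purely for illustration.

```python
import math

def fit_poisson(x, y, tol=1e-8, max_iter=100):
    """Solve the Poisson score equations for log(mu) = b0 + b1*x by Fisher scoring.

    Returns (b0, b1, iterations_used). Illustrative only: independent
    observations, no phylogenetic correlation structure.
    """
    # Start from the intercept-only solution: b0 = log(mean(y)), b1 = 0.
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for it in range(1, max_iter + 1):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector U = X^T (y - mu)
        u0 = sum(yi - mi for yi, mi in zip(y, mu))
        u1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Fisher information I = X^T diag(mu) X (symmetric 2x2 matrix)
        i00 = sum(mu)
        i01 = sum(xi * mi for xi, mi in zip(x, mu))
        i11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = i00 * i11 - i01 * i01
        # Update step: delta = I^{-1} U, using the closed-form 2x2 inverse
        d0 = (i11 * u0 - i01 * u1) / det
        d1 = (i00 * u1 - i01 * u0) / det
        b0, b1 = b0 + d0, b1 + d1
        # Stop once the largest parameter update falls below the tolerance
        if max(abs(d0), abs(d1)) < tol:
            return b0, b1, it
    raise RuntimeError("Fisher scoring did not converge")

# Toy data (hypothetical, not taken from the paper)
x = [0, 1, 2, 3, 4]
y = [1, 3, 4, 9, 14]

for tol in (1e-2, 1e-10):
    b0, b1, n_iter = fit_poisson(x, y, tol=tol)
    print(f"tol={tol:g}: b0={b0:.6f}, b1={b1:.6f}, iterations={n_iter}")
```

A stricter tolerance costs extra iterations but returns a solution at which the score equations are satisfied more exactly, so repeated fits are less likely to differ merely because the solver stopped early.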
Comparing the two models via
Table 2 and
Table 3 reveals key differences between the phylogenetic Poisson regression and phylogenetic negative binomial regression models. The phylogenetic Poisson regression model shows consistent parameter estimates across taxa sizes, with the estimates of both regression parameters clustering closely around their true values of 3 and 5 across the various tree types. This consistency is accompanied by a remarkably small standard deviation, suggesting a high degree of precision. In contrast, the phylogenetic negative binomial regression model displays more variability in its estimates. Although its estimates are also close to the true values, they deviate more than those of the phylogenetic Poisson regression model, and the larger standard deviations point towards greater uncertainty. Despite the higher variability, phylogenetic negative binomial regression could be more suitable under less predictable conditions, while phylogenetic Poisson regression provides stable estimates and is reliable under steady scenarios.
3.2. Empirical Analysis
Building upon our simulation results, we proceeded to apply our proposed models to real-world empirical datasets. These results served to contextualize and validate our simulated observations, enabling us to examine the models’ efficacy in real-life scenarios. The patterns of variability noted in the simulations across tree types and taxa sizes were echoed in the empirical studies, reinforcing our understanding of these dynamics. The use of the phylogenetic negative binomial regression model on the lizard and mammalian datasets also emphasized the model’s applicability to count variables in a real biological context. Thus, these empirical analyses provide tangible insights that complement and substantiate our simulation findings.
In our empirical analysis, we currently make use of two different datasets, as outlined in
Table 1. The first dataset refers to lizards, with a specific focus on egg count (a count variable) [
21]. The second dataset is derived from mammalian data, where the variable of interest is the size of the litter, which refers to the simultaneous live birth of multiple offspring of a single mother [
22].
The efficacy of the phylogenetic negative binomial regression model is tested against these two datasets. In
Section 3.2.1, we apply this model to the lizard dataset to examine egg count in relation to body mass [
21]. For the mammalian dataset, detailed in
Section 3.3, we use this model to investigate litter size in response to factors such as number of teats, litter size, longevity, and body mass [
22]. These empirical assessments serve to underscore the utility of the phylogenetic negative binomial regression model in the study of count variables.
3.2.1. Lizard’s Egg-Laying Count
In various species observed in nature, there appears to be an inverse relationship between egg mass and the number of eggs laid per incubation. For instance, despite having a similar body size to chickens, the kiwi bird produces only one egg, while chickens lay multiple eggs. In our research, we have employed data that were previously collected and studied by [
21]. These data primarily focus on the body size, represented as Snout–Vent Length (SVL), of the lizard species
S. undulatus. Covariates such as age at maturity, egg mass, clutch size, and total eggs were incorporated in the regression analysis, with the response variable being the number of litters.
To enhance the reproducibility of our methodology, we have thoroughly detailed our data pre-processing steps. Initially, the raw data from [
21] were collected and compiled in
Table A1, found in
Appendix A.2.1. This table illustrates the mean values of life history count variables for all Sceloporus populations, with the sources for the life-history data and mtDNA specified in the final two columns [
42].
We then employed this dataset in our regression analyses, correlating the aforementioned covariates with the number of litters. It is worth mentioning that the phylogenetic tree of the lizard is also based on the study by [
21] and is visually represented in
Figure 5. Documenting each of these steps is intended to make the analysis replicable.
The regression estimates for the model are shown in
Table 4.
Both the Poisson regression coefficient and the negative binomial regression coefficient can be interpreted as follows: for a one-unit increase in a predictor variable, the log of the expected count of the response variable changes by the corresponding regression coefficient, given that the other predictor variables in the model are held constant.
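Because the coefficients act on the log scale, exponentiating a coefficient gives the multiplicative change in the expected count. The short sketch below shows this conversion. The helper name `rate_ratio` and the coefficient value are hypothetical and are not estimates from Table 4.

```python
import math

def rate_ratio(coef, delta=1.0):
    """Multiplicative change in the expected count when the predictor rises by `delta`.

    Under a log link, log(mu) changes by coef * delta, so mu is multiplied
    by exp(coef * delta).
    """
    return math.exp(coef * delta)

# Made-up coefficient for illustration only (not taken from Table 4)
beta_hypothetical = -0.5

ratio = rate_ratio(beta_hypothetical)
print(f"rate ratio for a one-unit increase: {ratio:.3f}")
print(f"percent change in expected count: {100 * (ratio - 1):.1f}%")
```

A negative coefficient therefore corresponds to a rate ratio below one, that is, a proportional decrease in the expected count for each one-unit increase in the predictor.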
In the negative binomial regression (glm.nb), the Egg Mass (EM) coefficient () is . In practical terms, an increase in Egg Mass by one unit results in a decrease in the log of expected counts of Eggs Per Year (EPY) by unit. This model, with a standard deviation of , confirms the inverse association between egg size and the number of eggs laid per year.
The Poisson regression (glm.poi) exhibits an EM coefficient () of . That is, an increase in EM by one unit leads to a decrease in the log of expected counts of EPY by unit. With a standard deviation of , this model reveals a more pronounced inverse relationship between egg size and annual egg production compared to the negative binomial models.
In the phylogenetic negative binomial regression (phygee.nb), the coefficient of EM () is . This indicates that an increase in EM by one unit results in a unit reduction in the log of expected EPY counts. With a standard deviation of , this phylogenetic model indicates a slightly stronger inverse correlation between egg size and number laid per year than the glm.nb model.
The phylogenetic Poisson regression via GEEs (phygee.poi) presents an EM coefficient () of . This suggests that for every increase in EM of one unit, the log of expected EPY counts decreases by . The model has a standard deviation of . Although the phylogenetic model demonstrates a less pronounced effect of egg mass on yearly egg production than the non-phylogenetic Poisson model, it still exhibits a stronger correlation than the negative binomial models.
The comparative analysis of these four models provides some valuable insights. The negative binomial models (both non-phylogenetic and phylogenetic) show a consistent negative relationship between egg size and annual egg production, albeit with slightly smaller effect sizes. This aligns with existing studies, which also suggest this inverse relationship. Our work extends the understanding of this relationship by employing both GLMs and generalized estimating equations (GEEs), the latter of which account for the evolutionary relationships between species.
In comparison, the Poisson models (both non-phylogenetic and phylogenetic) indicate a more pronounced inverse relationship between egg size and annual egg production, which extends the findings of previous research. These results suggest that the use of different statistical models can reveal nuanced details about biological relationships that would not be as evident with a single model. The regression curves are presented in
Figure 6.
In summary, the regression models in
Table 4 suggest a consistent trend across both negative binomial and Poisson regressions, and their respective phylogenetic versions. All point towards the same biological interpretation: larger egg sizes are associated with fewer eggs being laid per year, with this effect being somewhat stronger in the Poisson models. As illustrated in
Figure 6, the negative binomial regression exhibits greater variation and broader confidence intervals than the Poisson regression, whether in phylogenetic or standard contexts. It is worth noting that various genetic and environmental factors can influence egg size in lizards, including the lineage, ambient temperature, and overall health of the animal. A critical observation is the apparent trade-off between egg size and the number of eggs produced annually, potentially representing an adaptive response to optimize offspring survival. Larger eggs might yield stronger, more resilient offspring, but at the cost of reduced egg quantity. This trade-off carries implications for reproductive strategies, population dynamics, and the broader evolutionary course of different lizard species. Understanding this phenomenon further would yield important insights into lizard life history strategies and their responses to environmental changes.
3.3. Litter Size in Mammals
In mammals, there is a general pattern where the maximum litter size is often constrained by the number of teats, and typically, the average litter size is about half the number of teats. This trend, however, can vary across different species [
45]. For instance, the naked mole-rat (
Heterocephalus glaber) presents an interesting deviation. It has approximately 12 nipples, but its average litter size is about 11 pups, significantly higher than the typical half. Moreover, the litter size can range from 3 to 12 pups and can even reach as high as 28 in some instances [
46].
The need for a comprehensive understanding inspired us to devise a new methodology. Our study incorporates the collection of data pertaining to mammal litter sizes and other traits, such as body mass, gestation length, weaning age, height, and other relevant measurements, as detailed in [
22]. The trait data depicted in
Table A2 was obtained from [
47] (see
Appendix A.2.2). We further integrated the mammalian phylogenetic tree, as shown in
Figure 7, derived from Phylotastic [
48] in a manner similar to [
49]. The featured phylogeny encompasses 30 species with complete datasets across all four traits under consideration.
With the data collected and integrated, we now turn to the statistical models used to analyze them. Under the assumption that the observations are independently distributed, parameter estimation falls within the GLM framework. For the phylogenetic negative binomial regression analysis, initial parameter estimates are computed using the R function glm with the Poisson family, which provides the starting point for the subsequent analysis.
The regression estimates for the model are shown in
Table 5.
In the negative binomial regression (glm.nb), biological factors impact the expected log count of Litter Size per Year (LY). A one-unit increase in Litter Mean Body Size (LS) or Offspring Value (OV) reduces the log count of LY by and unit respectively, all else being equal. Longevity (LG) also has a smaller, negative impact, with a decrease per unit increase. Contrastingly, a unit increase in Spread (SP) increases the LY log count by unit.
For the phylogenetic negative binomial regression (phygee.nb), the same biological factors show slightly altered impacts but maintain their directions. The log count of LY decreases by , , and unit with each unit increase in LS, OV, and LG, respectively. However, a unit rise in SP increases the LY log count by unit.
In the Poisson regression (glm.poi), each unit increase in LS, OV, and LG reduces the log count of LY by , , and unit, respectively. Conversely, a unit rise in SP increases the log count of LY by unit.
In the phylogenetic Poisson regression (phygee.poi), each unit increase in LS, OV, and LG leads to a decrease in the log count of LY by , , and unit, respectively. In contrast, a unit rise in SP increases the LY log count by unit.
In summary, across all models, an increase in each of LS, OV, and LG, while holding all other predictors in the model constant, is associated with a decrease in the expected log count of LY, while an increase in SP is associated with an increase in the expected log count of LY. However, the magnitude of these effects varies between the models. While the Poisson models generally estimate larger effects than the negative binomial models, the negative binomial models account for larger variation than the Poisson models. In addition, the phylogenetic models estimate slightly different effects compared to their non-phylogenetic counterparts.
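The statement that the negative binomial models account for larger variation can be illustrated with a small simulation. The sketch assumes the common gamma-Poisson (NB2) parameterization, in which the variance equals mu + mu^2/theta, and this may differ from the parameterization used in this paper. The function names, the mean of 5, and the dispersion value of 2 are hypothetical choices for illustration.

```python
import math
import random
import statistics

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method (suited to small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def sample_negbin(mu, theta, rng):
    """Draw one negative binomial variate as a gamma-Poisson mixture.

    The Poisson rate is Gamma(shape=theta, scale=mu/theta), so that
    E[Y] = mu and Var[Y] = mu + mu**2 / theta (assumed NB2 form).
    """
    lam = rng.gammavariate(theta, mu / theta)
    return sample_poisson(lam, rng)

rng = random.Random(1)          # fixed seed for reproducibility
mu, theta, n = 5.0, 2.0, 20000  # hypothetical mean, dispersion, sample size

pois = [sample_poisson(mu, rng) for _ in range(n)]
negbin = [sample_negbin(mu, theta, rng) for _ in range(n)]

for name, sample in (("Poisson", pois), ("negative binomial", negbin)):
    m = statistics.mean(sample)
    v = statistics.pvariance(sample)
    print(f"{name}: mean={m:.2f}, variance={v:.2f}, variance/mean={v / m:.2f}")
```

With equal means, the Poisson sample has a variance close to its mean, whereas the negative binomial sample is overdispersed. This extra variance term is what allows the negative binomial models to absorb more variation in the counts, at the cost of wider uncertainty around the fitted curve.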