Search | arXiv e-print repository

arXiv:2407.15020 [pdf]

Integrating Attentional Factors and Spacing in Logistic Knowledge Tracing Models to Explore the Impact of Training Sequences on Category Learning

Authors: Meng Cao, Philip I. Pavlik Jr., Wei Chu, Liang Zhang

Abstract: In category learning, a growing body of literature has increasingly focused on exploring the impacts of interleaving in contrast to blocking. The sequential attention hypothesis posits that interleaving draws attention to the differences between categories while blocking directs attention toward similarities within categories. Although a recent study underscores the joint influence of memory and attentional factors on sequencing effects, there remains a scarcity of effective computational models integrating both attentional and memory considerations to comprehensively understand the effect of training sequences on students' performance. This study introduces a novel integration of attentional factors and spacing into the logistic knowledge tracing (LKT) models to monitor students' performance across different training sequences (interleaving and blocking). Attentional factors were incorporated by recording the counts of comparisons between adjacent trials, considering whether they belong to the same or different category. Several features were employed to account for temporal spacing. We used cross-validations to test the model fit and predictions on the learning session and posttest. Our findings reveal that incorporating both attentional factors and spacing features in the Additive Factors Model (AFM) significantly enhances its capacity to capture the effects of interleaving and blocking and demonstrates superior predictive accuracy for students' learning outcomes. By bridging the gap between attentional factors and memory processes, our computational approach offers a more comprehensive framework for understanding and predicting category learning outcomes in educational settings.

Submitted 22 June, 2024; originally announced July 2024.

Comments: 7 pages, 3 figures, Educational Data Mining 2024

arXiv:2407.14335 [pdf, other]

Quantifying the Blockchain Trilemma: A Comparative Analysis of Algorand, Ethereum 2.0, and Beyond

Authors: Yihang Fu, Mingwei Jing, Jiaolun Zhou, Peilin Wu, Ye Wang, Luyao Zhang, Chuang Hu

Abstract: Blockchain technology is essential for the digital economy and metaverse, supporting applications from decentralized finance to virtual assets. However, its potential is constrained by the "Blockchain Trilemma," which necessitates balancing decentralization, security, and scalability. This study evaluates and compares two leading proof-of-stake (PoS) systems, Algorand and Ethereum 2.0, against these critical metrics. Our research interprets existing indices to measure decentralization, evaluates scalability through transactional data, and assesses security by identifying potential vulnerabilities. Utilizing real-world data, we analyze each platform's strategies in a structured manner to understand their effectiveness in addressing trilemma challenges. The findings highlight each platform's strengths and propose general methodologies for evaluating key blockchain characteristics applicable to other systems. This research advances the understanding of blockchain technologies and their implications for the future digital economy. Data and code are available on GitHub as open source.

Submitted 19 July, 2024; originally announced July 2024.

arXiv:2407.04967 [pdf, other]

posteriordb: Testing, Benchmarking and Developing Bayesian Inference Algorithms

Authors: Måns Magnusson, Jakob Torgander, Paul-Christian Bürkner, Lu Zhang, Bob Carpenter, Aki Vehtari

Abstract: The generality and robustness of inference algorithms are critical to the success of widely used probabilistic programming languages such as Stan, PyMC, Pyro, and Turing.jl. When designing a new general-purpose inference algorithm, whether it involves Monte Carlo sampling or variational approximation, the fundamental problem arises in evaluating its accuracy and efficiency across a range of representative target models. To solve this problem, we propose posteriordb, a database of models and data sets defining target densities along with reference Monte Carlo draws. We further provide a guide to the best practices in using posteriordb for model evaluation and comparison. To provide a wide range of realistic target densities, posteriordb currently comprises 120 representative models and has been instrumental in developing several general inference algorithms.

Submitted 6 July, 2024; originally announced July 2024.

arXiv:2406.18603 [pdf, other]

Confidence interval estimation of mixed oil length with conditional diffusion model

Authors: Yanfeng Yang, Lihong Zhang, Ziqi Chen, Miaomiao Yu, Lei Chen

Abstract: Accurately estimating the mixed oil length plays a big role in the economic benefit of oil pipeline networks. While various proposed methods have tried to predict the mixed oil length, they often exhibit an extremely high probability (around 50%) of underestimating it. This is attributed to their failure to consider the statistical variability inherent in the estimated length of mixed oil. To address such issues, we propose to use the conditional diffusion model to learn the distribution of the mixed oil length given pipeline features. Subsequently, we design a confidence interval estimation for the length of the mixed oil based on the pseudo-samples generated by the learned diffusion model. To our knowledge, we are the first to present an estimation scheme for the confidence interval of the oil-mixing length that considers statistical variability, thereby reducing the possibility of underestimating it. When employing the upper bound of the interval as a reference for excluding the mixed oil, the probability of underestimation can be as low as 5%, a substantial reduction compared to 50%. Furthermore, utilizing the mean of the generated pseudo-samples as the estimator for the mixed oil length enhances prediction accuracy by at least 10% compared to commonly used methods.

Submitted 19 June, 2024; originally announced June 2024.

arXiv:2406.18035 [pdf, other]

Local Linear Recovery Guarantee of Deep Neural Networks at Overparameterization

Authors: Yaoyu Zhang, Leyang Zhang, Zhongwang Zhang, Zhiwei Bai

Abstract: Determining whether deep neural network (DNN) models can reliably recover target functions at overparameterization is a critical yet complex issue in the theory of deep learning. To advance understanding in this area, we introduce a concept we term "local linear recovery" (LLR), a weaker form of target function recovery that renders the problem more amenable to theoretical analysis. In the sense of LLR, we prove that functions expressible by narrower DNNs are guaranteed to be recoverable from fewer samples than model parameters. Specifically, we establish upper limits on the optimistic sample sizes, defined as the smallest sample size necessary to guarantee LLR, for functions in the space of a given DNN. Furthermore, we prove that these upper bounds are achieved in the case of two-layer tanh neural networks. Our research lays a solid groundwork for future investigations into the recovery capabilities of DNNs in overparameterized scenarios.

Submitted 25 June, 2024; originally announced June 2024.

Comments: arXiv admin note: text overlap with arXiv:2211.11623

arXiv:2406.16221 [pdf, other]

F-FOMAML: GNN-Enhanced Meta-Learning for Peak Period Demand Forecasting with Proxy Data

Authors: Zexing Xu, Linjun Zhang, Sitan Yang, Rasoul Etesami, Hanghang Tong, Huan Zhang, Jiawei Han

Abstract: Demand prediction is a crucial task for e-commerce and physical retail businesses, especially during high-stakes sales events. However, the limited availability of historical data from these peak periods poses a significant challenge for traditional forecasting methods. In this paper, we propose a novel approach that leverages strategically chosen proxy data reflective of potential sales patterns from similar entities during non-peak periods, enriched by features learned from a graph neural network (GNN)-based forecasting model, to predict demand during peak events. We formulate the demand prediction as a meta-learning problem and develop the Feature-based First-Order Model-Agnostic Meta-Learning (F-FOMAML) algorithm that leverages proxy data from non-peak periods and GNN-generated relational metadata to learn feature-specific layer parameters, thereby adapting to demand forecasts for peak events. Theoretically, we show that by considering domain similarities through task-specific metadata, our model achieves improved generalization, where the excess risk decreases as the number of training tasks increases. Empirical evaluations on large-scale industrial datasets demonstrate the superiority of our approach. Compared to existing state-of-the-art models, our method demonstrates a notable improvement in demand prediction accuracy, reducing the Mean Absolute Error by 26.24% on an internal vending machine dataset and by 1.04% on the publicly accessible JD.com dataset.

Submitted 23 June, 2024; originally announced June 2024.

MSC Class: 68T07; 68T05; 62M10; 62M20; 90C90; 91B84

arXiv:2406.15514 [pdf, other]

How big does a population need to be before demographers can ignore individual-level randomness in demographic events?

Authors: John Bryant, Tahu Kukutai, Junni L. Zhang

Abstract: When studying a national-level population, demographers can safely ignore the effect of individual-level randomness on age-sex structure. When studying a single community, or group of communities, however, the potential importance of individual-level randomness is less clear. We seek to measure the effect of individual-level randomness in births and deaths on standard summary indicators of age-sex structure, for populations of different sizes, focusing on demographic conditions typical of historical populations. We conduct a microsimulation experiment where we simulate events and age-sex structure under a range of settings for demographic rates and population size. The experiment results suggest that individual-level randomness strongly affects age-sex structure for populations of about 100, but has a much smaller effect on populations of 1,000, and a negligible effect on populations of 10,000. Our conclusion is that analyses of age-sex structure in historical populations with sizes on the order of 100 must account for individual-level randomness in demographic events. Analyses of populations with sizes on the order of 1,000 may need to make some allowance for individual-level variation, but other issues, such as measurement error, probably deserve more attention. Analyses of populations of 10,000 can safely ignore individual-level variation.

Submitted 20 June, 2024; originally announced June 2024.

Comments: 28 pages, 8 figures, 3 tables

MSC Class: 91-XX

arXiv:2406.05304 [pdf, other]

Polytomous Explanatory Item Response Models for Item Discrimination: Assessing Negative-Framing Effects in Social-Emotional Learning Surveys

Authors: Joshua B. Gilbert, Lijin Zhang, Esther Ulitzsch, Benjamin W. Domingue

Abstract: Modeling item parameters as a function of item characteristics has a long history but has generally focused on models for item location. Explanatory item response models for item discrimination are available but rarely used. In this study, we extend existing approaches for modeling item discrimination from dichotomous to polytomous item responses. We illustrate our proposed approach with an application to four social-emotional learning surveys of preschool children to investigate how item discrimination depends on whether an item is positively or negatively framed. Negative framing predicts significantly lower item discrimination on two of the four surveys, and a plausibly causal estimate from a regression discontinuity analysis shows that negative framing reduces discrimination by about 30% on one survey. We conclude with a discussion of potential applications of explanatory models for item discrimination.

Submitted 7 June, 2024; originally announced June 2024.

arXiv:2406.04655 [pdf, other]

Bayesian Inference for Spatial-temporal Non-Gaussian Data Using Predictive Stacking

Authors: Soumyakanti Pan, Lu Zhang, Jonathan R. Bradley, Sudipto Banerjee

Abstract: Analysing non-Gaussian spatial-temporal data typically requires introducing spatial dependence in generalised linear models through the link function of an exponential family distribution. However, unlike in Gaussian likelihoods, inference is considerably encumbered by the inability to analytically integrate out the random effects and reduce the dimension of the parameter space. Iterative estimation algorithms struggle to converge due to the presence of weakly identified parameters. We devise an approach that obviates these issues by exploiting generalised conjugate multivariate distribution theory for exponential families, which enables exact sampling from analytically available posterior distributions conditional upon some fixed process parameters. More specifically, we expand upon the Diaconis-Ylvisaker family of conjugate priors to achieve analytically tractable posterior inference for spatially-temporally varying regression models conditional on some kernel parameters. Subsequently, we assimilate inference from these individual posterior distributions over a range of values of these parameters using Bayesian predictive stacking. We evaluate inferential performance on simulated data, compare with fully Bayesian inference using Markov chain Monte Carlo and apply our proposed method to analyse spatially-temporally referenced avian count data from the North American Breeding Bird Survey database.

Submitted 7 June, 2024; originally announced June 2024.

Comments: 31 pages, 8 figures

arXiv:2406.03707 [pdf, other]

What Should Embeddings Embed? Autoregressive Models Represent Latent Generating Distributions

Authors: Liyi Zhang, Michael Y. Li, Thomas L. Griffiths

Abstract: Autoregressive language models have demonstrated a remarkable ability to extract latent structure from text. The embeddings from large language models have been shown to capture aspects of the syntax and semantics of language. But what should embeddings represent? We connect the autoregressive prediction objective to the idea of constructing predictive sufficient statistics to summarize the information contained in a sequence of observations, and use this connection to identify three settings where the optimal content of embeddings can be identified: independent identically distributed data, where the embedding should capture the sufficient statistics of the data; latent state models, where the embedding should encode the posterior distribution over states given the data; and discrete hypothesis spaces, where the embedding should reflect the posterior distribution over hypotheses given the data. We then conduct empirical probing studies to show that transformers encode these three kinds of latent generating distributions, and that they perform well in out-of-distribution cases and without token memorization in these settings.

Submitted 5 June, 2024; originally announced June 2024.

Comments: 15 pages, 8 figures

ACM Class: I.2; I.5

arXiv:2406.03628 [pdf, other]

Synthetic Oversampling: Theory and A Practical Approach Using LLMs to Address Data Imbalance

Authors: Ryumei Nakada, Yichen Xu, Lexin Li, Linjun Zhang

Abstract: Imbalanced data and spurious correlations are common challenges in machine learning and data science. Oversampling, which artificially increases the number of instances in the underrepresented classes, has been widely adopted to tackle these challenges. In this article, we introduce OPAL (OversamPling with Artificial LLM-generated data), a systematic oversampling approach that leverages the capabilities of large language models (LLMs) to generate high-quality synthetic data for minority groups. Recent studies on synthetic data generation using deep generative models mostly target prediction tasks. Our proposal differs in that we focus on handling imbalanced data and spurious correlations. More importantly, we develop a novel theory that rigorously characterizes the benefits of using the synthetic data, and shows the capacity of transformers in generating high-quality synthetic data for both labels and covariates. We further conduct intensive numerical experiments to demonstrate the efficacy of our proposed approach compared to some representative alternative solutions.

Submitted 5 June, 2024; originally announced June 2024.

Comments: 59 pages, 7 figures

arXiv:2406.02948 [pdf, other]

Copula-based semiparametric nonnormal transformed linear model for survival data with dependent censoring

Authors: Huazhen Yu, Lixin Zhang

Abstract: Although the independent censoring assumption is commonly used in survival analysis, it can be violated when the censoring time is related to the survival time, which often happens in many practical applications. To address this issue, we propose a flexible semiparametric method for dependent censored data. Our approach involves fitting the survival time and the censoring time with a joint transformed linear model, where the transformed function is unspecified. This allows for a very general class of models that can account for possible covariate effects, while also accommodating administrative censoring. We assume that the transformed variables have a bivariate nonnormal distribution based on parametric copulas and parametric marginals, which further enhances the flexibility of our method. We demonstrate the identifiability of the proposed model and establish the consistency and asymptotic normality of the model parameters under appropriate regularity conditions and assumptions. Furthermore, we evaluate the performance of our method through extensive simulation studies, and provide a real data example for illustration.

Submitted 5 June, 2024; originally announced June 2024.

arXiv:2406.01557 [pdf, other]

Bayesian compositional regression with flexible microbiome feature aggregation and selection

Authors: Satabdi Saha, Liangliang Zhang, Kim-Anh Do, Christine B. Peterson

Abstract: Ongoing advances in microbiome profiling have allowed unprecedented insights into the molecular activities of microbial communities. This has fueled a strong scientific interest in understanding the critical role the microbiome plays in governing human health, by identifying microbial features associated with clinical outcomes of interest. Several aspects of microbiome data limit the applicability of existing variable selection approaches. In particular, microbiome data are high-dimensional, extremely sparse, and compositional. Importantly, many of the observed features, although categorized as different taxa, may play related functional roles. To address these challenges, we propose a novel compositional regression approach that leverages the data-adaptive clustering and variable selection properties of the spiked Dirichlet process to identify taxa that exhibit similar functional roles. Our proposed method, Bayesian Regression with Agglomerated Compositional Effects using a dirichLET process (BRACElet), enables the identification of a sparse set of features with shared impacts on the outcome, facilitating dimension reduction and model interpretation. We demonstrate that BRACElet outperforms existing approaches for microbiome variable selection through simulation studies and an application elucidating the impact of oral microbiome composition on insulin resistance.

Submitted 3 June, 2024; originally announced June 2024.

arXiv:2405.18373 [pdf, other]

A Hessian-Aware Stochastic Differential Equation for Modelling SGD

Authors: Xiang Li, Zebang Shen, Liang Zhang, Niao He

Abstract: Continuous-time approximation of Stochastic Gradient Descent (SGD) is a crucial tool to study its escaping behaviors from stationary points. However, existing stochastic differential equation (SDE) models fail to fully capture these behaviors, even for simple quadratic objectives. Built on a novel stochastic backward error analysis framework, we derive the Hessian-Aware Stochastic Modified Equation (HA-SME), an SDE that incorporates Hessian information of the objective function into both its drift and diffusion terms. Our analysis shows that HA-SME matches the order-best approximation error guarantee among existing SDE models in the literature, while achieving a significantly reduced dependence on the smoothness parameter of the objective. Further, for quadratic objectives, under mild conditions, HA-SME is proved to be the first SDE model that recovers exactly the SGD dynamics in the distributional sense. Consequently, when the local landscape near a stationary point can be approximated by quadratics, HA-SME is expected to accurately predict the local escaping behaviors of SGD.

Submitted 28 May, 2024; originally announced May 2024.

arXiv:2405.14780 [pdf, other]

Metric Flow Matching for Smooth Interpolations on the Data Manifold

Authors: Kacper Kapusniak, Peter Potaptchik, Teodora Reu, Leo Zhang, Alexander Tong, Michael Bronstein, Avishek Joey Bose, Francesco Di Giovanni

Abstract: Matching objectives underpin the success of modern generative models and rely on constructing conditional paths that transform a source distribution into a target distribution. Despite being a fundamental building block, conditional paths have been designed principally under the assumption of Euclidean geometry, resulting in straight interpolations. However, this can be particularly restrictive for tasks such as trajectory inference, where straight paths might lie outside the data manifold, thus failing to capture the underlying dynamics giving rise to the observed marginals. In this paper, we propose Metric Flow Matching (MFM), a novel simulation-free framework for conditional flow matching where interpolants are approximate geodesics learned by minimizing the kinetic energy of a data-induced Riemannian metric. This way, the generative model matches vector fields on the data manifold, which corresponds to lower uncertainty and more meaningful interpolations. We prescribe general metrics to instantiate MFM, independent of the task, and test it on a suite of challenging problems including LiDAR navigation, unpaired image translation, and modeling cellular dynamics. We observe that MFM outperforms the Euclidean baselines, particularly achieving SOTA on single-cell trajectory prediction.

Submitted 23 May, 2024; originally announced May 2024.

arXiv:2405.04026 [pdf, other]

Federated Control in Markov Decision Processes

Authors: Hao Jin, Yang Peng, Liangyu Zhang, Zhihua Zhang

Abstract: We study problems of federated control in Markov Decision Processes. To solve an MDP with large state space, multiple learning agents are introduced to collaboratively learn its optimal policy without communication of locally collected experience. In our settings, these agents have limited capabilities, which means they are restricted within different regions of the overall state space during the training process. In the face of the difference among restricted regions, we first introduce concepts of leakage probabilities to understand how such heterogeneity affects the learning process, and then propose a novel communication protocol that we call Federated-Q protocol (FedQ), which periodically aggregates agents' knowledge of their restricted regions and accordingly modifies their learning problems for further training. In terms of theoretical analysis, we justify the correctness of FedQ as a communication protocol, then give a general result on sample complexity of derived algorithms FedQ-X with the RL oracle , and finally conduct a thorough study on the sample complexity of FedQ-SynQ. Specifically, FedQ-X has been shown to enjoy linear speedup in terms of sample complexity when workload is uniformly distributed among agents. Moreover, we carry out experiments in various environments to justify the efficiency of our methods.

Submitted 7 May, 2024; originally announced May 2024.

arXiv:2405.03236 [pdf, other]

Federated Reinforcement Learning with Constraint Heterogeneity

Authors: Hao Jin, Liangyu Zhang, Zhihua Zhang

Abstract: We study a Federated Reinforcement Learning (FedRL) problem with constraint heterogeneity. In our setting, we aim to solve a reinforcement learning problem with multiple constraints while $N$ training agents are located in $N$ different environments with limited access to the constraint signals and they are expected to collaboratively learn a policy satisfying all constraint signals. Such learning problems are prevalent in scenarios of Large Language Model (LLM) fine-tuning and healthcare applications. To solve the problem, we propose federated primal-dual policy optimization methods based on traditional policy gradient methods. Specifically, we introduce $N$ local Lagrange functions for agents to perform local policy updates, and these agents are then scheduled to periodically communicate on their local policies. Taking natural policy gradient (NPG) and proximal policy optimization (PPO) as policy optimization methods, we mainly focus on two instances of our algorithms, i.e., FedNPG and FedPPO. We show that FedNPG achieves global convergence with an $\tilde{O}(1/\sqrt{T})$ rate, and FedPPO efficiently solves complicated learning tasks with the use of deep neural networks.

Submitted 6 May, 2024; originally announced May 2024.

arXiv:2405.02225 [pdf, other]

Fair Risk Control: A Generalized Framework for Calibrating Multi-group Fairness Risks

Authors: Lujing Zhang, Aaron Roth, Linjun Zhang

Abstract: This paper introduces a framework for post-processing machine learning models so that their predictions satisfy multi-group fairness guarantees. Based on the celebrated notion of multicalibration, we introduce $(\mathbf{s},\mathcal{G}, α)-$GMC (Generalized Multi-Dimensional Multicalibration) for multi-dimensional mappings $\mathbf{s}$, constraint set $\mathcal{G}$, and a pre-specified threshold level $α$. We propose associated algorithms to achieve this notion in general settings. This framework is then applied to diverse scenarios encompassing different fairness concerns, including false negative rate control in image segmentation, prediction set conditional uncertainty quantification in hierarchical classification, and de-biased text generation in language models. We conduct numerical studies on several datasets and tasks.

Submitted 3 May, 2024; originally announced May 2024.

Comments: 28 pages, 8 figures, accepted by ICML2024

arXiv:2404.16287 [pdf, other]

Differentially Private Federated Learning: Servers Trustworthiness, Estimation, and Statistical Inference

Authors: Zhe Zhang, Ryumei Nakada, Linjun Zhang

Abstract: Differentially private federated learning is crucial for maintaining privacy in distributed environments. This paper investigates the challenges of high-dimensional estimation and inference under the constraints of differential privacy. First, we study scenarios involving an untrusted central server, demonstrating the inherent difficulties of accurate estimation in high-dimensional problems. Our findings indicate that the tight minimax rates depend on the high-dimensionality of the data even with sparsity assumptions. Second, we consider a scenario with a trusted central server and introduce a novel federated estimation algorithm tailored for linear regression models. This algorithm effectively handles the slight variations among models distributed across different machines. We also propose methods for statistical inference, including coordinate-wise confidence intervals for individual parameters and strategies for simultaneous inference. Extensive simulation experiments support our theoretical advances, underscoring the efficacy and reliability of our approaches.

Submitted 24 April, 2024; originally announced April 2024.

Comments: 56 pages, 3 figures

arXiv:2404.09353 [pdf, other]

A Unified Combination Framework for Dependent Tests with Applications to Microbiome Association Studies

Authors: Xiufan Yu, Linjun Zhang, Arun Srinivasan, Min-ge Xie, Lingzhou Xue

Abstract: We introduce a novel meta-analysis framework to combine dependent tests under a general setting, and utilize it to synthesize various microbiome association tests that are calculated from the same dataset. Our development builds upon the classical meta-analysis methods of aggregating $p$-values and also a more recent general method of combining confidence distributions, but makes generalizations to handle dependent tests. The proposed framework ensures rigorous statistical guarantees, and we provide a comprehensive study and compare it with various existing dependent combination methods. Notably, we demonstrate that the widely used Cauchy combination method for dependent tests, referred to as the vanilla Cauchy combination in this article, can be viewed as a special case within our framework. Moreover, the proposed framework provides a way to address the problem when the distributional assumptions underlying the vanilla Cauchy combination are violated. Our numerical results demonstrate that ignoring the dependence among the to-be-combined components may lead to a severe size distortion phenomenon. Compared to the existing $p$-value combination methods, including the vanilla Cauchy combination method, the proposed combination framework can handle the dependence accurately and utilizes the information efficiently to construct tests with accurate size and enhanced power. The development is applied to Microbiome Association Studies, where we aggregate information from multiple existing tests using the same dataset. The combined tests harness the strengths of each individual test across a wide range of alternative spaces, enabling more efficient and meaningful discoveries of vital microbiome associations.

Submitted 14 April, 2024; originally announced April 2024.

arXiv:2404.01608 [pdf, ps, other]

FAIRM: Learning invariant representations for algorithmic fairness and domain generalization with minimax optimality

Authors: Sai Li, Linjun Zhang

Abstract: Machine learning methods often assume that the test data have the same distribution as the training data. However, this assumption may not hold due to multiple levels of heterogeneity in applications, raising issues in algorithmic fairness and domain generalization. In this work, we address the problem of fair and generalizable machine learning by invariant principles. We propose a training environment-based oracle, FAIRM, which has desirable fairness and domain generalization properties under a diversity-type condition. We then provide an empirical FAIRM with finite-sample theoretical guarantees under weak distributional assumptions. We then develop efficient algorithms to realize FAIRM in linear models and demonstrate the nonasymptotic performance with minimax optimality. We evaluate our method in numerical experiments with synthetic data and MNIST data and show that it outperforms its counterparts.

Submitted 1 April, 2024; originally announced April 2024.

arXiv:2403.14926 [pdf, other]

Contrastive Learning on Multimodal Analysis of Electronic Health Records

Authors: Tianxi Cai, Feiqing Huang, Ryumei Nakada, Linjun Zhang, Doudou Zhou

Abstract: Electronic health record (EHR) systems contain a wealth of multimodal clinical data including structured data like clinical codes and unstructured data such as clinical notes. However, many existing EHR-focused studies have traditionally either concentrated on an individual modality or merged different modalities in a rather rudimentary fashion. This approach often results in the perception of structured and unstructured data as separate entities, neglecting the inherent synergy between them. Specifically, the two important modalities contain clinically relevant, inextricably linked and complementary health information. A more complete picture of a patient's medical history is captured by the joint analysis of the two modalities of data. Despite the great success of multimodal contrastive learning on vision-language, its potential remains under-explored in the realm of multimodal EHR, particularly in terms of its theoretical understanding. To accommodate the statistical analysis of multimodal EHR data, in this paper, we propose a novel multimodal feature embedding generative model and design a multimodal contrastive loss to obtain the multimodal EHR feature representation. Our theoretical analysis demonstrates the effectiveness of multimodal learning compared to single-modality learning and connects the solution of the loss function to the singular value decomposition of a pointwise mutual information matrix. This connection paves the way for a privacy-preserving algorithm tailored for multimodal EHR feature representation learning. Simulation studies show that the proposed algorithm performs well under a variety of configurations. We further validate the clinical utility of the proposed algorithm in real-world EHR data.

Submitted 21 March, 2024; originally announced March 2024.

Comments: 34 pages

arXiv:2403.12859 [pdf, other]

Primal Methods for Variational Inequality Problems with Functional Constraints

Authors: Liang Zhang, Niao He, Michael Muehlebach

Abstract: Constrained variational inequality problems are recognized for their broad applications across various fields including machine learning and operations research. First-order methods have emerged as the standard approach for solving these problems due to their simplicity and scalability. However, they typically rely on projection or linear minimization oracles to navigate the feasible set, which becomes computationally expensive in practical scenarios featuring multiple functional constraints. Existing efforts to tackle such functional constrained variational inequality problems have centered on primal-dual algorithms grounded in the Lagrangian function. These algorithms along with their theoretical analysis often require the existence and prior knowledge of the optimal Lagrange multipliers. In this work, we propose a simple primal method, termed Constrained Gradient Method (CGM), for addressing functional constrained variational inequality problems, without necessitating any information on the optimal Lagrange multipliers. We establish a non-asymptotic convergence analysis of the algorithm for variational inequality problems with monotone operators under smooth constraints. Remarkably, our algorithms match the complexity of projection-based methods in terms of operator queries for both monotone and strongly monotone settings, while utilizing significantly cheaper oracles based on quadratic programming. Furthermore, we provide several numerical examples to evaluate the efficacy of our algorithms.

Submitted 19 March, 2024; originally announced March 2024.

arXiv:2403.09984 [pdf, ps, other]

Repro Samples Method for High-dimensional Logistic Model

Authors: Xiaotian Hou, Linjun Zhang, Peng Wang, Min-ge Xie

Abstract: This paper presents a novel method to make statistical inferences for both the model support and regression coefficients in a high-dimensional logistic regression model. Our method is based on the repro samples framework, in which we conduct statistical inference by generating artificial samples mimicking the actual data-generating process. The proposed method has two major advantages. Firstly, for model support, we introduce the first method for constructing a model confidence set in a high-dimensional setting and the proposed method only requires a weak signal strength assumption. Secondly, in terms of regression coefficients, we establish confidence sets for any group of linear combinations of regression coefficients. Our simulation results demonstrate that the proposed method produces valid and small model confidence sets and achieves better coverage for regression coefficients than the state-of-the-art debiasing methods. Additionally, we analyze single-cell RNA-seq data on the immune response. Besides identifying genes previously proved as relevant in the literature, our method also discovers a significant gene that has not been studied before, revealing a potential new direction in understanding cellular immune response mechanisms.

Submitted 14 March, 2024; originally announced March 2024.

arXiv:2403.05811 [pdf, ps, other]

Near Minimax-Optimal Distributional Temporal Difference Algorithms and The Freedman Inequality in Hilbert Spaces

Authors: Yang Peng, Liangyu Zhang, Zhihua Zhang

Abstract: Distributional reinforcement learning (DRL) has achieved empirical success in various domains. One of the core tasks in the field of DRL is distributional policy evaluation, which involves estimating the return distribution $η^π$ for a given policy $π$. The distributional temporal difference (TD) algorithm has been accordingly proposed, which is an extension of the temporal difference algorithm in the classic RL literature. In the tabular case, Rowland et al. (2018) and Rowland et al. (2023) proved the asymptotic convergence of two instances of distributional TD, namely categorical temporal difference algorithm (CTD) and quantile temporal difference algorithm (QTD), respectively. In this paper, we go a step further and analyze the finite-sample performance of distributional TD. To facilitate theoretical analysis, we propose a non-parametric distributional TD algorithm (NTD). For a $γ$-discounted infinite-horizon tabular Markov decision process, we show that for NTD we need $\tilde{O}\left(\frac{1}{\varepsilon^{2p}(1-γ)^{2p+1}}\right)$ iterations to achieve an $\varepsilon$-optimal estimator with high probability, when the estimation error is measured by the $p$-Wasserstein distance. This sample complexity bound is minimax optimal (up to logarithmic factors) in the case of the $1$-Wasserstein distance. To achieve this, we establish a novel Freedman's inequality in Hilbert spaces, which would be of independent interest. In addition, we revisit CTD, showing that the same non-asymptotic convergence bounds hold for CTD in the case of the $p$-Wasserstein distance.

Submitted 14 March, 2024; v1 submitted 9 March, 2024; originally announced March 2024.

arXiv:2403.05006 [pdf, ps, other]

Provable Multi-Party Reinforcement Learning with Diverse Human Feedback

Authors: Huiying Zhong, Zhun Deng, Weijie J. Su, Zhiwei Steven Wu, Linjun Zhang

Abstract: Reinforcement learning with human feedback (RLHF) is an emerging paradigm to align models with human preferences. Typically, RLHF aggregates preferences from multiple individuals who have diverse viewpoints that may conflict with each other. Our work initiates the theoretical study of multi-party RLHF that explicitly models the diverse preferences of multiple individuals. We show how traditional RLHF approaches can fail since learning a single reward function cannot capture and balance the preferences of multiple individuals. To overcome such limitations, we incorporate meta-learning to learn multiple preferences and adopt different social welfare functions to aggregate the preferences across multiple parties. We focus on the offline learning setting and establish sample complexity bounds, along with efficiency and fairness guarantees, for optimizing diverse social welfare functions such as Nash, Utilitarian, and Leximin welfare functions. Our results show a separation between the sample complexities of multi-party RLHF and traditional single-party RLHF. Furthermore, we consider a reward-free setting, where each individual's preference is no longer consistent with a reward model, and give pessimistic variants of the von Neumann Winner based on offline preference data. Taken together, our work showcases the advantage of multi-party RLHF but also highlights its more demanding statistical complexity.

Submitted 7 March, 2024; originally announced March 2024.

arXiv:2403.03562 [pdf, other]

Efficient Algorithms for Empirical Group Distributional Robust Optimization and Beyond

Authors: Dingzhi Yu, Yunuo Cai, Wei Jiang, Lijun Zhang

Abstract: We investigate the empirical counterpart of group distributionally robust optimization (GDRO), which aims to minimize the maximal empirical risk across $m$ distinct groups. We formulate empirical GDRO as a two-level finite-sum convex-concave minimax optimization problem and develop a stochastic variance reduced mirror prox algorithm. Unlike existing methods, we construct the stochastic gradient by per-group sampling technique and perform variance reduction for all groups, which fully exploits the two-level finite-sum structure of empirical GDRO. Furthermore, we compute the snapshot and mirror snapshot point by a one-index-shifted weighted average, which distinguishes us from the naive ergodic average. Our algorithm also supports non-constant learning rates, which is different from existing literature. We establish convergence guarantees both in expectation and with high probability, demonstrating a complexity of $\mathcal{O}\left(\frac{m\sqrt{\bar{n}\ln{m}}}{\varepsilon}\right)$, where $\bar n$ is the average number of samples among $m$ groups. Remarkably, our approach outperforms the state-of-the-art method by a factor of $\sqrt{m}$. Furthermore, we extend our methodology to deal with the empirical minimax excess risk optimization (MERO) problem and manage to give the expectation bound and the high probability bound, accordingly. The complexity of our empirical MERO algorithm matches that of empirical GDRO at $\mathcal{O}\left(\frac{m\sqrt{\bar{n}\ln{m}}}{\varepsilon}\right)$, significantly surpassing the bounds of existing methods.

Submitted 6 March, 2024; originally announced March 2024.

Comments: 30 pages, 1 figure

arXiv:2402.16158 [pdf, other]

Distribution-Free Fair Federated Learning with Small Samples

Authors: Qichuan Yin, Junzhou Huang, Huaxiu Yao, Linjun Zhang

Abstract: As federated learning gains increasing importance in real-world applications due to its capacity for decentralized data training, addressing fairness concerns across demographic groups becomes critically important. However, most existing machine learning algorithms for ensuring fairness are designed for centralized data environments and generally require large-sample and distributional assumptions, underscoring the urgent need for fairness techniques adapted for decentralized and heterogeneous systems with finite-sample and distribution-free guarantees. To address this issue, this paper introduces FedFaiREE, a post-processing algorithm developed specifically for distribution-free fair learning in decentralized settings with small samples. Our approach accounts for unique challenges in decentralized environments, such as client heterogeneity, communication costs, and small sample sizes. We provide rigorous theoretical guarantees for both fairness and accuracy, and our experimental results further provide robust empirical validation for our proposed method.

Submitted 25 February, 2024; originally announced February 2024.

arXiv:2401.08150 [pdf, other]

Differentially Private Sliced Inverse Regression: Minimax Optimality and Algorithm

Authors: Xintao Xia, Linjun Zhang, Zhanrui Cai

Abstract: Privacy preservation has become a critical concern in high-dimensional data analysis due to the growing prevalence of data-driven applications. Proposed by Li (1991), sliced inverse regression has emerged as a widely utilized statistical technique for reducing covariate dimensionality while maintaining sufficient statistical information. In this paper, we propose optimally differentially private algorithms specifically designed to address privacy concerns in the context of sufficient dimension reduction. We proceed to establish lower bounds for differentially private sliced inverse regression in both the low and high-dimensional settings. Moreover, we develop differentially private algorithms that achieve the minimax lower bounds up to logarithmic factors. Through a combination of simulations and real data analysis, we illustrate the efficacy of these differentially private algorithms in safeguarding privacy while preserving vital information within the reduced dimension space. As a natural extension, we can readily offer analogous lower and upper bounds for differentially private sparse principal component analysis, a topic that may also be of potential interest to the statistical and machine learning community.

Submitted 16 January, 2024; originally announced January 2024.

arXiv:2401.07267 [pdf, other]

Inference for high-dimensional linear expectile regression with de-biased method

Authors: Xiang Li, Yu-Ning Li, Li-Xin Zhang, Jun Zhao

Abstract: In this paper, we address the inference problem in high-dimensional linear expectile regression. We transform the expectile loss into a weighted-least-squares form and apply a de-biased strategy to establish Wald-type tests for multiple constraints within a regularized framework. Simultaneously, we construct an estimator for the pseudo-inverse of the generalized Hessian matrix in high dimension with general amenable regularizers including Lasso and SCAD, and demonstrate its consistency through a new proof technique. We conduct simulation studies and real data applications to demonstrate the efficacy of our proposed test statistic in both homoscedastic and heteroscedastic scenarios.

Submitted 14 January, 2024; originally announced January 2024.

Comments: 34 pages

MSC Class: 62F05; 62F12; 62J12

arXiv:2401.02708 [pdf, other]

TripleSurv: Triplet Time-adaptive Coordinate Loss for Survival Analysis

Authors: Liwen Zhang, Lianzhen Zhong, Fan Yang, Di Dong, Hui Hui, Jie Tian

Abstract: A core challenge in survival analysis is to model the distribution of censored time-to-event data, where the event of interest may be a death, failure, or occurrence of a specific event. Previous studies have shown that ranking and maximum likelihood estimation (MLE) loss functions are widely used for survival analysis. However, ranking loss focuses only on the ranking of survival time and does not consider the potential effect of samples for exact survival time values. Furthermore, the MLE is unbounded and easily subject to outliers (e.g., censored data), which may cause poor performance of modeling. To handle the complexities of the learning process and exploit valuable survival time values, we propose a time-adaptive coordinate loss function, TripleSurv, to achieve adaptive adjustments by introducing the differences in the survival time between sample pairs into the ranking, which can encourage the model to quantitatively rank relative risk of pairs, ultimately enhancing the accuracy of predictions. Most importantly, TripleSurv is proficient in quantifying the relative risk between samples by ranking ordering of pairs, and considers the time interval as a trade-off to calibrate the robustness of the model over sample distribution. Our TripleSurv is evaluated on three real-world survival datasets and a public synthetic dataset. The results show that our method outperforms the state-of-the-art methods and exhibits good model performance and robustness on modeling various sophisticated data distributions with different censor rates. Our code will be available upon acceptance.

Submitted 5 January, 2024; originally announced January 2024.

Comments: 9 pages, 6 figures

arXiv:2312.16004 [pdf, other]

Computing Gerber-Shiu function in the classical risk model with interest using collocation method

Authors: Zan Yu, Lianzeng Zhang

Abstract: The Gerber-Shiu function is a classical research topic in actuarial science. However, exact solutions are only available in the literature for very specific cases where the claim amounts follow distributions such as the exponential distribution. This presents a longstanding challenge, particularly from a computational perspective. For the classical risk process in continuous time, the Gerber-Shiu discounted penalty function satisfies a class of Volterra integral equations. In this paper, we use the collocation method to compute the Gerber-Shiu function for the risk model with interest. Our methodology demonstrates that the function can be expressed as a linear algebraic system, which is straightforward to implement. One major advantage of our approach is that it does not require any specific distributional assumptions on the claim amounts, except for mild differentiability and continuity conditions that can be easily verified. We also examine the convergence orders of the collocation method. Finally, we present several numerical examples to illustrate the desirable performance of our proposed method.

Submitted 26 December, 2023; originally announced December 2023.

Comments: 24 pages

arXiv:2312.14226 [pdf, other]

Deep de Finetti: Recovering Topic Distributions from Large Language Models

Authors: Liyi Zhang, R. Thomas McCoy, Theodore R. Sumers, Jian-Qiao Zhu, Thomas L. Griffiths

Abstract: Large language models (LLMs) can produce long, coherent passages of text, suggesting that LLMs, although trained on next-word prediction, must represent the latent structure that characterizes a document. Prior work has found that internal representations of LLMs encode one aspect of latent structure, namely syntax; here we investigate a complementary aspect, namely the document's topic structure. We motivate the hypothesis that LLMs capture topic structure by connecting LLM optimization to implicit Bayesian inference. De Finetti's theorem shows that exchangeable probability distributions can be represented as a mixture with respect to a latent generating distribution. Although text is not exchangeable at the level of syntax, exchangeability is a reasonable starting assumption for topic structure. We thus hypothesize that predicting the next token in text will lead LLMs to recover latent topic distributions. We examine this hypothesis using Latent Dirichlet Allocation (LDA), an exchangeable probabilistic topic model, as a target, and we show that the representations formed by LLMs encode both the topics used to generate synthetic data and those used to explain natural corpus data.

Submitted 21 December, 2023; originally announced December 2023.

Comments: 13 pages, 4 figures

ACM Class: I.2.6; I.2.7

arXiv:2312.10706 [pdf, other]

Margin-closed regime-switching multivariate time series models

Authors: Lin Zhang, Harry Joe, Natalia Nolde

Abstract: A regime-switching multivariate time series model which is closed under margins is built. The model imposes a restriction on all lower-dimensional sub-processes to follow a regime-switching process sharing the same latent regime sequence and having the same Markov order as the original process. The margin-closed regime-switching model is constructed by considering the multivariate margin-closed Gaussian VAR($k$) dependence as a copula within each regime, and builds dependence between observations in different regimes by requiring the first observation in the new regime to depend on the last observation in the previous regime. The property of closure under margins allows inference on the latent regimes based on lower-dimensional selected sub-processes and estimation of univariate parameters from univariate sub-processes, and enables the use of multi-stage estimation procedure for the model. The parsimonious dependence structure of the model also avoids a large number of parameters under the regime-switching setting. The proposed model is applied to a macroeconomic data set to infer the latent business cycle and compared with the relevant benchmark.

Submitted 17 December, 2023; originally announced December 2023.

arXiv:2312.04610 [pdf]

Data-driven Semi-supervised Machine Learning with Surrogate Safety Measures for Abnormal Driving Behavior Detection

Authors: Yongqi Dong, Lanxin Zhang, Haneen Farah, Arkady Zgonnikov, Bart van Arem

Abstract: Detecting abnormal driving behavior is critical for road traffic safety and the evaluation of drivers' behavior. With the advancement of machine learning (ML) algorithms and the accumulation of naturalistic driving data, many ML models have been adopted for abnormal driving behavior detection. Most existing ML-based detectors rely on (fully) supervised ML methods, which require substantial labeled data. However, ground truth labels are not always available in the real world, and labeling large amounts of data is tedious. Thus, there is a need to explore unsupervised or semi-supervised methods to make the anomaly detection process more feasible and efficient. To fill this research gap, this study analyzes large-scale real-world data revealing several abnormal driving behaviors (e.g., sudden acceleration, rapid lane-changing) and develops a Hierarchical Extreme Learning Machines (HELM) based semi-supervised ML method using partly labeled data to accurately detect the identified abnormal driving behaviors. Moreover, previous ML-based approaches predominantly utilize basic vehicle motion features (such as velocity and acceleration) to label and detect abnormal driving behaviors, while this study seeks to introduce Surrogate Safety Measures (SSMs) as the input features for ML models to improve the detection performance. Results from extensive experiments demonstrate the effectiveness of the proposed semi-supervised ML model with the introduced SSMs serving as important features. The proposed semi-supervised ML method outperforms other baseline semi-supervised or unsupervised methods regarding various metrics, e.g., delivering the best accuracy at 99.58% and the best F-1 measure at 0.9913. The ablation study further highlights the significance of SSMs for advancing detection performance.

Submitted 24 May, 2024; v1 submitted 7 December, 2023; originally announced December 2023.

Comments: 22 pages, 10 figures, accepted by the 103rd Transportation Research Board (TRB) Annual Meeting, under third round review by Transportation Research Record: Journal of the Transportation Research Board

arXiv:2312.02660 [pdf, other]

Uniswap Daily Transaction Indices by Network

Authors: Nir Chemaya, Lin William Cong, Emma Jorgensen, Dingyue Liu, Luyao Zhang

Abstract: DeFi is transforming financial services by removing intermediaries and producing a wealth of open-source data. This transformation is propelled by Layer 2 (L2) solutions, aimed at boosting network efficiency and scalability beyond current Layer 1 (L1) capabilities. This study addresses the lack of detailed L2 impact analysis by examining over 50 million transactions from Uniswap. Our dataset, featuring transactions from L1 and L2 across networks like Ethereum and Polygon, provides daily indices revealing adoption, scalability, and decentralization within the DeFi space. These indices help to elucidate the complex relationship between DeFi and L2 technologies, advancing our understanding of the ecosystem. The dataset is enhanced by an open-source Python framework for computing decentralization indices, adaptable for various research needs. This positions the dataset as a vital resource for machine learning endeavors, particularly deep learning, contributing significantly to the development of Blockchain as Web3's infrastructure.

Submitted 5 December, 2023; originally announced December 2023.

arXiv:2312.00219 [pdf, other]

The Functional Average Treatment Effect

Authors: Shane Sparkes, Erika Garcia, Lu Zhang

Abstract: This paper establishes the functional average as an important estimand for causal inference. The significance of the estimand lies in its robustness against traditional issues of confounding. We prove that this robustness holds even when the probability distribution of the outcome, conditional on treatment or some other vector of adjusting variables, differs almost arbitrarily from its counterfactual analogue. This paper also examines possible estimators of the functional average, including the sample mid-range, and proposes a new type of bootstrap for robust statistical inference: the Hoeffding bootstrap. After this, the paper explores a new class of variables, the $\mathcal{U}$ class of variables, that simplifies the estimation of functional averages. This class of variables is also used to establish mean exchangeability in some cases and to provide the results of elementary statistical procedures, such as linear regression and the analysis of variance, with causal interpretations. Simulation evidence is provided. The methods of this paper are also applied to a National Health and Nutrition Survey data set to investigate the causal effect of exercise on the blood pressure of adult smokers.

Submitted 30 November, 2023; originally announced December 2023.

Comments: 52 pages (40 main document; 12 supplementary), 1 figure

MSC Class: 60E05; 62J99 (Primary); 62G30; 62G32 (Secondary)

arXiv:2311.17476 [pdf, other]

Inference of Sample Complier Average Causal Effects in Completely Randomized Experiments

Authors: Zhen Zhong, Per Johansson, Junni L. Zhang

Abstract: In randomized experiments with non-compliance scholars have argued that the complier average causal effect (CACE) ought to be the main causal estimand. The literature on inference of the complier average treatment effect (CACE) has focused on inference about the population CACE. However, in general individuals in the experiments are volunteers. This means that there is a risk that individuals partaking in a given experiment differ in important ways from a population of interest. It is thus of interest to focus on the sample at hand and have easy to use and correct procedures for inference about the sample CACE. We consider a more general setting than in the previous literature and construct a confidence interval based on the Wald estimator in the form of a finite closed interval that is familiar to practitioners. Furthermore, with the access of pre-treatment covariates, we propose a new regression adjustment estimator and associated methods for constructing confidence intervals. Finite sample performance of the methods is examined through a Monte Carlo simulation and the methods are used in an application to a job training experiment.

Submitted 29 November, 2023; originally announced November 2023.

arXiv:2311.17445 [pdf, ps, other]

Interaction tests with covariate-adaptive randomization

Authors: Likun Zhang, Wei Ma

Abstract: Treatment-covariate interaction tests are commonly applied by researchers to examine whether the treatment effect varies across patient subgroups defined by baseline characteristics. The objective of this study is to explore treatment-covariate interaction tests involving covariate-adaptive randomization. Without assuming a parametric data generating model, we investigate usual interaction tests and observe that they tend to be conservative: specifically, their limiting rejection probabilities under the null hypothesis do not exceed the nominal level and are typically strictly lower than it. To address this problem, we propose modifications to the usual tests to obtain corresponding valid tests. Moreover, we introduce a novel class of stratified-adjusted interaction tests that are simple, more powerful than the usual and modified tests, and broadly applicable to most covariate-adaptive randomization methods. The results are general to encompass two types of interaction tests: one involving stratification covariates and the other involving additional covariates that are not used for randomization. Our study clarifies the application of interaction tests in clinical trials and offers valuable tools for revealing treatment heterogeneity, crucial for advancing personalized medicine.

Submitted 10 March, 2024; v1 submitted 29 November, 2023; originally announced November 2023.

arXiv:2311.14676 [pdf, other]

doi 10.31219/osf.io/bq6tu

Decoding Social Sentiment in DAO: A Comparative Analysis of Blockchain Governance Communities

Authors: Yutong Quan, Xintong Wu, Wanlin Deng, Luyao Zhang

Abstract: Blockchain technology is leading a revolutionary transformation across diverse industries, with effective governance being critical for the success and sustainability of blockchain projects. Community forums, pivotal in engaging decentralized autonomous organizations (DAOs), significantly impact blockchain governance decisions. Concurrently, Natural Language Processing (NLP), particularly sentiment analysis, provides powerful insights from textual data. While prior research has explored the potential of NLP tools in social media sentiment analysis, there is a gap in understanding the sentiment landscape of blockchain governance communities. The evolving discourse and sentiment dynamics on the forums of top DAOs remain largely unknown. This paper delves deep into the evolving discourse and sentiment dynamics on the public forums of leading DeFi projects: Aave, Uniswap, Curve DAO, Yearn.finance, Merit Circle, and Balancer, focusing primarily on discussions related to governance issues. Our study shows that participants in decentralized communities generally express positive sentiments during Discord discussions. Furthermore, there is a potential interaction between discussion intensity and sentiment dynamics; higher discussion volume may contribute to a more stable sentiment from code analysis. The insights gained from this study are valuable for decision-makers in blockchain governance, underscoring the pivotal role of sentiment analysis in interpreting community emotions and its evolving impact on the landscape of blockchain governance. This research significantly contributes to the interdisciplinary exploration of the intersection of blockchain and society, specifically emphasizing the decentralized blockchain governance ecosystem. We provide our data and code for replicability as open access on GitHub.

Submitted 25 May, 2024; v1 submitted 31 October, 2023; originally announced November 2023.

arXiv:2311.11256 [pdf, other]

Bayesian Modeling of Incompatible Spatial Data: A Case Study Involving Post-Adrian Storm Forest Damage Assessment

Authors: Lu Zhang, Andrew O. Finley, Arne Nothdurft, Sudipto Banerjee

Abstract: Incompatible spatial data modeling is a pervasive challenge in remote sensing data analysis that involves field data. Typical approaches to addressing this challenge aggregate information to a coarser common scale, i.e., compatible resolutions. Such pre-processing aggregation to a common resolution simplifies analysis, but potentially causes information loss and hence compromised inference and predictive performance. To incorporate finer information to enhance prediction performance, we develop a new Bayesian method aimed at improving predictive accuracy and uncertainty quantification. The main contribution of this work is an efficient algorithm that enables full Bayesian inference using finer resolution data while optimizing computational and storage costs. The algorithm is developed and applied to a forest damage assessment for the 2018 Adrian storm in Carinthia, Austria, which uses field data and high-resolution LiDAR measurements. Simulation studies demonstrate that this approach substantially improves prediction accuracy and stability, providing more reliable inference to support forest management decisions.

Submitted 19 November, 2023; originally announced November 2023.

Comments: 15 pages, 10 figures

arXiv:2311.10638 [pdf, other]

Concept-free Causal Disentanglement with Variational Graph Auto-Encoder

Authors: Jingyun Feng, Lin Zhang, Lili Yang

Abstract: In disentangled representation learning, the goal is to achieve a compact representation that consists of all interpretable generative factors in the observational data. Learning disentangled representations for graphs becomes increasingly important as graph data rapidly grows. Existing approaches often rely on Variational Auto-Encoder (VAE) or its causal structure learning-based refinement, which suffer from sub-optimality in VAEs due to the independence factor assumption and unavailability of concept labels, respectively. In this paper, we propose an unsupervised solution, dubbed concept-free causal disentanglement, built on a theoretically provable tight upper bound approximating the optimal factor. This results in an SCM-like causal structure modeling that directly learns concept structures from data. Based on this idea, we propose Concept-free Causal VGAE (CCVGAE) by incorporating a novel causal disentanglement layer into Variational Graph Auto-Encoder. Furthermore, we prove concept consistency under our concept-free causal disentanglement framework, hence employing it to enhance the meta-learning framework, called concept-free causal Meta-Graph (CC-Meta-Graph). We conduct extensive experiments to demonstrate the superiority of the proposed models: CCVGAE and CC-Meta-Graph, reaching up to $29\%$ and $11\%$ absolute improvements over baselines in terms of AUC, respectively.

Submitted 17 November, 2023; originally announced November 2023.

arXiv:2311.08434 [pdf, other]

Uplift Modeling based on Graph Neural Network Combined with Causal Knowledge

Authors: Haowen Wang, Xinyan Ye, Yangze Zhou, Zhiyi Zhang, Longhan Zhang, Jing Jiang

Abstract: Uplift modeling is a fundamental component of marketing effect modeling, which is commonly employed to evaluate the effects of treatments on outcomes. Through uplift modeling, we can identify the treatment with the greatest benefit. On the other side, we can identify clients who are likely to make favorable decisions in response to a certain treatment. In the past, uplift modeling approaches relied heavily on the difference-in-difference (DID) architecture, paired with a machine learning model as the estimation learner, while neglecting the link and confidential information between features. We proposed a framework based on graph neural networks that combine causal knowledge with an estimate of uplift value. Firstly, we presented a causal representation technique based on CATE (conditional average treatment effect) estimation and adjacency matrix structure learning. Secondly, we suggested a more scalable uplift modeling framework based on graph convolution networks for combining causal knowledge. Our findings demonstrate that this method works effectively for predicting uplift values, with small errors in typical simulated data, and its effectiveness has been verified in actual industry marketing data.

Submitted 14 November, 2023; originally announced November 2023.

Comments: 6 pages, 6 figures

arXiv:2310.17759 [pdf, other]

Optimal Guarantees for Algorithmic Reproducibility and Gradient Complexity in Convex Optimization

Authors: Liang Zhang, Junchi Yang, Amin Karbasi, Niao He

Abstract: Algorithmic reproducibility measures the deviation in outputs of machine learning algorithms upon minor changes in the training process. Previous work suggests that first-order methods would need to trade-off convergence rate (gradient complexity) for better reproducibility. In this work, we challenge this perception and demonstrate that both optimal reproducibility and near-optimal convergence guarantees can be achieved for smooth convex minimization and smooth convex-concave minimax problems under various error-prone oracle settings. Particularly, given the inexact initialization oracle, our regularization-based algorithms achieve the best of both worlds - optimal reproducibility and near-optimal gradient complexity - for minimization and minimax optimization. With the inexact gradient oracle, the near-optimal guarantees also hold for minimax optimization. Additionally, with the stochastic gradient oracle, we show that stochastic gradient descent ascent is optimal in terms of both reproducibility and gradient complexity. We believe our results contribute to an enhanced understanding of the reproducibility-convergence trade-off in the context of convex optimization.

Submitted 9 January, 2024; v1 submitted 26 October, 2023; originally announced October 2023.

Comments: NeurIPS 2023 Spotlight

arXiv:2310.16260 [pdf, other]

Private Estimation and Inference in High-Dimensional Regression with FDR Control

Authors: Zhanrui Cai, Sai Li, Xintao Xia, Linjun Zhang

Abstract: This paper presents novel methodologies for conducting practical differentially private (DP) estimation and inference in high-dimensional linear regression. We start by proposing a differentially private Bayesian Information Criterion (BIC) for selecting the unknown sparsity parameter in DP-Lasso, eliminating the need for prior knowledge of model sparsity, a requisite in the existing literature. Then we propose a differentially private debiased LASSO algorithm that enables privacy-preserving inference on regression parameters. Our proposed method enables accurate and private inference on the regression parameters by leveraging the inherent sparsity of high-dimensional linear regression models. Additionally, we address the issue of multiple testing in high-dimensional linear regression by introducing a differentially private multiple testing procedure that controls the false discovery rate (FDR). This allows for accurate and privacy-preserving identification of significant predictors in the regression model. Through extensive simulations and real data analysis, we demonstrate the efficacy of our proposed methods in conducting inference for high-dimensional linear models while safeguarding privacy and controlling the FDR.

Submitted 24 October, 2023; originally announced October 2023.

arXiv:2310.15454 [pdf, other]

Private Learning with Public Features

Authors: Walid Krichene, Nicolas Mayoraz, Steffen Rendle, Shuang Song, Abhradeep Thakurta, Li Zhang

Abstract: We study a class of private learning problems in which the data is a join of private and public features. This is often the case in private personalization tasks such as recommendation or ad prediction, in which features related to individuals are sensitive, while features related to items (the movies or songs to be recommended, or the ads to be shown to users) are publicly available and do not require protection. A natural question is whether private algorithms can achieve higher utility in the presence of public features. We give a positive answer for multi-encoder models where one of the encoders operates on public features. We develop new algorithms that take advantage of this separation by only protecting certain sufficient statistics (instead of adding noise to the gradient). This method has a guaranteed utility improvement for linear regression, and importantly, achieves the state of the art on two standard private recommendation benchmarks, demonstrating the importance of methods that adapt to the private-public feature separation.

Submitted 23 October, 2023; originally announced October 2023.

arXiv:2310.09639 [pdf, other]

DPZero: Private Fine-Tuning of Language Models without Backpropagation

Authors: Liang Zhang, Bingcong Li, Kiran Koshy Thekumparampil, Sewoong Oh, Niao He

Abstract: The widespread practice of fine-tuning large language models (LLMs) on domain-specific data faces two major challenges in memory and privacy. First, as the size of LLMs continues to grow, the memory demands of gradient-based training methods via backpropagation become prohibitively high. Second, given the tendency of LLMs to memorize training data, it is important to protect potentially sensitive information in the fine-tuning data from being regurgitated. Zeroth-order methods, which rely solely on forward passes, substantially reduce memory consumption during training. However, directly combining them with standard differentially private gradient descent suffers more as model size grows. To bridge this gap, we introduce DPZero, a novel private zeroth-order algorithm with nearly dimension-independent rates. The memory efficiency of DPZero is demonstrated in privately fine-tuning RoBERTa and OPT on several downstream tasks. Our code is available at https://github.com/Liang137/DPZero.

Submitted 6 June, 2024; v1 submitted 14 October, 2023; originally announced October 2023.

Comments: ICML 2024
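
Note: the abstract above says zeroth-order methods rely solely on forward passes. As a rough illustration of that general idea only, the sketch below shows a generic two-point zeroth-order gradient estimate along a random Gaussian direction. It is not the DPZero algorithm and includes no privacy mechanism; the function name `two_point_grad_estimate`, the smoothing parameter `mu`, and the toy objective are all assumptions made for this example.

```python
import random

def two_point_grad_estimate(f, x, mu=1e-3, rng=None):
    """Generic two-point zeroth-order gradient estimate (illustrative only).

    Evaluates f at x + mu*u and x - mu*u for a random Gaussian direction u,
    then scales u by the resulting finite-difference slope. Only function
    evaluations (forward passes) are used; no derivative of f is required.
    Returns the estimate and the direction u that was sampled.
    """
    rng = rng or random.Random(0)  # fixed seed by default, for repeatability
    u = [rng.gauss(0.0, 1.0) for _ in x]
    x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
    x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
    slope = (f(x_plus) - f(x_minus)) / (2.0 * mu)  # directional derivative estimate
    return [slope * ui for ui in u], u

# Toy usage with a hypothetical linear objective f(x) = a . x
a = [1.0, -2.0, 0.5]
f = lambda x: sum(ai * xi for ai, xi in zip(a, x))
g, u = two_point_grad_estimate(f, [0.0, 0.0, 0.0])
```

For a linear objective the finite-difference slope equals the inner product of `a` and `u`, so the estimate is that inner product times `u`; averaging over many directions recovers `a` in expectation.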

arXiv:2310.02507 [pdf, other]

Inference of Sample Complier Average Causal Effects under Experiments with Completely Randomized Design and Computer Assisted Balance-Improving Designs

Authors: Zhen Zhong, Per Johansson, Junni L. Zhang

Abstract: Non-compliance is common in real world experiments. We focus on inference about the sample complier average causal effect, that is, the average treatment effect for experimental units who are compliers. We present three types of inference strategies for the sample complier average causal effect: the Wald estimator, regression adjustment estimators and model-based Bayesian inference. Because modern computer assisted experimental designs have been used to improve covariate balance over complete randomization, we discuss inference under both complete randomization and a specific computer assisted experimental design - Mahalanobis distance based rerandomization, under which asymptotic properties of the Wald estimator and regression adjustment estimators can be derived. We use Monte Carlo simulation to compare the finite sample performance of the methods under both experimental designs. We find that under either design, the Bayesian method performs the best because it is stable, it yields smallest median absolute error and smallest median interval length. The improvement by the Bayesian method is especially large when the fraction of compliers is small. We present an application to a job training experiment with non-compliance.

Submitted 3 October, 2023; originally announced October 2023.

Comments: 42 pages, 2 figures
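
Note: the Wald estimator named in the abstract above is conventionally the ratio of the intention-to-treat effect on the outcome to the intention-to-treat effect on treatment receipt. The sketch below computes that textbook point estimate on toy data. It does not reproduce the paper's confidence intervals, regression adjustment, or Bayesian procedure, and the function name `wald_cace` and the data are hypothetical.

```python
from statistics import mean

def wald_cace(z, d, y):
    """Textbook Wald (ratio) point estimate of the complier average causal effect.

    z: assignment indicators (1 = assigned to treatment, 0 = control)
    d: treatment actually received (1/0)
    y: observed outcomes
    """
    y1 = [yi for zi, yi in zip(z, y) if zi == 1]
    y0 = [yi for zi, yi in zip(z, y) if zi == 0]
    d1 = [di for zi, di in zip(z, d) if zi == 1]
    d0 = [di for zi, di in zip(z, d) if zi == 0]
    itt_y = mean(y1) - mean(y0)  # effect of assignment on the outcome
    itt_d = mean(d1) - mean(d0)  # effect of assignment on treatment receipt
    if itt_d == 0:
        raise ValueError("assignment does not change treatment receipt")
    return itt_y / itt_d

# Toy data: half of the assigned units comply, no control unit takes treatment.
z = [1, 1, 1, 1, 0, 0, 0, 0]
d = [1, 1, 0, 0, 0, 0, 0, 0]
y = [3, 3, 1, 1, 1, 1, 1, 1]
estimate = wald_cace(z, d, y)
```

In this toy data the outcome difference between arms is 1 and the receipt difference is 0.5, so the ratio is 2. A small receipt difference makes the ratio unstable, which is consistent with the abstract's remark that the choice of method matters most when the fraction of compliers is small.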

arXiv:2309.17262 [pdf, other]

Estimation and Inference in Distributional Reinforcement Learning

Authors: Liangyu Zhang, Yang Peng, Jiadong Liang, Wenhao Yang, Zhihua Zhang

Abstract: In this paper, we study distributional reinforcement learning from the perspective of statistical efficiency. We investigate distributional policy evaluation, aiming to estimate the complete distribution of the random return (denoted $η^π$) attained by a given policy $π$. We use the certainty-equivalence method to construct our estimator $\hat{η}^π$, given a generative model is available. We show that in this circumstance we need a dataset of size $\widetilde O\left(\frac{|\mathcal{S}||\mathcal{A}|}{ε^{2p}(1-γ)^{2p+2}}\right)$ to guarantee a $p$-Wasserstein metric between $\hat{η}^π$ and $η^π$ is less than $ε$ with high probability. This implies the distributional policy evaluation problem can be solved with sample efficiency. Also, we show that under different mild assumptions a dataset of size $\widetilde O\left(\frac{|\mathcal{S}||\mathcal{A}|}{ε^{2}(1-γ)^{4}}\right)$ suffices to ensure the Kolmogorov metric and total variation metric between $\hat{η}^π$ and $η^π$ is below $ε$ with high probability. Furthermore, we investigate the asymptotic behavior of $\hat{η}^π$. We demonstrate that the ``empirical process'' $\sqrt{n}(\hat{η}^π-η^π)$ converges weakly to a Gaussian process in the space of bounded functionals on Lipschitz function class $\ell^\infty(\mathcal{F}_{W_1})$, also in the space of bounded functionals on indicator function class $\ell^\infty(\mathcal{F}_{\mathrm{KS}})$ and bounded measurable function class $\ell^\infty(\mathcal{F}_{\mathrm{TV}})$ when some mild conditions hold. Our findings give rise to a unified approach to statistical inference of a wide class of statistical functionals of $η^π$.

Submitted 29 September, 2023; originally announced September 2023.

arXiv:2309.09555 [pdf, other]

Multi-dimensional domain generalization with low-rank structures

Authors: Sai Li, Linjun Zhang

Abstract: In conventional statistical and machine learning methods, it is typically assumed that the test data are identically distributed with the training data. However, this assumption does not always hold, especially in applications where the target population are not well-represented in the training data. This is a notable issue in health-related studies, where specific ethnic populations may be underrepresented, posing a significant challenge for researchers aiming to make statistical inferences about these minority groups. In this work, we present a novel approach to addressing this challenge in linear regression models. We organize the model parameters for all the sub-populations into a tensor. By studying a structured tensor completion problem, we can achieve robust domain generalization, i.e., learning about sub-populations with limited or no available data. Our method novelly leverages the structure of group labels and it can produce more reliable and interpretable generalization results. We establish rigorous theoretical guarantees for the proposed method and demonstrate its minimax optimality. To validate the effectiveness of our approach, we conduct extensive numerical experiments and a real data study focused on education level prediction for multiple ethnic groups, comparing our results with those obtained using other existing methods.

Submitted 18 September, 2023; originally announced September 2023.

Showing 1–50 of 397 results for author: Zhang, L