Search | arXiv e-print repository

arXiv:2403.12624 [pdf, other]

Large-scale metric objects filtering for binary classification with application to abnormal brain connectivity detection

Authors: Shuaida He, Jiaqi Li, Xin Chen

Abstract: The classification of random objects within metric spaces without a vector structure has attracted increasing attention. However, the complexity inherent in such non-Euclidean data often restricts existing models to handle only a limited number of features, leaving a gap in real-world applications. To address this, we propose a data-adaptive filtering procedure to identify informative features fro… ▽ More The classification of random objects within metric spaces without a vector structure has attracted increasing attention. However, the complexity inherent in such non-Euclidean data often restricts existing models to handle only a limited number of features, leaving a gap in real-world applications. To address this, we propose a data-adaptive filtering procedure to identify informative features from large-scale random objects, leveraging a novel Kolmogorov-Smirnov-type statistic defined on the metric space. Our method, applicable to data in general metric spaces with binary labels, exhibits remarkable flexibility. It enjoys a model-free property, as its implementation does not rely on any specified classifier. Theoretically, it controls the false discovery rate while guaranteeing the sure screening property. Empirically, equipped with a Wasserstein metric, it demonstrates superior sample performance compared to Euclidean competitors. When applied to analyze a dataset on autism, our method identifies significant brain regions associated with the condition. Moreover, it reveals distinct interaction patterns among these regions between individuals with and without autism, achieved by filtering hundreds of thousands of covariance matrices representing various brain connectivities. △ Less

Submitted 19 March, 2024; originally announced March 2024.

arXiv:2402.05438 [pdf, other]

Penalized spline estimation of principal components for sparse functional data: rates of convergence

Authors: Shiyuan He, Jianhua Z. Huang, Kejun He

Abstract: This paper gives a comprehensive treatment of the convergence rates of penalized spline estimators for simultaneously estimating several leading principal component functions, when the functional data is sparsely observed. The penalized spline estimators are defined as the solution of a penalized empirical risk minimization problem, where the loss function belongs to a general class of loss functi… ▽ More This paper gives a comprehensive treatment of the convergence rates of penalized spline estimators for simultaneously estimating several leading principal component functions, when the functional data is sparsely observed. The penalized spline estimators are defined as the solution of a penalized empirical risk minimization problem, where the loss function belongs to a general class of loss functions motivated by the matrix Bregman divergence, and the penalty term is the integrated squared derivative. The theory reveals that the asymptotic behavior of penalized spline estimators depends on the interesting interplay between several factors, i.e., the smoothness of the unknown functions, the spline degree, the spline knot number, the penalty order, and the penalty parameter. The theory also classifies the asymptotic behavior into seven scenarios and characterizes whether and how the minimax optimal rates of convergence are achievable in each scenario. △ Less

Submitted 8 February, 2024; originally announced February 2024.

arXiv:2401.04723 [pdf, other]

Spatio-temporal data fusion for the analysis of in situ and remote sensing data using the INLA-SPDE approach

Authors: Shiyu He, Samuel W. K. Wong

Abstract: We propose a Bayesian hierarchical model to address the challenge of spatial misalignment in spatio-temporal data obtained from in situ and satellite sources. The model is fit using the INLA-SPDE approach, which provides efficient computation. Our methodology combines the different data sources in a "fusion"" model via the construction of projection matrices in both spatial and temporal domains. T… ▽ More We propose a Bayesian hierarchical model to address the challenge of spatial misalignment in spatio-temporal data obtained from in situ and satellite sources. The model is fit using the INLA-SPDE approach, which provides efficient computation. Our methodology combines the different data sources in a "fusion"" model via the construction of projection matrices in both spatial and temporal domains. Through simulation studies, we demonstrate that the fusion model has superior performance in prediction accuracy across space and time compared to standalone "in situ" and "satellite" models based on only in situ or satellite data, respectively. The fusion model also generally outperforms the standalone models in terms of parameter inference. Such a modeling approach is motivated by environmental problems, and our specific focus is on the analysis and prediction of harmful algae bloom (HAB) events, where the convention is to conduct separate analyses based on either in situ samples or satellite images. A real data analysis shows that the proposed model is a necessary step towards a unified characterization of bloom dynamics and identifying the key drivers of HAB events. △ Less

Submitted 9 January, 2024; originally announced January 2024.

Comments: 23 pages, 7 figures

arXiv:2401.00667 [pdf, other]

Channelling Multimodality Through a Unimodalizing Transport: Warp-U Sampler and Stochastic Bridge Sampling

Authors: Fei Ding, David E. Jones, Shiyuan He, Xiao-Li Meng

Abstract: Monte Carlo integration is fundamental in scientific and statistical computation, but requires reliable samples from the target distribution, which poses a substantial challenge in the case of multi-modal distributions. Existing methods often involve time-consuming tuning, and typically lack tailored estimators for efficient use of the samples. This paper adapts the Warp-U transformation [Wang et… ▽ More Monte Carlo integration is fundamental in scientific and statistical computation, but requires reliable samples from the target distribution, which poses a substantial challenge in the case of multi-modal distributions. Existing methods often involve time-consuming tuning, and typically lack tailored estimators for efficient use of the samples. This paper adapts the Warp-U transformation [Wang et al., 2022] to form multi-modal sampling strategy called Warp-U sampling. It constructs a stochastic map to transport a multi-modal density into a uni-modal one, and subsequently inverts the transport but with new stochasticity injected. For efficient use of the samples for normalising constant estimation, we propose (i) an unbiased estimation scheme based coupled chains, where the Warp-U sampling is used to reduce the coupling time; and (ii) a stochastic Warp-U bridge sampling estimator, which improves its deterministic counterpart given in Wang et al. [2022]. Our overall approach requires less tuning and is easier to apply than common alternatives. Theoretically, we establish the ergodicity of our sampling algorithm and that our stochastic Warp-U bridge sampling estimator has greater (asymptotic) precision per CPU second compared to the Warp-U bridge estimator of Wang et al. [2022] under practical conditions. The advantages and current limitations of our approach are demonstrated through simulation studies and an application to exoplanet detection. △ Less

Submitted 1 January, 2024; originally announced January 2024.

arXiv:2312.13875 [pdf, other]

Best Arm Identification in Batched Multi-armed Bandit Problems

Authors: Shengyu Cao, Simai He, Ruoqing Jiang, Jin Xu, Hongsong Yuan

Abstract: Recently multi-armed bandit problem arises in many real-life scenarios where arms must be sampled in batches, due to limited time the agent can wait for the feedback. Such applications include biological experimentation and online marketing. The problem is further complicated when the number of arms is large and the number of batches is small. We consider pure exploration in a batched multi-armed… ▽ More Recently multi-armed bandit problem arises in many real-life scenarios where arms must be sampled in batches, due to limited time the agent can wait for the feedback. Such applications include biological experimentation and online marketing. The problem is further complicated when the number of arms is large and the number of batches is small. We consider pure exploration in a batched multi-armed bandit problem. We introduce a general linear programming framework that can incorporate objectives of different theoretical settings in best arm identification. The linear program leads to a two-stage algorithm that can achieve good theoretical properties. We demonstrate by numerical studies that the algorithm also has good performance compared to certain UCB-type or Thompson sampling methods. △ Less

Submitted 21 December, 2023; originally announced December 2023.

arXiv:2311.03497 [pdf, other]

Understanding the Impact of Seasonal Climate Change on Canada's Economy by Region and Sector

Authors: Shiyu He, Trang Bui, Yuying Huang, Wenling Zhang, Jie Jian, Samuel W. K. Wong, Tony S. Wirjanto

Abstract: To assess the impact of climate change on the Canadian economy, we investigate and model the relationship between seasonal climate variables and economic growth across provinces and economic sectors. We further provide projections of climate change impacts up to the year 2050, taking into account the diverse climate change patterns and economic conditions across Canada. Our results indicate that r… ▽ More To assess the impact of climate change on the Canadian economy, we investigate and model the relationship between seasonal climate variables and economic growth across provinces and economic sectors. We further provide projections of climate change impacts up to the year 2050, taking into account the diverse climate change patterns and economic conditions across Canada. Our results indicate that rising Fall temperature anomalies have a notable adverse impact on Canadian economic growth. Province-wide, Saskatchewan and Manitoba are anticipated to experience the most substantial declines, whereas British Columbia and the Maritime provinces will be less impacted. Industry-wide, Mining is projected to see the greatest benefits, while Agriculture and Manufacturing are projected to have the most significant downturns. The disparities of climate change effects between provinces and industries highlight the need for governments to tailor their policies accordingly, and offer targeted assistance to regions and industries that are particularly vulnerable in the face of climate change. Targeted approaches to climate change mitigation are likely to be more effective than one-size-fits-all policies for the whole economy. △ Less

Submitted 6 November, 2023; originally announced November 2023.

Comments: 25 pages, 7 figures

arXiv:2306.06324 [pdf, other]

Differentially private sliced inverse regression in the federated paradigm

Authors: Shuaida He, Jiarui Zhang, Xin Chen

Abstract: Sliced inverse regression (SIR), which includes linear discriminant analysis (LDA) as a special case, is a popular and powerful dimension reduction tool. In this article, we extend SIR to address the challenges of decentralized data, prioritizing privacy and communication efficiency. Our approach, named as federated sliced inverse regression (FSIR), facilitates collaborative estimation of the suff… ▽ More Sliced inverse regression (SIR), which includes linear discriminant analysis (LDA) as a special case, is a popular and powerful dimension reduction tool. In this article, we extend SIR to address the challenges of decentralized data, prioritizing privacy and communication efficiency. Our approach, named as federated sliced inverse regression (FSIR), facilitates collaborative estimation of the sufficient dimension reduction subspace among multiple clients, solely sharing local estimates to protect sensitive datasets from exposure. To guard against potential adversary attacks, FSIR further employs diverse perturbation strategies, including a novel vectorized Gaussian mechanism that guarantees differential privacy at a low cost of statistical accuracy. Additionally, FSIR naturally incorporates a collaborative variable screening step, enabling effective handling of high-dimensional client data. Theoretical properties of FSIR are established for both low-dimensional and high-dimensional settings, supported by extensive numerical experiments and real data analysis. △ Less

Submitted 10 August, 2023; v1 submitted 9 June, 2023; originally announced June 2023.

arXiv:2302.08439 [pdf, other]

doi 10.1080/00401706.2023.2197471

Bayesian Nonlinear Tensor Regression with Functional Fused Elastic Net Prior

Authors: Shuoli Chen, Kejun He, Shiyuan He, Yang Ni, Raymond K. W. Wong

Abstract: Tensor regression methods have been widely used to predict a scalar response from covariates in the form of a multiway array. In many applications, the regions of tensor covariates used for prediction are often spatially connected with unknown shapes and discontinuous jumps on the boundaries. Moreover, the relationship between the response and the tensor covariates can be nonlinear. In this articl… ▽ More Tensor regression methods have been widely used to predict a scalar response from covariates in the form of a multiway array. In many applications, the regions of tensor covariates used for prediction are often spatially connected with unknown shapes and discontinuous jumps on the boundaries. Moreover, the relationship between the response and the tensor covariates can be nonlinear. In this article, we develop a nonlinear Bayesian tensor additive regression model to accommodate such spatial structure. A functional fused elastic net prior is proposed over the additive component functions to comprehensively model the nonlinearity and spatial smoothness, detect the discontinuous jumps, and simultaneously identify the active regions. The great flexibility and interpretability of the proposed method against the alternatives are demonstrated by a simulation study and an analysis on facial feature data. △ Less

Submitted 16 February, 2023; originally announced February 2023.

Journal ref: Technometrics, 65:4, 524-536 (2023)

arXiv:2211.04874 [pdf, other]

A Unified Analysis of Multi-task Functional Linear Regression Models with Manifold Constraint and Composite Quadratic Penalty

Authors: Shiyuan He, Hanxuan Ye, Kejun He

Abstract: This work studies the multi-task functional linear regression models where both the covariates and the unknown regression coefficients (called slope functions) are curves. For slope function estimation, we employ penalized splines to balance bias, variance, and computational complexity. The power of multi-task learning is brought in by imposing additional structures over the slope functions. We pr… ▽ More This work studies the multi-task functional linear regression models where both the covariates and the unknown regression coefficients (called slope functions) are curves. For slope function estimation, we employ penalized splines to balance bias, variance, and computational complexity. The power of multi-task learning is brought in by imposing additional structures over the slope functions. We propose a general model with double regularization over the spline coefficient matrix: i) a matrix manifold constraint, and ii) a composite penalty as a summation of quadratic terms. Many multi-task learning approaches can be treated as special cases of this proposed model, such as a reduced-rank model and a graph Laplacian regularized model. We show the composite penalty induces a specific norm, which helps to quantify the manifold curvature and determine the corresponding proper subset in the manifold tangent space. The complexity of tangent space subset is then bridged to the complexity of geodesic neighbor via generic chaining. A unified convergence upper bound is obtained and specifically applied to the reduced-rank model and the graph Laplacian regularized model. The phase transition behaviors for the estimators are examined as we vary the configurations of model parameters. △ Less

Submitted 31 July, 2023; v1 submitted 9 November, 2022; originally announced November 2022.

arXiv:2211.04784 [pdf, other]

Spline Estimation of Functional Principal Components via Manifold Conjugate Gradient Algorithm

Authors: Shiyuan He, Hanxuan Ye, Kejun He

Abstract: Functional principal component analysis has become the most important dimension reduction technique in functional data analysis. Based on B-spline approximation, functional principal components (FPCs) can be efficiently estimated by the expectation-maximization (EM) and the geometric restricted maximum likelihood (REML) algorithms under the strong assumption of Gaussianity on the principal compone… ▽ More Functional principal component analysis has become the most important dimension reduction technique in functional data analysis. Based on B-spline approximation, functional principal components (FPCs) can be efficiently estimated by the expectation-maximization (EM) and the geometric restricted maximum likelihood (REML) algorithms under the strong assumption of Gaussianity on the principal component scores and observational errors. When computing the solution, the EM algorithm does not exploit the underlying geometric manifold structure, while the performance of REML is known to be unstable. In this article, we propose a conjugate gradient algorithm over the product manifold to estimate FPCs. This algorithm exploits the manifold geometry structure of the overall parameter space, thus improving its search efficiency and estimation accuracy. In addition, a distribution-free interpretation of the loss function is provided from the viewpoint of matrix Bregman divergence, which explains why the proposed method works well under general distribution settings. We also show that a roughness penalization can be easily incorporated into our algorithm with a potentially better fit. The appealing numerical performance of the proposed method is demonstrated by simulation studies and the analysis of a Type Ia supernova light curve dataset. △ Less

Submitted 9 November, 2022; originally announced November 2022.

arXiv:2210.10887 [pdf, other]

doi 10.1109/IROS45743.2020.9341481

Data-Driven Distributionally Robust Electric Vehicle Balancing for Mobility-on-Demand Systems under Demand and Supply Uncertainties

Authors: Sihong He, Lynn Pepin, Guang Wang, Desheng Zhang, Fei Miao

Abstract: As electric vehicle (EV) technologies become mature, EV has been rapidly adopted in modern transportation systems, and is expected to provide future autonomous mobility-on-demand (AMoD) service with economic and societal benefits. However, EVs require frequent recharges due to their limited and unpredictable cruising ranges, and they have to be managed efficiently given the dynamic charging proces… ▽ More As electric vehicle (EV) technologies become mature, EV has been rapidly adopted in modern transportation systems, and is expected to provide future autonomous mobility-on-demand (AMoD) service with economic and societal benefits. However, EVs require frequent recharges due to their limited and unpredictable cruising ranges, and they have to be managed efficiently given the dynamic charging process. It is urgent and challenging to investigate a computationally efficient algorithm that provide EV AMoD system performance guarantees under model uncertainties, instead of using heuristic demand or charging models. To accomplish this goal, this work designs a data-driven distributionally robust optimization approach for vehicle supply-demand ratio and charging station utilization balancing, while minimizing the worst-case expected cost considering both passenger mobility demand uncertainties and EV supply uncertainties. We then derive an equivalent computationally tractable form for solving the distributionally robust problem in a computationally efficient way under ellipsoid uncertainty sets constructed from data. Based on E-taxi system data of Shenzhen city, we show that the average total balancing cost is reduced by 14.49%, the average unfairness of supply-demand ratio and utilization is reduced by 15.78% and 34.51% respectively with the distributionally robust vehicle balancing method, compared with solutions which do not consider model uncertainties. △ Less

Submitted 19 October, 2022; originally announced October 2022.

Comments: This paper has been published in IROS2020

arXiv:2206.05891 [pdf, other]

Anchor Sampling for Federated Learning with Partial Client Participation

Authors: Feijie Wu, Song Guo, Zhihao Qu, Shiqi He, Ziming Liu, Jing Gao

Abstract: Compared with full client participation, partial client participation is a more practical scenario in federated learning, but it may amplify some challenges in federated learning, such as data heterogeneity. The lack of inactive clients' updates in partial client participation makes it more likely for the model aggregation to deviate from the aggregation based on full client participation. Trainin… ▽ More Compared with full client participation, partial client participation is a more practical scenario in federated learning, but it may amplify some challenges in federated learning, such as data heterogeneity. The lack of inactive clients' updates in partial client participation makes it more likely for the model aggregation to deviate from the aggregation based on full client participation. Training with large batches on individual clients is proposed to address data heterogeneity in general, but their effectiveness under partial client participation is not clear. Motivated by these challenges, we propose to develop a novel federated learning framework, referred to as FedAMD, for partial client participation. The core idea is anchor sampling, which separates partial participants into anchor and miner groups. Each client in the anchor group aims at the local bullseye with the gradient computation using a large batch. Guided by the bullseyes, clients in the miner group steer multiple near-optimal local updates using small batches and update the global model. By integrating the results of the two groups, FedAMD is able to accelerate the training process and improve the model performance. Measured by $ε$-approximation and compared to the state-of-the-art methods, FedAMD achieves the convergence by up to $O(1/ε)$ fewer communication rounds under non-convex objectives. Empirical studies on real-world datasets validate the effectiveness of FedAMD and demonstrate the superiority of the proposed algorithm: Not only does it considerably save computation and communication costs, but also the test accuracy significantly improves. △ Less

Submitted 28 May, 2023; v1 submitted 12 June, 2022; originally announced June 2022.

Comments: ICML 2023

arXiv:2111.08269 [pdf, other]

Data-Driven Inpatient Bed Assignment Using the P Model

Authors: Shasha Han, Shuangchi He, Hong Choon Oh

Abstract: Problem definition: Emergency department (ED) boarding refers to the practice of holding patients in the ED after they have been admitted to hospital wards, usually resulting from insufficient inpatient resources. Boarded patients may compete with new patients for medical resources in the ED, compromising the quality of emergency care. A common expedient for mitigating boarding is patient overflow… ▽ More Problem definition: Emergency department (ED) boarding refers to the practice of holding patients in the ED after they have been admitted to hospital wards, usually resulting from insufficient inpatient resources. Boarded patients may compete with new patients for medical resources in the ED, compromising the quality of emergency care. A common expedient for mitigating boarding is patient overflowing, i.e., sending patients to beds in other specialties or accommodation classes, which may compromise the quality of inpatient care and bring on operational challenges. We study inpatient bed assignment to shorten boarding times without excessive patient overflowing. Methodology: We use a queue with multiple customer classes and multiple server pools to model hospital wards. Exploiting patient flow data from a hospital, we propose a computationally tractable approach to formulating the bed assignment problem, where the joint probability of all waiting patients meeting their respective delay targets is maximized. Results: By dynamically adjusting the overflow rate, the proposed approach is capable not only of reducing patients' waiting times, but also of mitigating the time-of-day effect on boarding times. In numerical experiments, our approach greatly outperforms both early discharge policies and threshold-based overflowing policies, which are commonly used in practice. Managerial implications: We provide a practicable approach to solving the bed assignment problem. This data-driven approach captures critical features of patient flow management, while the resulting optimization problem is practically solvable. The proposed approach is a useful tool for the control of queueing systems with time-sensitive service requirements. △ Less

Submitted 16 November, 2021; originally announced November 2021.

arXiv:2111.06859 [pdf, other]

Higher-Order Coverage Errors of Batching Methods via Edgeworth Expansions on $t$-Statistics

Authors: Shengyi He, Henry Lam

Abstract: While batching methods have been widely used in simulation and statistics, it is open regarding their higher-order coverage behaviors and whether one variant is better than the others in this regard. We develop techniques to obtain higher-order coverage errors for batching methods by building Edgeworth-type expansions on $t$-statistics. The coefficients in these expansions are intricate analytical… ▽ More While batching methods have been widely used in simulation and statistics, it is open regarding their higher-order coverage behaviors and whether one variant is better than the others in this regard. We develop techniques to obtain higher-order coverage errors for batching methods by building Edgeworth-type expansions on $t$-statistics. The coefficients in these expansions are intricate analytically, but we provide algorithms to estimate the coefficients of the $n^{-1}$ error term via Monte Carlo simulation. We provide insights on the effect of the number of batches on the coverage error where we demonstrate generally non-monotonic relations. We also compare different batching methods both theoretically and numerically, and argue that none of the methods is uniformly better than the others in terms of coverage. However, when the number of batches is large, sectioned jackknife has the best coverage among all. △ Less

Submitted 12 November, 2021; originally announced November 2021.

arXiv:2111.02204 [pdf, other]

Certifiable Deep Importance Sampling for Rare-Event Simulation of Black-Box Systems

Authors: Mansur Arief, Yuanlu Bai, Wenhao Ding, Shengyi He, Zhiyuan Huang, Henry Lam, Ding Zhao

Abstract: Rare-event simulation techniques, such as importance sampling (IS), constitute powerful tools to speed up challenging estimation of rare catastrophic events. These techniques often leverage the knowledge and analysis on underlying system structures to endow desirable efficiency guarantees. However, black-box problems, especially those arising from recent safety-critical applications of AI-driven p… ▽ More Rare-event simulation techniques, such as importance sampling (IS), constitute powerful tools to speed up challenging estimation of rare catastrophic events. These techniques often leverage the knowledge and analysis on underlying system structures to endow desirable efficiency guarantees. However, black-box problems, especially those arising from recent safety-critical applications of AI-driven physical systems, can fundamentally undermine their efficiency guarantees and lead to dangerous under-estimation without diagnostically detected. We propose a framework called Deep Probabilistic Accelerated Evaluation (Deep-PrAE) to design statistically guaranteed IS, by converting black-box samplers that are versatile but could lack guarantees, into one with what we call a relaxed efficiency certificate that allows accurate estimation of bounds on the rare-event probability. We present the theory of Deep-PrAE that combines the dominating point concept with rare-event set learning via deep neural network classifiers, and demonstrate its effectiveness in numerical examples including the safety-testing of intelligent driving algorithms. △ Less

Submitted 3 November, 2021; originally announced November 2021.

Comments: The conference version of this paper has appeared in AISTATS 2021 (arXiv:2006.15722)

arXiv:2110.07531 [pdf]

Deep learning models for predicting RNA degradation via dual crowdsourcing

Authors: Hannah K. Wayment-Steele, Wipapat Kladwang, Andrew M. Watkins, Do Soon Kim, Bojan Tunguz, Walter Reade, Maggie Demkin, Jonathan Romano, Roger Wellington-Oguri, John J. Nicol, Jiayang Gao, Kazuki Onodera, Kazuki Fujikawa, Hanfei Mao, Gilles Vandewiele, Michele Tinti, Bram Steenwinckel, Takuya Ito, Taiga Noumi, Shujun He, Keiichiro Ishi, Youhan Lee, Fatih Öztürk, Anthony Chiu, Emin Öztürk , et al. (4 additional authors not shown)

Abstract: Messenger RNA-based medicines hold immense potential, as evidenced by their rapid deployment as COVID-19 vaccines. However, worldwide distribution of mRNA molecules has been limited by their thermostability, which is fundamentally limited by the intrinsic instability of RNA molecules to a chemical degradation reaction called in-line hydrolysis. Predicting the degradation of an RNA molecule is a ke… ▽ More Messenger RNA-based medicines hold immense potential, as evidenced by their rapid deployment as COVID-19 vaccines. However, worldwide distribution of mRNA molecules has been limited by their thermostability, which is fundamentally limited by the intrinsic instability of RNA molecules to a chemical degradation reaction called in-line hydrolysis. Predicting the degradation of an RNA molecule is a key task in designing more stable RNA-based therapeutics. Here, we describe a crowdsourced machine learning competition ("Stanford OpenVaccine") on Kaggle, involving single-nucleotide resolution measurements on 6043 102-130-nucleotide diverse RNA constructs that were themselves solicited through crowdsourcing on the RNA design platform Eterna. The entire experiment was completed in less than 6 months, and 41% of nucleotide-level predictions from the winning model were within experimental error of the ground truth measurement. Furthermore, these models generalized to blindly predicting orthogonal degradation data on much longer mRNA molecules (504-1588 nucleotides) with improved accuracy compared to previously published models. Top teams integrated natural language processing architectures and data augmentation techniques with predictions from previous dynamic programming models for RNA secondary structure. These results indicate that such models are capable of representing in-line hydrolysis with excellent accuracy, supporting their use for designing stabilized messenger RNAs. The integration of two crowdsourcing platforms, one for data set creation and another for machine learning, may be fruitful for other urgent problems that demand scientific discovery on rapid timescales. △ Less

Submitted 22 April, 2022; v1 submitted 14 October, 2021; originally announced October 2021.

arXiv:2110.04232 [pdf, other]

Learning Topic Models: Identifiability and Finite-Sample Analysis

Authors: Yinyin Chen, Shishuang He, Yun Yang, Feng Liang

Abstract: Topic models provide a useful text-mining tool for learning, extracting, and discovering latent structures in large text corpora. Although a plethora of methods have been proposed for topic modeling, lacking in the literature is a formal theoretical investigation of the statistical identifiability and accuracy of latent topic estimation. In this paper, we propose a maximum likelihood estimator (ML… ▽ More Topic models provide a useful text-mining tool for learning, extracting, and discovering latent structures in large text corpora. Although a plethora of methods have been proposed for topic modeling, lacking in the literature is a formal theoretical investigation of the statistical identifiability and accuracy of latent topic estimation. In this paper, we propose a maximum likelihood estimator (MLE) of latent topics based on a specific integrated likelihood that is naturally connected to the concept, in computational geometry, of volume minimization. Our theory introduces a new set of geometric conditions for topic model identifiability, conditions that are weaker than conventional separability conditions, which typically rely on the existence of pure topic documents or of anchor words. Weaker conditions allow a wider and thus potentially more fruitful investigation. We conduct finite-sample error analysis for the proposed estimator and discuss connections between our results and those of previous investigations. We conclude with empirical studies employing both simulated and real datasets. △ Less

Submitted 10 August, 2022; v1 submitted 8 October, 2021; originally announced October 2021.

arXiv:2109.03397 [pdf, other]

Functional Principal Subspace Sampling for Large Scale Functional Data Analysis

Authors: Shiyuan He, Xiaomeng Yan

Abstract: Functional data analysis (FDA) methods have computational and theoretical appeals for some high dimensional data, but lack the scalability to modern large sample datasets. To tackle the challenge, we develop randomized algorithms for two important FDA methods: functional principal component analysis (FPCA) and functional linear regression (FLR) with scalar response. The two methods are connected a… ▽ More Functional data analysis (FDA) methods have computational and theoretical appeals for some high dimensional data, but lack the scalability to modern large sample datasets. To tackle the challenge, we develop randomized algorithms for two important FDA methods: functional principal component analysis (FPCA) and functional linear regression (FLR) with scalar response. The two methods are connected as they both rely on the accurate estimation of functional principal subspace. The proposed algorithms draw subsamples from the large dataset at hand and apply FPCA or FLR over the subsamples to reduce the computational cost. To effectively preserve subspace information in the subsamples, we propose a functional principal subspace sampling probability, which removes the eigenvalue scale effect inside the functional principal subspace and properly weights the residual. Based on the operator perturbation analysis, we show the proposed probability has precise control over the first order error of the subspace projection operator and can be interpreted as an importance sampling for functional subspace estimation. Moreover, concentration bounds for the proposed algorithms are established to reflect the low intrinsic dimension nature of functional data in an infinite dimensional space. The effectiveness of the proposed algorithms is demonstrated upon synthetic and real datasets. △ Less

Submitted 8 April, 2022; v1 submitted 7 September, 2021; originally announced September 2021.

arXiv:2109.01993 [pdf, ps, other]

Statistical computation methods for microbiome compositional data network inference

Authors: Liang Chen, Qiuyan He, Hui Wan, Shun He, Minghua Deng

Abstract: Microbes can affect processes from food production to human health. Such microbes are not isolated, but rather interact with each other and establish connections with their living environments. Understanding these interactions is essential to an understanding of the organization and complex interplay of microbial communities, as well as the structure and dynamics of various ecosystems. A common an… ▽ More Microbes can affect processes from food production to human health. Such microbes are not isolated, but rather interact with each other and establish connections with their living environments. Understanding these interactions is essential to an understanding of the organization and complex interplay of microbial communities, as well as the structure and dynamics of various ecosystems. A common and essential approach toward this objective involves the inference of microbiome interaction networks. Although network inference methods in other fields have been studied before, applying these methods to estimate microbiome associations based on compositional data will not yield valid results. On the one hand, features of microbiome data such as compositionality, sparsity and high-dimensionality challenge the data normalization and the design of computational methods. On the other hand, several issues like microbial community heterogeneity, external environmental interference and biological concerns also make it more difficult to deal with the network inference. In this paper, we provide a comprehensive review of emerging microbiome interaction network inference methods. According to various assumptions and research targets, estimated networks are divided into four main categories: correlation networks, conditional correlation networks, mixture networks and differential networks. Their scope of applications, advantages and limitations are presented in this review. Since real microbial interactions can be complex and dynamic, no unifying method has captured all the aspects of interest to date. In addition, we discuss the challenges now confronting current microbial associations study and future prospects. Finally, we highlight that the research in microbial network inference requires the joint promotion of statistical computation methods and experimental techniques. △ Less

Submitted 5 September, 2021; originally announced September 2021.

arXiv:2108.05908 [pdf, ps, other]

Higher-Order Expansion and Bartlett Correctability of Distributionally Robust Optimization

Authors: Shengyi He, Henry Lam

Abstract: Distributionally robust optimization (DRO) is a worst-case framework for stochastic optimization under uncertainty that has drawn fast-growing studies in recent years. When the underlying probability distribution is unknown and observed from data, DRO suggests to compute the worst-case distribution within a so-called uncertainty set that captures the involved statistical uncertainty. In particular… ▽ More Distributionally robust optimization (DRO) is a worst-case framework for stochastic optimization under uncertainty that has drawn fast-growing studies in recent years. When the underlying probability distribution is unknown and observed from data, DRO suggests to compute the worst-case distribution within a so-called uncertainty set that captures the involved statistical uncertainty. In particular, DRO with uncertainty set constructed as a statistical divergence neighborhood ball has been shown to provide a tool for constructing valid confidence intervals for nonparametric functionals, and bears a duality with the empirical likelihood (EL). In this paper, we show how adjusting the ball size of such type of DRO can reduce higher-order coverage errors similar to the Bartlett correction. Our correction, which applies to general von Mises differentiable functionals, is more general than the existing EL literature that only focuses on smooth function models or $M$-estimation. Moreover, we demonstrate a higher-order "self-normalizing" property of DRO regardless of the choice of divergence. Our approach builds on the development of a higher-order expansion of DRO, which is obtained through an asymptotic analysis on a fixed point equation arising from the Karush-Kuhn-Tucker conditions. △ Less

Submitted 11 August, 2021; originally announced August 2021.

arXiv:2107.08288 [pdf, other]

A Reproducing Kernel Hilbert Space Approach to Functional Calibration of Computer Models

Authors: Rui Tuo, Shiyuan He, Arash Pourhabib, Yu Ding, Jianhua Z. Huang

Abstract: This paper develops a frequentist solution to the functional calibration problem, where the value of a calibration parameter in a computer model is allowed to vary with the value of control variables in the physical system. The need of functional calibration is motivated by engineering applications where using a constant calibration parameter results in a significant mismatch between outputs from… ▽ More This paper develops a frequentist solution to the functional calibration problem, where the value of a calibration parameter in a computer model is allowed to vary with the value of control variables in the physical system. The need of functional calibration is motivated by engineering applications where using a constant calibration parameter results in a significant mismatch between outputs from the computer model and the physical experiment. Reproducing kernel Hilbert spaces (RKHS) are used to model the optimal calibration function, defined as the functional relationship between the calibration parameter and control variables that gives the best prediction. This optimal calibration function is estimated through penalized least squares with an RKHS-norm penalty and using physical data. An uncertainty quantification procedure is also developed for such estimates. Theoretical guarantees of the proposed method are provided in terms of prediction consistency and consistency of estimating the optimal calibration function. The proposed method is tested using both real and synthetic data and exhibits more robust performance in prediction and uncertainty quantification than the existing parametric functional calibration method and a state-of-art Bayesian method. △ Less

Submitted 17 July, 2021; originally announced July 2021.

MSC Class: 62G05; 62P30; 62F15

arXiv:2102.10631 [pdf, other]

Adaptive Importance Sampling for Efficient Stochastic Root Finding and Quantile Estimation

Authors: Shengyi He, Guangxin Jiang, Henry Lam, Michael C. Fu

Abstract: In solving simulation-based stochastic root-finding or optimization problems that involve rare events, such as in extreme quantile estimation, running crude Monte Carlo can be prohibitively inefficient. To address this issue, importance sampling can be employed to drive down the sampling error to a desirable level. However, selecting a good importance sampler requires knowledge of the solution to… ▽ More In solving simulation-based stochastic root-finding or optimization problems that involve rare events, such as in extreme quantile estimation, running crude Monte Carlo can be prohibitively inefficient. To address this issue, importance sampling can be employed to drive down the sampling error to a desirable level. However, selecting a good importance sampler requires knowledge of the solution to the problem at hand, which is the goal to begin with and thus forms a circular challenge. We investigate the use of adaptive importance sampling to untie this circularity. Our procedure sequentially updates the importance sampler to reach the optimal sampler and the optimal solution simultaneously, and can be embedded in both sample average approximation and stochastic approximation-type algorithms. Our theoretical analysis establishes strong consistency and asymptotic normality of the resulting estimators. We also demonstrate, via a minimax perspective, the key role of using adaptivity in controlling asymptotic errors. Finally, we illustrate the effectiveness of our approach via numerical experiments. △ Less

Submitted 21 February, 2021; originally announced February 2021.

arXiv:2101.02938 [pdf, other]

doi 10.1214/21-AOAS1440

Simultaneous inference of periods and period-luminosity relations for Mira variable stars

Authors: Shiyuan He, Zhenfeng Lin, Wenlong Yuan, Lucas M. Macri, Jianhua Z. Huang

Abstract: The Period--Luminosity relation (PLR) of Mira variable stars is an important tool to determine astronomical distances. The common approach of estimating the PLR is a two-step procedure that first estimates the Mira periods and then runs a linear regression of magnitude on log period. When the light curves are sparse and noisy, the accuracy of period estimation decreases and can suffer from aliasin… ▽ More The Period--Luminosity relation (PLR) of Mira variable stars is an important tool to determine astronomical distances. The common approach of estimating the PLR is a two-step procedure that first estimates the Mira periods and then runs a linear regression of magnitude on log period. When the light curves are sparse and noisy, the accuracy of period estimation decreases and can suffer from aliasing effects. Some methods improve accuracy by incorporating complex model structures at the expense of significant computational costs. Another drawback of existing methods is that they only provide point estimation without proper estimation of uncertainty. To overcome these challenges, we develop a hierarchical Bayesian model that simultaneously models the quasi-periodic variations for a collection of Mira light curves while estimating their common PLR. By borrowing strengths through the PLR, our method automatically reduces the aliasing effect, improves the accuracy of period estimation, and is capable of characterizing the estimation uncertainty. We develop a scalable stochastic variational inference algorithm for computation that can effectively deal with the multimodal posterior of period. The effectiveness of the proposed method is demonstrated through simulations, and an application to observations of Miras in the Local Group galaxy M33. Without using ad-hoc period correction tricks, our method achieves a distance estimate of M33 that is consistent with published work. Our method also shows superior robustness to downsampling of the light curves. △ Less

Submitted 8 January, 2021; originally announced January 2021.

arXiv:2101.02304 [pdf, other]

Statistical challenges in the analysis of sequence and structure data for the COVID-19 spike protein

Authors: Shiyu He, Samuel W. K. Wong

Abstract: As the major target of many vaccines and neutralizing antibodies against SARS-CoV-2, the spike (S) protein is observed to mutate over time. In this paper, we present statistical approaches to tackle some challenges associated with the analysis of S-protein data. We build a Bayesian hierarchical model to study the temporal and spatial evolution of S-protein sequences, after grouping the sequences i… ▽ More As the major target of many vaccines and neutralizing antibodies against SARS-CoV-2, the spike (S) protein is observed to mutate over time. In this paper, we present statistical approaches to tackle some challenges associated with the analysis of S-protein data. We build a Bayesian hierarchical model to study the temporal and spatial evolution of S-protein sequences, after grouping the sequences into representative clusters. We then apply sampling methods to investigate possible changes to the S-protein's 3-D structure as a result of commonly observed mutations. While the increasing spread of D614G variants has been noted in other research, our results also show that the co-occurring mutations of D614G together with S477N or A222V may spread even more rapidly, as quantified by our model estimates. △ Less

Submitted 30 January, 2021; v1 submitted 6 January, 2021; originally announced January 2021.

Comments: 21 pages, 5 figures

arXiv:2008.07318 [pdf, other]

Towards Dynamic Urban Bike Usage Prediction for Station Network Reconfiguration

Authors: Xi Yang, Suining He

Abstract: Bike sharing has become one of the major choices of transportation for residents in metropolitan cities worldwide. A station-based bike sharing system is usually operated in the way that a user picks up a bike from one station, and drops it off at another. Bike stations are, however, not static, as the bike stations are often reconfigured to accommodate changing demands or city urbanization over t… ▽ More Bike sharing has become one of the major choices of transportation for residents in metropolitan cities worldwide. A station-based bike sharing system is usually operated in the way that a user picks up a bike from one station, and drops it off at another. Bike stations are, however, not static, as the bike stations are often reconfigured to accommodate changing demands or city urbanization over time. One of the key operations is to evaluate candidate locations and install new stations to expand the bike sharing station network. Conventional practices have been studied to predict existing station usage, while evaluating new stations is highly challenging due to the lack of the historical bike usage. To fill this gap, in this work we propose a novel and efficient bike station-level prediction algorithm called AtCoR, which can predict the bike usage at both existing and new stations (candidate locations during reconfiguration). In order to address the lack of historical data issues, virtual historical usage of new stations is generated according to their correlations with the surrounding existing stations, for AtCoR model initialization. We have designed novel station-centered heatmaps which characterize for each target station centered at the heatmap the trend that riders travel between it and the station's neighboring regions, enabling the model to capture the learnable features of the bike station network. The captured features are further applied to the prediction of bike usage for new stations. Our extensive experiment study on more than 23 million trips from three major bike sharing systems in US, including New York City, Chicago and Los Angeles, shows that AtCoR outperforms baselines and state-of-art models in prediction of both existing and future stations. △ Less

Submitted 13 August, 2020; originally announced August 2020.

Comments: 9 pages, UrbComp 2020

arXiv:2006.15722 [pdf, other]

Deep Probabilistic Accelerated Evaluation: A Robust Certifiable Rare-Event Simulation Methodology for Black-Box Safety-Critical Systems

Authors: Mansur Arief, Zhiyuan Huang, Guru Koushik Senthil Kumar, Yuanlu Bai, Shengyi He, Wenhao Ding, Henry Lam, Ding Zhao

Abstract: Evaluating the reliability of intelligent physical systems against rare safety-critical events poses a huge testing burden for real-world applications. Simulation provides a useful platform to evaluate the extremal risks of these systems before their deployments. Importance Sampling (IS), while proven to be powerful for rare-event simulation, faces challenges in handling these learning-based syste… ▽ More Evaluating the reliability of intelligent physical systems against rare safety-critical events poses a huge testing burden for real-world applications. Simulation provides a useful platform to evaluate the extremal risks of these systems before their deployments. Importance Sampling (IS), while proven to be powerful for rare-event simulation, faces challenges in handling these learning-based systems due to their black-box nature that fundamentally undermines its efficiency guarantee, which can lead to under-estimation without diagnostically detected. We propose a framework called Deep Probabilistic Accelerated Evaluation (Deep-PrAE) to design statistically guaranteed IS, by converting black-box samplers that are versatile but could lack guarantees, into one with what we call a relaxed efficiency certificate that allows accurate estimation of bounds on the safety-critical event probability. We present the theory of Deep-PrAE that combines the dominating point concept with rare-event set learning via deep neural network classifiers, and demonstrate its effectiveness in numerical examples including the safety-testing of an intelligent driving algorithm. △ Less

Submitted 8 March, 2021; v1 submitted 28 June, 2020; originally announced June 2020.

arXiv:2006.09449 [pdf, other]

Network Diffusions via Neural Mean-Field Dynamics

Authors: Shushan He, Hongyuan Zha, Xiaojing Ye

Abstract: We propose a novel learning framework based on neural mean-field dynamics for inference and estimation problems of diffusion on networks. Our new framework is derived from the Mori-Zwanzig formalism to obtain an exact evolution of the node infection probabilities, which renders a delay differential equation with memory integral approximated by learnable time convolution operators, resulting in a h… ▽ More We propose a novel learning framework based on neural mean-field dynamics for inference and estimation problems of diffusion on networks. Our new framework is derived from the Mori-Zwanzig formalism to obtain an exact evolution of the node infection probabilities, which renders a delay differential equation with memory integral approximated by learnable time convolution operators, resulting in a highly structured and interpretable RNN. Directly using cascade data, our framework can jointly learn the structure of the diffusion network and the evolution of infection probabilities, which are cornerstone to important downstream applications such as influence maximization. Connections between parameter learning and optimal control are also established. Empirical study shows that our approach is versatile and robust to variations of the underlying diffusion network models, and significantly outperform existing approaches in accuracy and efficiency on both synthetic and real-world data. △ Less

Submitted 19 January, 2021; v1 submitted 16 June, 2020; originally announced June 2020.

Comments: Accepted by NIPS2020, 21 pages, 5 figures

arXiv:2006.07972 [pdf, other]

Sub-Seasonal Climate Forecasting via Machine Learning: Challenges, Analysis, and Advances

Authors: Sijie He, Xinyan Li, Timothy DelSole, Pradeep Ravikumar, Arindam Banerjee

Abstract: Sub-seasonal climate forecasting (SSF) focuses on predicting key climate variables such as temperature and precipitation in the 2-week to 2-month time scales. Skillful SSF would have immense societal value, in areas such as agricultural productivity, water resource management, transportation and aviation systems, and emergency planning for extreme weather events. However, SSF is considered more ch… ▽ More Sub-seasonal climate forecasting (SSF) focuses on predicting key climate variables such as temperature and precipitation in the 2-week to 2-month time scales. Skillful SSF would have immense societal value, in areas such as agricultural productivity, water resource management, transportation and aviation systems, and emergency planning for extreme weather events. However, SSF is considered more challenging than either weather prediction or even seasonal prediction. In this paper, we carefully study a variety of machine learning (ML) approaches for SSF over the US mainland. While atmosphere-land-ocean couplings and the limited amount of good quality data makes it hard to apply black-box ML naively, we show that with carefully constructed feature representations, even linear regression models, e.g., Lasso, can be made to perform well. Among a broad suite of 10 ML approaches considered, gradient boosting performs the best, and deep learning (DL) methods show some promise with careful architecture choices. Overall, suitable ML methods are able to outperform the climatological baseline, i.e., predictions based on the 30-year average at a given location and time. Further, based on studying feature importance, ocean (especially indices based on climatic oscillations such as El Nino) and land (soil moisture) covariates are found to be predictive, whereas atmospheric covariates are not considered helpful. △ Less

Submitted 24 June, 2020; v1 submitted 14 June, 2020; originally announced June 2020.

arXiv:2004.08113 [pdf, other]

Incorporating Multiple Cluster Centers for Multi-Label Learning

Authors: Senlin Shu, Fengmao Lv, Yan Yan, Li Li, Shuo He, Jun He

Abstract: Multi-label learning deals with the problem that each instance is associated with multiple labels simultaneously. Most of the existing approaches aim to improve the performance of multi-label learning by exploiting label correlations. Although the data augmentation technique is widely used in many machine learning tasks, it is still unclear whether data augmentation is helpful to multi-label learn… ▽ More Multi-label learning deals with the problem that each instance is associated with multiple labels simultaneously. Most of the existing approaches aim to improve the performance of multi-label learning by exploiting label correlations. Although the data augmentation technique is widely used in many machine learning tasks, it is still unclear whether data augmentation is helpful to multi-label learning. In this article, we propose to leverage the data augmentation technique to improve the performance of multi-label learning. Specifically, we first propose a novel data augmentation approach that performs clustering on the real examples and treats the cluster centers as virtual examples, and these virtual examples naturally embody the local label correlations and label importances. Then, motivated by the cluster assumption that examples in the same cluster should have the same label, we propose a novel regularization term to bridge the gap between the real examples and virtual examples, which can promote the local smoothness of the learning function. Extensive experimental results on a number of real-world multi-label datasets clearly demonstrate that our proposed approach outperforms the state-of-the-art counterparts. △ Less

Submitted 16 January, 2022; v1 submitted 17 April, 2020; originally announced April 2020.

Comments: 19 pages with 4 figures and 4 tables

arXiv:2003.06769 [pdf]

Multi-AI competing and winning against humans in iterated Rock-Paper-Scissors game

Authors: Lei Wang, Wenbin Huang, Yuanpeng Li, Julian Evans, Sailing He

Abstract: Predicting and modeling human behavior and finding trends within human decision-making processes is a major problem of social science. Rock Paper Scissors (RPS) is the fundamental strategic question in many game theory problems and real-world competitions. Finding the right approach to beat a particular human opponent is challenging. Here we use an AI (artificial intelligence) algorithm based on M… ▽ More Predicting and modeling human behavior and finding trends within human decision-making processes is a major problem of social science. Rock Paper Scissors (RPS) is the fundamental strategic question in many game theory problems and real-world competitions. Finding the right approach to beat a particular human opponent is challenging. Here we use an AI (artificial intelligence) algorithm based on Markov Models of one fixed memory length (abbreviated as "single AI") to compete against humans in an iterated RPS game. We model and predict human competition behavior by combining many Markov Models with different fixed memory lengths (abbreviated as "multi-AI"), and develop an architecture of multi-AI with changeable parameters to adapt to different competition strategies. We introduce a parameter called "focus length" (a positive number such as 5 or 10) to control the speed and sensitivity for our multi-AI to adapt to the opponent's strategy change. The focus length is the number of previous rounds that the multi-AI should look at when determining which Single-AI has the best performance and should choose to play for the next game. We experimented with 52 different people, each playing 300 rounds continuously against one specific multi-AI model, and demonstrated that our strategy could win against more than 95% of human opponents. △ Less

Submitted 22 November, 2020; v1 submitted 15 March, 2020; originally announced March 2020.

arXiv:2003.00359 [pdf, other]

Contextual-Bandit Based Personalized Recommendation with Time-Varying User Interests

Authors: Xiao Xu, Fang Dong, Yanghua Li, Shaojian He, Xin Li

Abstract: A contextual bandit problem is studied in a highly non-stationary environment, which is ubiquitous in various recommender systems due to the time-varying interests of users. Two models with disjoint and hybrid payoffs are considered to characterize the phenomenon that users' preferences towards different items vary differently over time. In the disjoint payoff model, the reward of playing an arm i… ▽ More A contextual bandit problem is studied in a highly non-stationary environment, which is ubiquitous in various recommender systems due to the time-varying interests of users. Two models with disjoint and hybrid payoffs are considered to characterize the phenomenon that users' preferences towards different items vary differently over time. In the disjoint payoff model, the reward of playing an arm is determined by an arm-specific preference vector, which is piecewise-stationary with asynchronous and distinct changes across different arms. An efficient learning algorithm that is adaptive to abrupt reward changes is proposed and theoretical regret analysis is provided to show that a sublinear scaling of regret in the time length $T$ is achieved. The algorithm is further extended to a more general setting with hybrid payoffs where the reward of playing an arm is determined by both an arm-specific preference vector and a joint coefficient vector shared by all arms. Empirical experiments are conducted on real-world datasets to verify the advantages of the proposed learning algorithms against baseline ones in both settings. △ Less

Submitted 29 February, 2020; originally announced March 2020.

Comments: Accepted by AAAI 20

arXiv:2002.03763 [pdf, other]

Learning Numerical Observers using Unsupervised Domain Adaptation

Authors: Shenghua He, Weimin Zhou, Hua Li, Mark A. Anastasio

Abstract: Medical imaging systems are commonly assessed by use of objective image quality measures. Supervised deep learning methods have been investigated to implement numerical observers for task-based image quality assessment. However, labeling large amounts of experimental data to train deep neural networks is tedious, expensive, and prone to subjective errors. Computer-simulated image data can potentia… ▽ More Medical imaging systems are commonly assessed by use of objective image quality measures. Supervised deep learning methods have been investigated to implement numerical observers for task-based image quality assessment. However, labeling large amounts of experimental data to train deep neural networks is tedious, expensive, and prone to subjective errors. Computer-simulated image data can potentially be employed to circumvent these issues; however, it is often difficult to computationally model complicated anatomical structures, noise sources, and the response of real world imaging systems. Hence, simulated image data will generally possess physical and statistical differences from the experimental image data they seek to emulate. Within the context of machine learning, these differences between the sets of two images is referred to as domain shift. In this study, we propose and investigate the use of an adversarial domain adaptation method to mitigate the deleterious effects of domain shift between simulated and experimental image data for deep learning-based numerical observers (DL-NOs) that are trained on simulated images but applied to experimental ones. In the proposed method, a DL-NO will initially be trained on computer-simulated image data and subsequently adapted for use with experimental image data, without the need for any labeled experimental images. As a proof of concept, a binary signal detection task is considered. The success of this strategy as a function of the degree of domain shift present between the simulated and experimental image data is investigated. △ Less

Submitted 22 February, 2020; v1 submitted 3 February, 2020; originally announced February 2020.

Comments: SPIE Medical Imaging 2020 (Oral)

arXiv:2001.11425 [pdf, other]

doi 10.1080/00401706.2021.2008502

Functional PCA with Covariate Dependent Mean and Covariance Structure

Authors: Fei Ding, Shiyuan He, David E. Jones, Jianhua Z. Huang

Abstract: Incorporating covariates into functional principal component analysis (PCA) can substantially improve the representation efficiency of the principal components and predictive performance. However, many existing functional PCA methods do not make use of covariates, and those that do often have high computational cost or make overly simplistic assumptions that are violated in practice. In this artic… ▽ More Incorporating covariates into functional principal component analysis (PCA) can substantially improve the representation efficiency of the principal components and predictive performance. However, many existing functional PCA methods do not make use of covariates, and those that do often have high computational cost or make overly simplistic assumptions that are violated in practice. In this article, we propose a new framework, called Covariate Dependent Functional Principal Component Analysis (CD-FPCA), in which both the mean and covariance structure depend on covariates. We propose a corresponding estimation algorithm, which makes use of spline basis representations and roughness penalties, and is substantially more computationally efficient than competing approaches of adequate estimation and prediction accuracy. A key aspect of our work is our novel approach for modeling the covariance function and ensuring that it is symmetric positive semi-definite. We demonstrate the advantages of our methodology through a simulation study and an astronomical data analysis. △ Less

Submitted 19 August, 2023; v1 submitted 30 January, 2020; originally announced January 2020.

Comments: 28 pages, 3 figures

Journal ref: Technometrics, 64(3), 335-345 (2022)

arXiv:1910.07899 [pdf, other]

Design, Benchmarking and Explainability Analysis of a Game-Theoretic Framework towards Energy Efficiency in Smart Infrastructure

Authors: Ioannis C. Konstantakopoulos, Hari Prasanna Das, Andrew R. Barkan, Shiying He, Tanya Veeravalli, Huihan Liu, Aummul Baneen Manasawala, Yu-Wen Lin, Costas J. Spanos

Abstract: In this paper, we propose a gamification approach as a novel framework for smart building infrastructure with the goal of motivating human occupants to reconsider personal energy usage and to have positive effects on their environment. Human interaction in the context of cyber-physical systems is a core component and consideration in the implementation of any smart building technology. Research ha… ▽ More In this paper, we propose a gamification approach as a novel framework for smart building infrastructure with the goal of motivating human occupants to reconsider personal energy usage and to have positive effects on their environment. Human interaction in the context of cyber-physical systems is a core component and consideration in the implementation of any smart building technology. Research has shown that the adoption of human-centric building services and amenities leads to improvements in the operational efficiency of these cyber-physical systems directed towards controlling building energy usage. We introduce a strategy in form of a game-theoretic framework that incorporates humans-in-the-loop modeling by creating an interface to allow building managers to interact with occupants and potentially incentivize energy efficient behavior. Prior works on game theoretic analysis typically rely on the assumption that the utility function of each individual agent is known a priori. Instead, we propose novel utility learning framework for benchmarking that employs robust estimations of occupant actions towards energy efficiency. To improve forecasting performance, we extend the utility learning scheme by leveraging deep bi-directional recurrent neural networks. Using the proposed methods on data gathered from occupant actions for resources such as room lighting, we forecast patterns of energy resource usage to demonstrate the prediction performance of the methods. The results of our study show that we can achieve a highly accurate representation of the ground truth for occupant energy resource usage. We also demonstrate the explainable nature on human decision making towards energy usage inherent in the dataset using graphical lasso and granger causality algorithms. Finally, we open source the de-identified, high-dimensional data pertaining to the energy game-theoretic framework. △ Less

Submitted 16 October, 2019; originally announced October 2019.

Comments: arXiv admin note: substantial text overlap with arXiv:1809.05142, arXiv:1810.10533

arXiv:1902.03047 [pdf, other]

Collaboration based Multi-Label Learning

Authors: Lei Feng, Bo An, Shuo He

Abstract: It is well-known that exploiting label correlations is crucially important to multi-label learning. Most of the existing approaches take label correlations as prior knowledge, which may not correctly characterize the real relationships among labels. Besides, label correlations are normally used to regularize the hypothesis space, while the final predictions are not explicitly correlated. In this p… ▽ More It is well-known that exploiting label correlations is crucially important to multi-label learning. Most of the existing approaches take label correlations as prior knowledge, which may not correctly characterize the real relationships among labels. Besides, label correlations are normally used to regularize the hypothesis space, while the final predictions are not explicitly correlated. In this paper, we suggest that for each individual label, the final prediction involves the collaboration between its own prediction and the predictions of other labels. Based on this assumption, we first propose a novel method to learn the label correlations via sparse reconstruction in the label space. Then, by seamlessly integrating the learned label correlations into model training, we propose a novel multi-label learning approach that aims to explicitly account for the correlated predictions of labels while training the desired model simultaneously. Extensive experimental results show that our approach outperforms the state-of-the-art counterparts. △ Less

Submitted 8 February, 2019; originally announced February 2019.

Comments: Accepted by AAAI-19

arXiv:1810.00867 [pdf, other]

Domain-Adversarial Multi-Task Framework for Novel Therapeutic Property Prediction of Compounds

Authors: Lingwei Xie, Song He, Shu Yang, Boyuan Feng, Kun Wan, Zhongnan Zhang, Xiaochen Bo, Yufei Ding

Abstract: With the rapid development of high-throughput technologies, parallel acquisition of large-scale drug-informatics data provides huge opportunities to improve pharmaceutical research and development. One significant application is the purpose prediction of small molecule compounds, aiming to specify therapeutic properties of extensive purpose-unknown compounds and to repurpose novel therapeutic prop… ▽ More With the rapid development of high-throughput technologies, parallel acquisition of large-scale drug-informatics data provides huge opportunities to improve pharmaceutical research and development. One significant application is the purpose prediction of small molecule compounds, aiming to specify therapeutic properties of extensive purpose-unknown compounds and to repurpose novel therapeutic properties of FDA-approved drugs. Such problem is very challenging since compound attributes contain heterogeneous data with various feature patterns such as drug fingerprint, drug physicochemical property, drug perturbation gene expression. Moreover, there is complex nonlinear dependency among heterogeneous data. In this paper, we propose a novel domain-adversarial multi-task framework for integrating shared knowledge from multiple domains. The framework utilizes the adversarial strategy to effectively learn target representations and models their nonlinear dependency. Experiments on two real-world datasets illustrate that the performance of our approach obtains an obvious improvement over competitive baselines. The novel therapeutic properties of purpose-unknown compounds we predicted are mostly reported or brought to the clinics. Furthermore, our framework can integrate various attributes beyond the three domains examined here and can be applied in the industry for screening the purpose of huge amounts of as yet unidentified compounds. Source codes of this paper are available on Github. △ Less

Submitted 28 September, 2018; originally announced October 2018.

Comments: 9 pages, 6 figures

arXiv:1809.05142 [pdf, other]

A Deep Learning and Gamification Approach to Energy Conservation at Nanyang Technological University

Authors: Ioannis C. Konstantakopoulos, Andrew R. Barkan, Shiying He, Tanya Veeravalli, Huihan Liu, Costas Spanos

Abstract: The implementation of smart building technology in the form of smart infrastructure applications has great potential to improve sustainability and energy efficiency by leveraging humans-in-the-loop strategy. However, human preference in regard to living conditions is usually unknown and heterogeneous in its manifestation as control inputs to a building. Furthermore, the occupants of a building typ… ▽ More The implementation of smart building technology in the form of smart infrastructure applications has great potential to improve sustainability and energy efficiency by leveraging humans-in-the-loop strategy. However, human preference in regard to living conditions is usually unknown and heterogeneous in its manifestation as control inputs to a building. Furthermore, the occupants of a building typically lack the independent motivation necessary to contribute to and play a key role in the control of smart building infrastructure. Moreover, true human actions and their integration with sensing/actuation platforms remains unknown to the decision maker tasked with improving operational efficiency. By modeling user interaction as a sequential discrete game between non-cooperative players, we introduce a gamification approach for supporting user engagement and integration in a human-centric cyber-physical system. We propose the design and implementation of a large-scale network game with the goal of improving the energy efficiency of a building through the utilization of cutting-edge Internet of Things (IoT) sensors and cyber-physical systems sensing/actuation platforms. A benchmark utility learning framework that employs robust estimations for classical discrete choice models provided for the derived high dimensional imbalanced data. To improve forecasting performance, we extend the benchmark utility learning scheme by leveraging Deep Learning end-to-end training with Deep bi-directional Recurrent Neural Networks. We apply the proposed methods to high dimensional data from a social game experiment designed to encourage energy efficient behavior among smart building occupants in Nanyang Technological University (NTU) residential housing. Using occupant-retrieved actions for resources such as lighting and A/C, we simulate the game defined by the estimated utility functions. △ Less

Submitted 25 September, 2018; v1 submitted 13 September, 2018; originally announced September 2018.

Comments: 16 double pages, shorter version submitted to Applied Energy Journal

arXiv:1807.11158 [pdf, other]

Robust Student Network Learning

Authors: Tianyu Guo, Chang Xu, Shiyi He, Boxin Shi, Chao Xu, Dacheng Tao

Abstract: Deep neural networks bring in impressive accuracy in various applications, but the success often relies on the heavy network architecture. Taking well-trained heavy networks as teachers, classical teacher-student learning paradigm aims to learn a student network that is lightweight yet accurate. In this way, a portable student network with significantly fewer parameters can achieve a considerable… ▽ More Deep neural networks bring in impressive accuracy in various applications, but the success often relies on the heavy network architecture. Taking well-trained heavy networks as teachers, classical teacher-student learning paradigm aims to learn a student network that is lightweight yet accurate. In this way, a portable student network with significantly fewer parameters can achieve a considerable accuracy which is comparable to that of teacher network. However, beyond accuracy, robustness of the learned student network against perturbation is also essential for practical uses. Existing teacher-student learning frameworks mainly focus on accuracy and compression ratios, but ignore the robustness. In this paper, we make the student network produce more confident predictions with the help of the teacher network, and analyze the lower bound of the perturbation that will destroy the confidence of the student network. Two important objectives regarding prediction scores and gradients of examples are developed to maximize this lower bound, so as to enhance the robustness of the student network without sacrificing the performance. Experiments on benchmark datasets demonstrate the efficiency of the proposed approach to learn robust student networks which have satisfying accuracy and compact sizes. △ Less

Submitted 30 July, 2018; v1 submitted 29 July, 2018; originally announced July 2018.

arXiv:1805.09023 [pdf, other]

Addressing the Item Cold-start Problem by Attribute-driven Active Learning

Authors: Yu Zhu, Jinhao Lin, Shibi He, Beidou Wang, Ziyu Guan, Haifeng Liu, Deng Cai

Abstract: In recommender systems, cold-start issues are situations where no previous events, e.g. ratings, are known for certain users or items. In this paper, we focus on the item cold-start problem. Both content information (e.g. item attributes) and initial user ratings are valuable for seizing users' preferences on a new item. However, previous methods for the item cold-start problem either 1) incorpora… ▽ More In recommender systems, cold-start issues are situations where no previous events, e.g. ratings, are known for certain users or items. In this paper, we focus on the item cold-start problem. Both content information (e.g. item attributes) and initial user ratings are valuable for seizing users' preferences on a new item. However, previous methods for the item cold-start problem either 1) incorporate content information into collaborative filtering to perform hybrid recommendation, or 2) actively select users to rate the new item without considering content information and then do collaborative filtering. In this paper, we propose a novel recommendation scheme for the item cold-start problem by leverage both active learning and items' attribute information. Specifically, we design useful user selection criteria based on items' attributes and users' rating history, and combine the criteria in an optimization framework for selecting users. By exploiting the feedback ratings, users' previous ratings and items' attributes, we then generate accurate rating predictions for the other unselected users. Experimental results on two real-world datasets show the superiority of our proposed method over traditional methods. △ Less

Submitted 23 May, 2018; originally announced May 2018.

Comments: 14 pages, 7 figures, 9 tables. Submitted to TKDE

ACM Class: H.3.3

arXiv:1805.00159 [pdf, other]

doi 10.1093/mnras/stz539

Detecting Galaxy-Filament Alignments in the Sloan Digital Sky Survey III

Authors: Yen-Chi Chen, Shirley Ho, Jonathan Blazek, Siyu He, Rachel Mandelbaum, Peter Melchior, Sukhdeep Singh

Abstract: Previous studies have shown the filamentary structures in the cosmic web influence the alignments of nearby galaxies. We study this effect in the LOWZ sample of the Sloan Digital Sky Survey using the "Cosmic Web Reconstruction" filament catalogue. We find that LOWZ galaxies exhibit a small but statistically significant alignment in the direction parallel to the orientation of nearby filaments. Thi… ▽ More Previous studies have shown the filamentary structures in the cosmic web influence the alignments of nearby galaxies. We study this effect in the LOWZ sample of the Sloan Digital Sky Survey using the "Cosmic Web Reconstruction" filament catalogue. We find that LOWZ galaxies exhibit a small but statistically significant alignment in the direction parallel to the orientation of nearby filaments. This effect is detectable even in the absence of nearby galaxy clusters, which suggests it is an effect from the matter distribution in the filament. A nonparametric regression model suggests that the alignment effect with filaments extends over separations of 30-40 Mpc. We find that galaxies that are bright and early-forming align more strongly with the directions of nearby filaments than those that are faint and late-forming; however, trends with stellar mass are less statistically significant, within the narrow range of stellar mass of this sample. △ Less

Submitted 21 February, 2019; v1 submitted 30 April, 2018; originally announced May 2018.

Comments: 14 pages, 13 figures. Accepted to the MNRAS

arXiv:1709.03891 [pdf, other]

High-Dimensional Dependency Structure Learning for Physical Processes

Authors: Jamal Golmohammadi, Imme Ebert-Uphoff, Sijie He, Yi Deng, Arindam Banerjee

Abstract: In this paper, we consider the use of structure learning methods for probabilistic graphical models to identify statistical dependencies in high-dimensional physical processes. Such processes are often synthetically characterized using PDEs (partial differential equations) and are observed in a variety of natural phenomena, including geoscience data capturing atmospheric and hydrological phenomena… ▽ More In this paper, we consider the use of structure learning methods for probabilistic graphical models to identify statistical dependencies in high-dimensional physical processes. Such processes are often synthetically characterized using PDEs (partial differential equations) and are observed in a variety of natural phenomena, including geoscience data capturing atmospheric and hydrological phenomena. Classical structure learning approaches such as the PC algorithm and variants are challenging to apply due to their high computational and sample requirements. Modern approaches, often based on sparse regression and variants, do come with finite sample guarantees, but are usually highly sensitive to the choice of hyper-parameters, e.g., parameter $λ$ for sparsity inducing constraint or regularization. In this paper, we present ACLIME-ADMM, an efficient two-step algorithm for adaptive structure learning, which estimates an edge specific parameter $λ_{ij}$ in the first step, and uses these parameters to learn the structure in the second step. Both steps of our algorithm use (inexact) ADMM to solve suitable linear programs, and all iterations can be done in closed form in an efficient block parallel manner. We compare ACLIME-ADMM with baselines on both synthetic data simulated by partial differential equations (PDEs) that model advection-diffusion processes, and real data (50 years) of daily global geopotential heights to study information flow in the atmosphere. ACLIME-ADMM is shown to be efficient, stable, and competitive, usually better than the baselines especially on difficult problems. On real data, ACLIME-ADMM recovers the underlying structure of global atmospheric circulation, including switches in wind directions at the equator and tropics entirely from the data. △ Less

Submitted 12 September, 2017; originally announced September 2017.

Comments: 21 pages, 8 figures, International Conference on Data Mining 2017

arXiv:1708.04742 [pdf, ps, other]

doi 10.3847/1538-3881/aa86f1

Large Magellanic Cloud Near-Infrared Synoptic Survey. V. Period-Luminosity Relations of Miras

Authors: Wenlong Yuan, Lucas M. Macri, Shiyuan He, Jianhua Z. Huang, Shashi M. Kanbur, Chow-Choong Ngeow

Abstract: We study the near-infrared properties of 690 Mira candidates in the central region of the Large Magellanic Cloud, based on time-series observations at JHKs. We use densely-sampled I-band observations from the OGLE project to generate template light curves in the near infrared and derive robust mean magnitudes at those wavelengths. We obtain near-infrared Period-Luminosity relations for Oxygen-rich… ▽ More We study the near-infrared properties of 690 Mira candidates in the central region of the Large Magellanic Cloud, based on time-series observations at JHKs. We use densely-sampled I-band observations from the OGLE project to generate template light curves in the near infrared and derive robust mean magnitudes at those wavelengths. We obtain near-infrared Period-Luminosity relations for Oxygen-rich Miras with a scatter as low as 0.12 mag at Ks. We study the Period-Luminosity-Color relations and the color excesses of Carbon-rich Miras, which show evidence for a substantially different reddening law. △ Less

Submitted 15 August, 2017; originally announced August 2017.

Comments: Accepted for publication in The Astronomical Journal

arXiv:1708.02079 [pdf, other]

The discrete moment problem with nonconvex shape constraints

Authors: Xi Chen, Simai He, Bo Jiang, Christopher Thomas Ryan, Teng Zhang

Abstract: The discrete moment problem is a foundational problem in distribution-free robust optimization, where the goal is to find a worst-case distribution that satisfies a given set of moments. This paper studies the discrete moment problems with additional "shape constraints" that guarantee the worst case distribution is either log-concave or has an increasing failure rate. These classes of shape constr… ▽ More The discrete moment problem is a foundational problem in distribution-free robust optimization, where the goal is to find a worst-case distribution that satisfies a given set of moments. This paper studies the discrete moment problems with additional "shape constraints" that guarantee the worst case distribution is either log-concave or has an increasing failure rate. These classes of shape constraints have not previously been studied in the literature, in part due to their inherent nonconvexities. Nonetheless, these classes of distributions are useful in practice. We characterize the structure of optimal extreme point distributions by developing new results in reverse convex optimization, a lesser-known tool previously employed in designing global optimization algorithms. We are able to show, for example, that an optimal extreme point solution to a moment problem with $m$ moments and log-concave shape constraints is piecewise geometric with at most $m$ pieces. Moreover, this structure allows us to design an exact algorithm for computing optimal solutions in a low-dimensional space of parameters. Moreover, We describe a computational approach to solving these low-dimensional problems, including numerical results for a representative set of instances. △ Less

Submitted 7 August, 2017; originally announced August 2017.

Comments: 31 pages, 3 figures

arXiv:1707.06635 [pdf]

Impact of the Global Crisis on SME Internal vs. External Financing in China

Authors: ShiXue He, Marcel Ausloos

Abstract: Changes in the capital structure before and after the global financial crisis for SMEs are studied, emphasizing their financing problems, distinguishing between internal financing and external financing determinants. The empirical research bears upon 158 small and medium-sized firms listed on Shenzhen and Shanghai Stock Exchanges in China over the period of 2004-2014. A regression analysis, along… ▽ More Changes in the capital structure before and after the global financial crisis for SMEs are studied, emphasizing their financing problems, distinguishing between internal financing and external financing determinants. The empirical research bears upon 158 small and medium-sized firms listed on Shenzhen and Shanghai Stock Exchanges in China over the period of 2004-2014. A regression analysis, along the lines of the Trade-Off Theory, shows that the leverage decreases with profitability, non-debt tax shields and the liquidity, and increases with firm size and tangibility. A positive relationship is found between firm growth and debt ratio, though not highly significantly. It is shown that the SMEs with high growth rates are those which will more easily obtain external financing after a financial crisis. It is recognized that the China government should reconsider SMEs taxation laws. △ Less

Submitted 1 July, 2017; originally announced July 2017.

Comments: 17 pages, 6 tables, 43 references

Journal ref: Banking and Finance Review, 9(1) (2017) 1-17

arXiv:1703.01000 [pdf, ps, other]

doi 10.3847/1538-3881/aa63f1

The M33 Synoptic Stellar Survey. II. Mira Variables

Authors: Wenlong Yuan, Shiyuan He, Lucas M. Macri, James Long, Jianhua Z. Huang

Abstract: We present the discovery of 1847 Mira candidates in the Local Group galaxy M33 using a novel semi-parametric periodogram technique coupled with a Random Forest classifier. The algorithms were applied to ~2.4x10^5 I-band light curves previously obtained by the M33 Synoptic Stellar Survey. We derive preliminary Period-Luminosity relations at optical, near- & mid-infrared wavelengths and compare them… ▽ More We present the discovery of 1847 Mira candidates in the Local Group galaxy M33 using a novel semi-parametric periodogram technique coupled with a Random Forest classifier. The algorithms were applied to ~2.4x10^5 I-band light curves previously obtained by the M33 Synoptic Stellar Survey. We derive preliminary Period-Luminosity relations at optical, near- & mid-infrared wavelengths and compare them to the corresponding relations in the Large Magellanic Cloud. △ Less

Submitted 24 March, 2017; v1 submitted 2 March, 2017; originally announced March 2017.

Comments: Includes small corrections to match the published version

Journal ref: The Astronomical Journal, 153:170 (2017)

arXiv:1611.01606 [pdf, other]

Learning to Play in a Day: Faster Deep Reinforcement Learning by Optimality Tightening

Authors: Frank S. He, Yang Liu, Alexander G. Schwing, Jian Peng

Abstract: We propose a novel training algorithm for reinforcement learning which combines the strength of deep Q-learning with a constrained optimization approach to tighten optimality and encourage faster reward propagation. Our novel technique makes deep reinforcement learning more practical by drastically reducing the training time. We evaluate the performance of our approach on the 49 games of the chall… ▽ More We propose a novel training algorithm for reinforcement learning which combines the strength of deep Q-learning with a constrained optimization approach to tighten optimality and encourage faster reward propagation. Our novel technique makes deep reinforcement learning more practical by drastically reducing the training time. We evaluate the performance of our approach on the 49 games of the challenging Arcade Learning Environment, and report significant improvements in both training time and accuracy. △ Less

Submitted 5 November, 2016; originally announced November 2016.

arXiv:1609.06680 [pdf, ps, other]

doi 10.3847/0004-6256/152/6/164

Period estimation for sparsely-sampled quasi-periodic light curves applied to Miras

Authors: Shiyuan He, Wenlong Yuan, Jianhua Z. Huang, James Long, Lucas M. Macri

Abstract: We develop a non-linear semi-parametric Gaussian process model to estimate periods of Miras with sparsely-sampled light curves. The model uses a sinusoidal basis for the periodic variation and a Gaussian process for the stochastic changes. We use maximum likelihood to estimate the period and the parameters of the Gaussian process, while integrating out the effects of other nuisance parameters in t… ▽ More We develop a non-linear semi-parametric Gaussian process model to estimate periods of Miras with sparsely-sampled light curves. The model uses a sinusoidal basis for the periodic variation and a Gaussian process for the stochastic changes. We use maximum likelihood to estimate the period and the parameters of the Gaussian process, while integrating out the effects of other nuisance parameters in the model with respect to a suitable prior distribution obtained from earlier studies. Since the likelihood is highly multimodal for period, we implement a hybrid method that applies the quasi-Newton algorithm for Gaussian process parameters and search the period/frequency parameter over a dense grid. A large-scale, high-fidelity simulation is conducted to mimic the sampling quality of Mira light curves obtained by the M33 Synoptic Stellar Survey. The simulated data set is publicly available and can serve as a testbed for future evaluation of different period estimation methods. The semi-parametric model outperforms an existing algorithm on this simulated test data set as measured by period recovery rate and quality of the resulting Period-Luminosity relations. △ Less

Submitted 17 November, 2016; v1 submitted 21 September, 2016; originally announced September 2016.

Comments: Changes in v3: minor edits to match the published version. Software package and test data set available at http://github.com/shiyuanhe/varStar

Journal ref: The Astronomical Journal, 152 (6), 164 (2016)

arXiv:1501.00537 [pdf]

A theoretical foundation of the target-decoy search strategy for false discovery rate control in proteomics

Authors: Kun He, Yan Fu, Wen-Feng Zeng, Lan Luo, Hao Chi, Chao Liu, Lai-Yun Qing, Rui-Xiang Sun, Si-Min He

Abstract: Motivation: Target-decoy search (TDS) is currently the most popular strategy for estimating and controlling the false discovery rate (FDR) of peptide identifications in mass spectrometry-based shotgun proteomics. While this strategy is very useful in practice and has been intensively studied empirically, its theoretical foundation has not yet been well established. Result: In this work, we systema… ▽ More Motivation: Target-decoy search (TDS) is currently the most popular strategy for estimating and controlling the false discovery rate (FDR) of peptide identifications in mass spectrometry-based shotgun proteomics. While this strategy is very useful in practice and has been intensively studied empirically, its theoretical foundation has not yet been well established. Result: In this work, we systematically analyze the TDS strategy in a rigorous statistical sense. We prove that the commonly used concatenated TDS provides a conservative estimate of the FDR for any given score threshold, but it cannot rigorously control the FDR. We prove that with a slight modification to the commonly used formula for FDR estimation, the peptide-level FDR can be rigorously controlled based on the concatenated TDS. We show that the spectrum-level FDR control is difficult. We verify the theoretical conclusions with real mass spectrometry data. △ Less

Submitted 3 January, 2015; originally announced January 2015.

Comments: 7 pages, 2 figures

arXiv:1407.8382 [pdf, ps, other]

doi 10.1214/14-AOAS724

Detection boundary and Higher Criticism approach for rare and weak genetic effects

Authors: Zheyang Wu, Yiming Sun, Shiquan He, Judy Cho, Hongyu Zhao, Jiashun Jin

Abstract: Genome-wide association studies (GWAS) have identified many genetic factors underlying complex human traits. However, these factors have explained only a small fraction of these traits' genetic heritability. It is argued that many more genetic factors remain undiscovered. These genetic factors likely are weakly associated at the population level and sparsely distributed across the genome. In this… ▽ More Genome-wide association studies (GWAS) have identified many genetic factors underlying complex human traits. However, these factors have explained only a small fraction of these traits' genetic heritability. It is argued that many more genetic factors remain undiscovered. These genetic factors likely are weakly associated at the population level and sparsely distributed across the genome. In this paper, we adapt the recent innovations on Tukey's Higher Criticism (Tukey [The Higher Criticism (1976) Princeton Univ.]; Donoho and Jin [Ann. Statist. 32 (2004) 962-994]) to SNP-set analysis of GWAS, and develop a new theoretical framework in large-scale inference to assess the joint significance of such rare and weak effects for a quantitative trait. In the core of our theory is the so-called detection boundary, a curve in the two-dimensional phase space that quantifies the rarity and strength of genetic effects. Above the detection boundary, the overall effects of genetic factors are strong enough for reliable detection. Below the detection boundary, the genetic factors are simply too rare and too weak for reliable detection. We show that the HC-type methods are optimal in that they reliably yield detection once the parameters of the genetic effects fall above the detection boundary and that many commonly used SNP-set methods are suboptimal. The superior performance of the HC-type approach is demonstrated through simulations and the analysis of a GWAS data set of Crohn's disease. △ Less

Submitted 31 July, 2014; originally announced July 2014.

Comments: Published in at http://dx.doi.org/10.1214/14-AOAS724 the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOAS-AOAS724

Journal ref: Annals of Applied Statistics 2014, Vol. 8, No. 2, 824-851

Showing 1–49 of 49 results for author: He, S