Search | arXiv e-print repository

arXiv:2407.01770 [pdf, other]

Exploring causal effects of hormone- and radio-treatments in an observational study of breast cancer using copula-based semi-competing risks models

Authors: Tonghui Yu, Mengjiao Peng, Yifan Cui, Elynn Chen, Chixiang Chen

Abstract: Breast cancer patients may experience relapse or death after surgery during the follow-up period, leading to dependent censoring of relapse. This phenomenon, known as semi-competing risk, imposes challenges in analyzing treatment effects on breast cancer and necessitates advanced statistical tools for unbiased analysis. Despite progress in estimation and inference within semi-competing risks regression, its application to causal inference is still in its early stages. This article aims to propose a frequentist and semi-parametric framework based on copula models that can facilitate valid causal inference, net quantity estimation and interpretation, and sensitivity analysis for unmeasured factors under right-censored semi-competing risks data. We also propose novel procedures to enhance parameter estimation and its applicability in real practice. After that, we apply the proposed framework to a breast cancer study and detect the time-varying causal effects of hormone- and radio-treatments on patients' relapse-free survival and overall survival. Moreover, extensive numerical evaluations demonstrate the method's feasibility, highlighting minimal estimation bias and reliable statistical inference.

Submitted 1 July, 2024; originally announced July 2024.

Comments: Contact: [email protected]

arXiv:2407.00561 [pdf, ps, other]

Advancing Information Integration through Empirical Likelihood: Selective Reviews and a New Idea

Authors: Chixiang Chen, Jia Liang, Elynn Chen, Ming Wang

Abstract: Information integration plays a pivotal role in biomedical studies by facilitating the combination and analysis of independent datasets from multiple studies, thereby uncovering valuable insights that might otherwise remain obscured due to the limited sample size in individual studies. However, sharing raw data from independent studies presents significant challenges, primarily due to the need to safeguard sensitive participant information and the cumbersome paperwork involved in data sharing. In this article, we first provide a selective review of recent methodological developments in information integration via empirical likelihood, wherein only summary information is required, rather than the raw data. Following this, we introduce a new insight and a potentially promising framework that could broaden the application of information integration across a wider spectrum. Furthermore, this new framework offers computational convenience compared to classic empirical likelihood-based methods. We provide numerical evaluations to assess its performance and discuss various extensions in the end.

Submitted 29 June, 2024; originally announced July 2024.

arXiv:2406.10778 [pdf, other]

Heterogeneous Entity Representation for Medicinal Synergy Prediction

Authors: Jiawei Wu, Jun Wen, Mingyuan Yan, Anqi Dong, Can Chen

Abstract: Medicinal synergy prediction is a powerful tool in drug discovery and development that harnesses the principles of combination therapy to enhance therapeutic outcomes by improving efficacy, reducing toxicity, and preventing drug resistance. While a myriad of computational methods has emerged for predicting synergistic drug combinations, a large portion of them may overlook the intricate, yet critical relationships between various entities in drug interaction networks, such as drugs, cell lines, and diseases. These relationships are complex and multidimensional, requiring sophisticated modeling to capture nuanced interplay that can significantly influence therapeutic efficacy. We introduce a salient deep hypergraph learning method, namely, Heterogeneous Entity Representation for MEdicinal Synergy prediction (HERMES), to predict anti-cancer drug synergy. HERMES integrates heterogeneous data sources, encompassing drug, cell line, and disease information, to provide a comprehensive understanding of the interactions involved. By leveraging advanced hypergraph neural networks with gated residual mechanisms, HERMES can effectively learn complex relationships/interactions within the data. Our results show HERMES demonstrates state-of-the-art performance, particularly in forecasting new drug combinations, significantly surpassing previous methods. This advancement underscores the potential of HERMES to facilitate more effective and precise drug combination predictions, thereby enhancing the development of novel therapeutic strategies.

Submitted 15 June, 2024; originally announced June 2024.

Comments: 8 pages, 3 figures

MSC Class: 92C50; 05C65; 68T07

arXiv:2404.10884 [pdf, other]

Modeling Interconnected Modules in Multivariate Outcomes: Evaluating the Impact of Alcohol Intake on Plasma Metabolomics

Authors: Yifan Yang, Chixiang Chen, Hwiyoung Lee, Ming Wang, Shuo Chen

Abstract: Alcohol consumption has been shown to influence cardiovascular mechanisms in humans, leading to observable alterations in the plasma metabolomic profile. Regression models are commonly employed to investigate these effects, treating metabolomics features as the outcomes and alcohol intake as the exposure. Given the latent dependence structure among the numerous metabolomic features (e.g., co-expression networks with interconnected modules), modeling this structure is crucial for accurately identifying metabolomic features associated with alcohol intake. However, integrating dependence structures into regression models remains difficult in both estimation and inference procedures due to their large or high dimensionality. To bridge this gap, we propose an innovative multivariate regression model that accounts for correlations among outcome features by incorporating an interconnected community structure. Furthermore, we derive closed-form and likelihood-based estimators, accompanied by explicit exact and explicit asymptotic covariance matrix estimators, respectively. Simulation analysis demonstrates that our approach provides accurate estimation of both dependence and regression coefficients, and enhances sensitivity while maintaining a well-controlled discovery rate, as evidenced through benchmarking against existing regression models. Finally, we apply our approach to assess the impact of alcohol intake on 249 metabolomic biomarkers measured using nuclear magnetic resonance spectroscopy. The results indicate that alcohol intake can elevate high-density lipoprotein levels by enhancing the transport rate of Apolipoproteins A1.

Submitted 16 April, 2024; originally announced April 2024.

Comments: 25 pages, 5 figures

arXiv:2403.15291 [pdf, other]

Wastewater-based Epidemiology for COVID-19 Surveillance: A Survey

Authors: Chen Chen, Gursharn Kaur, Aniruddha Adiga, Baltazar Espinoza, Srinivasan Venkatramanan, Andrew Warren, Bryan Lewis, Justin Crow, Rekha Singh, Alexandra Lorentz, Denise Toney, Madhav Marathe

Abstract: The pandemic of COVID-19 has imposed tremendous pressure on public health systems and social economic ecosystems over the past years. To alleviate its social impact, it is important to proactively track the prevalence of COVID-19 within communities. The traditional way to estimate the disease prevalence is to estimate from reported clinical test data or surveys. However, the coverage of clinical tests is often limited, and the tests can be labor-intensive and require reliable and timely results as well as consistent diagnostic and reporting criteria. Recent studies revealed that patients who are diagnosed with COVID-19 often undergo fecal shedding of SARS-CoV-2 virus into wastewater, which makes wastewater-based epidemiology (WBE) for COVID-19 surveillance a promising approach to complement traditional clinical testing. In this paper, we survey the existing literature regarding WBE for COVID-19 surveillance and summarize the current advances in the area. Specifically, we cover the key aspects of wastewater sampling and sample testing, and present a comprehensive and organized summary of wastewater data analytical methods. Finally, we outline the open challenges in current wastewater-based COVID-19 surveillance studies, aiming to encourage new ideas to advance the development of effective wastewater-based surveillance systems for general infectious diseases.

Submitted 22 March, 2024; originally announced March 2024.

arXiv:2403.15025 [pdf, other]

Robust Conformal Prediction under Distribution Shift via Physics-Informed Structural Causal Model

Authors: Rui Xu, Yue Sun, Chao Chen, Parv Venkitasubramaniam, Sihong Xie

Abstract: Uncertainty is critical to reliable decision-making with machine learning. Conformal prediction (CP) handles uncertainty by predicting a set on a test input, with the aim that the set covers the true label with at least $(1-α)$ confidence. This coverage can be guaranteed on test data even if the marginal distributions $P_X$ differ between calibration and test datasets. However, as is common in practice, when the conditional distribution $P_{Y|X}$ is different on calibration and test data, the coverage is not guaranteed and it is essential to measure and minimize the coverage loss under distributional shift at all possible confidence levels. To address these issues, we upper bound the coverage difference at all levels using the cumulative distribution functions of calibration and test conformal scores and Wasserstein distance. Inspired by the invariance of physics across data distributions, we propose a physics-informed structural causal model (PI-SCM) to reduce the upper bound. We validated that PI-SCM can improve coverage robustness along confidence level and test domain on a traffic speed prediction task and an epidemic spread task with multiple real-world datasets.

Submitted 22 March, 2024; originally announced March 2024.
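
Note on the entry above (arXiv:2403.15025): the coverage guarantee that the abstract attributes to conformal prediction can be illustrated with a minimal split-conformal sketch. This is not the paper's PI-SCM method. It assumes exchangeable calibration and test data, absolute-residual scores, and a hypothetical fixed predictor `model(x) = 2x`; the function names and the toy data are illustrative assumptions, not taken from the source.

```python
import numpy as np


def conformal_quantile(scores, alpha):
    """Return the finite-sample-corrected (1 - alpha) quantile of the scores."""
    n = len(scores)
    # Rank of the order statistic that gives at least 1 - alpha marginal coverage.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        # Too few calibration points for this alpha: the interval is unbounded.
        return np.inf
    return np.sort(scores)[k - 1]


def model(x):
    """Hypothetical pre-trained point predictor (an assumption for this sketch)."""
    return 2.0 * x


rng = np.random.default_rng(0)

# Toy calibration and test sets drawn from the same distribution (no shift).
x_cal = rng.uniform(0.0, 1.0, 500)
y_cal = 2.0 * x_cal + rng.normal(0.0, 0.1, 500)
x_test = rng.uniform(0.0, 1.0, 2000)
y_test = 2.0 * x_test + rng.normal(0.0, 0.1, 2000)

# Conformal scores on the calibration set: absolute residuals.
cal_scores = np.abs(y_cal - model(x_cal))
q = conformal_quantile(cal_scores, alpha=0.1)

# Prediction interval for each test input, and its empirical coverage.
lower = model(x_test) - q
upper = model(x_test) + q
coverage = np.mean((y_test >= lower) & (y_test <= upper))
print(f"interval half-width: {q:.3f}, empirical coverage: {coverage:.3f}")
```

With no shift between calibration and test data, the empirical coverage should land near the nominal 90% level. Replacing `y_test` with draws from a different conditional distribution is one way to observe the coverage loss the abstract refers to.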

arXiv:2402.12655 [pdf, other]

Ego Group Partition: A Novel Framework for Improving Ego Experiments in Social Networks

Authors: Lu Deng, JingJing Zhang, Yong Wang, Chuan Chen

Abstract: Estimating the average treatment effect in social networks is challenging due to individuals influencing each other. One approach to address interference is ego cluster experiments, where each cluster consists of a central individual (ego) and its peers (alters). Clusters are randomized, and only the effects on egos are measured. In this work, we propose an improved framework for ego cluster experiments called ego group partition (EGP), which directly generates two groups and an ego sub-population instead of ego clusters. Under specific model assumptions, we propose two ego group partition algorithms. Compared to the original ego clustering algorithm, our algorithms produce more egos, yield smaller biases, and support parallel computation. The performance of our algorithms is validated through simulation and real-world case studies.

Submitted 19 February, 2024; originally announced February 2024.

arXiv:2402.12653 [pdf, other]

Unbiased Estimation for Total Treatment Effect Under Interference Using Aggregated Dyadic Data

Authors: Lu Deng, Yilin Li, JingJing Zhang, Yong Wang, Chuan Chen

Abstract: In social media platforms, user behavior is often influenced by interactions with other users, complicating the accurate estimation of causal effects in traditional A/B experiments. This study investigates situations where an individual's outcome can be broken down into the sum of multiple pairwise outcomes, a reflection of user interactions. These outcomes, referred to as dyadic data, are prevalent in many social network contexts. Utilizing a Bernoulli randomized design, we introduce a novel unbiased estimator for the total treatment effect (TTE), which quantifies the difference in population mean when all individuals are assigned to treatment versus control groups. We further explore the bias of our estimator in scenarios where it is impractical to include all individuals in the experiment, a common constraint in online control experiments. Our numerical results reveal that our proposed estimator consistently outperforms some commonly used estimators, underscoring its potential for more precise causal effect estimation in social media environments.

Submitted 19 February, 2024; originally announced February 2024.

arXiv:2402.10062 [pdf, other]

Optimal Parameter and Neuron Pruning for Out-of-Distribution Detection

Authors: Chao Chen, Zhihang Fu, Kai Liu, Ze Chen, Mingyuan Tao, Jieping Ye

Abstract: For a machine learning model deployed in real-world scenarios, the ability to detect out-of-distribution (OOD) samples is indispensable and challenging. Most existing OOD detection methods focus on exploring advanced training skills or training-free tricks to prevent the model from yielding overconfident scores for unknown samples. The training-based methods incur expensive training costs and rely on OOD samples, which are not always available, while most training-free methods cannot efficiently utilize the prior information from the training data. In this work, we propose an Optimal Parameter and Neuron Pruning (OPNP) approach, which aims to identify and remove those parameters and neurons that lead to over-fitting. The main method is divided into two steps. In the first step, we evaluate the sensitivity of the model parameters and neurons by averaging gradients over all training samples. In the second step, the parameters and neurons with exceptionally large or close to zero sensitivities are removed for prediction. Our proposal is training-free, compatible with other post-hoc methods, and exploits the information from all training data. Extensive experiments are performed on multiple OOD detection tasks and model architectures, showing that our proposed OPNP consistently outperforms the existing methods by a large margin.

Submitted 4 February, 2024; originally announced February 2024.

Comments: Accepted by NeurIPS 2023. 19 pages

Journal ref: NeurIPS 2023

arXiv:2402.07134 [pdf, other]

Tail risk forecasting with semi-parametric regression models by incorporating overnight information

Authors: Cathy W. S. Chen, Takaaki Koike, Wei-Hsuan Shau

Abstract: This research incorporates realized volatility and overnight information into risk models, wherein the overnight return often contributes significantly to the total return volatility. Extending a semi-parametric regression model based on asymmetric Laplace distribution, we propose a family of RES-CAViaR-oc models by adding overnight return and realized measures as a nowcasting technique for simultaneously forecasting Value-at-Risk (VaR) and expected shortfall (ES). We utilize Bayesian methods to estimate unknown parameters and forecast VaR and ES jointly for the proposed model family. We also conduct extensive backtests based on joint elicitability of the pair of VaR and ES during the out-of-sample period. Our empirical study on four international stock indices confirms that overnight return and realized volatility are vital in tail risk forecasting.

Submitted 11 February, 2024; originally announced February 2024.

arXiv:2402.01112 [pdf]

Gerontologic Biostatistics 2.0: Developments over 10+ years in the age of data science

Authors: Chixiang Chen, Michelle Shardell, Jaime Lynn Speiser, Karen Bandeen-Roche, Heather Allore, Thomas G Travison, Michael Griswold, Terrence E. Murphy

Abstract: Background: Introduced in 2010, the sub-discipline of gerontologic biostatistics (GBS) was conceptualized to address the specific challenges in analyzing data from research studies involving older adults. However, the evolving technological landscape has catalyzed data science and statistical advancements since the original GBS publication, greatly expanding the scope of gerontologic research. There is a need to describe how these advancements enhance the analysis of multi-modal data and complex phenotypes that are hallmarks of gerontologic research. Methods: This paper introduces GBS 2.0, an updated and expanded set of analytical methods reflective of the practice of gerontologic biostatistics in contemporary and future research. Results: GBS 2.0 topics and relevant software resources include cutting-edge methods in experimental design; analytical techniques that include adaptations of machine learning, quantifying deep phenotypic measurements, high-dimensional -omics analysis; the integration of information from multiple studies, and strategies to foster reproducibility, replicability, and open science. Discussion: The methodological topics presented here seek to update and expand GBS. By facilitating the synthesis of biostatistics and data science in gerontology, we aim to foster the next generation of gerontologic researchers.

Submitted 1 February, 2024; originally announced February 2024.

Comments: Corresponding Author: Michelle Shardell, PhD (Email: [email protected])

arXiv:2311.17867 [pdf, other]

A Class of Directed Acyclic Graphs with Mixed Data Types in Mediation Analysis

Authors: Wei Hao, Canyi Chen, Peter X. -K. Song

Abstract: We propose a unified class of generalized structural equation models (GSEMs) with data of mixed types in mediation analysis, including continuous, categorical, and count variables. Such models extend substantially the classical linear structural equation model to accommodate many data types arising from the application of mediation analysis. Invoking the hierarchical modeling approach, we specify GSEMs by a copula joint distribution of outcome variable, mediator and exposure variable, in which marginal distributions are built upon generalized linear models (GLMs) with confounding factors. We discuss the identifiability conditions for the causal mediation effects in the counterfactual paradigm as well as the issue of mediation leakage, and develop an asymptotically efficient profile maximum likelihood estimation and inference for two key mediation estimands, natural direct effect and natural indirect effect, in different scenarios of mixed data types. The proposed new methodology is illustrated by a motivating epidemiological study that aims to investigate whether the tempo of reaching infancy BMI peak (delay or on time), an important early life growth milestone, may mediate the association between prenatal exposure to phthalates and pubertal health outcomes.

Submitted 4 December, 2023; v1 submitted 29 November, 2023; originally announced November 2023.

Comments: 33 pages, 3 figures, 3 tables

arXiv:2310.18527 [pdf, other]

Multiple Imputation Method for High-Dimensional Neuroimaging Data

Authors: Tong Lu, Chixiang Chen, Hsin-Hsiung Huang, Peter Kochunov, Elliot Hong, Shuo Chen

Abstract: Missingness is a common issue for neuroimaging data, and neglecting it in downstream statistical analysis can introduce bias and lead to misguided inferential conclusions. It is therefore crucial to conduct appropriate statistical methods to address this issue. While multiple imputation is a popular technique for handling missing data, its application to neuroimaging data is hindered by high dimensionality and complex dependence structures of multivariate neuroimaging variables. To tackle this challenge, we propose a novel approach, named High Dimensional Multiple Imputation (HIMA), based on Bayesian models. HIMA develops a new computational strategy for sampling large covariance matrices based on a robustly estimated posterior mode, which drastically enhances computational efficiency and numerical stability. To assess the effectiveness of HIMA, we conducted extensive simulation studies and real-data analysis using neuroimaging data from a Schizophrenia study. HIMA showcases a computational efficiency improvement of over 2000 times when compared to traditional approaches, while also producing imputed datasets with improved precision and stability.

Submitted 27 October, 2023; originally announced October 2023.

Comments: 13 pages, 5 figures

arXiv:2308.08217 [pdf, other]

Matching with multiple criteria and its application to health disparities research

Authors: Chang Chen, Zhiyu Qian, Bo Zhang

Abstract: Matching is a popular nonparametric covariate adjustment strategy in empirical health services research. Matching helps construct two groups comparable in many baseline covariates but different in some key aspects under investigation. In health disparities research, it is desirable to understand the contributions of various modifiable factors, like income and insurance type, to the observed disparity in access to health services between different groups. To single out the contributions from the factors of interest, we propose a statistical matching methodology that constructs nested matched comparison groups from, for instance, White men, that resemble the target group, for instance, black men, in some selected covariates while remaining identical to the white men population before matching in the remaining covariates. Using the proposed method, we investigated the disparity gaps between white men and black men in the US in prostate-specific antigen (PSA) screening based on the 2020 Behavioral Risk Factor Surveillance System (BRFSS) database. We found a widening PSA screening rate as the white matched comparison group increasingly resembles the black men group and quantified the contribution of modifiable factors like socioeconomic status. Finally, we provide code that replicates the case study and a tutorial that enables users to design customized matched comparison groups satisfying multiple criteria.

Submitted 16 August, 2023; originally announced August 2023.

arXiv:2307.08237 [pdf, other]

doi 10.1145/3580305.3599763

A Look into Causal Effects under Entangled Treatment in Graphs: Investigating the Impact of Contact on MRSA Infection

Authors: Jing Ma, Chen Chen, Anil Vullikanti, Ritwick Mishra, Gregory Madden, Daniel Borrajo, Jundong Li

Abstract: Methicillin-resistant Staphylococcus aureus (MRSA) is a type of bacteria resistant to certain antibiotics, making it difficult to prevent MRSA infections. Among decades of efforts to conquer infectious diseases caused by MRSA, many studies have been proposed to estimate the causal effects of close contact (treatment) on MRSA infection (outcome) from observational data. In this problem, the treatment assignment mechanism plays a key role as it determines the patterns of missing counterfactuals -- the fundamental challenge of causal effect estimation. Most existing observational studies for causal effect learning assume that the treatment is assigned individually for each unit. However, on many occasions, the treatments are assigned pairwise for units that are connected in graphs, i.e., the treatments of different units are entangled. Neglecting the entangled treatments can impede the causal effect estimation. In this paper, we study the problem of causal effect estimation with treatment entangled in a graph. Despite a few explorations for entangled treatments, this problem remains difficult due to the following challenges: (1) the entanglement brings difficulties in modeling and leveraging the unknown treatment assignment mechanism; (2) there may exist hidden confounders which lead to confounding biases in causal effect estimation; (3) the observational data is often time-varying. To tackle these challenges, we propose a novel method NEAT, which explicitly leverages the graph structure to model the treatment assignment mechanism, and mitigates confounding biases based on the treatment assignment modeling. We also extend our method into a dynamic setting to handle time-varying observational data. Experiments on both synthetic datasets and a real-world MRSA dataset validate the effectiveness of the proposed method, and provide insights for future applications.

Submitted 17 July, 2023; originally announced July 2023.

arXiv:2305.19947 [pdf, other]

A Geometric Perspective on Diffusion Models

Authors: Defang Chen, Zhenyu Zhou, Jian-Ping Mei, Chunhua Shen, Chun Chen, Can Wang

Abstract: Recent years have witnessed significant progress in developing effective training and fast sampling techniques for diffusion models. A remarkable advancement is the use of stochastic differential equations (SDEs) and their marginal-preserving ordinary differential equations (ODEs) to describe data perturbation and generative modeling in a unified framework. In this paper, we carefully inspect the ODE-based sampling of a popular variance-exploding SDE and reveal several intriguing structures of its sampling dynamics. We discover that the data distribution and the noise distribution are smoothly connected with a quasi-linear sampling trajectory and another implicit denoising trajectory that even converges faster. Meanwhile, the denoising trajectory governs the curvature of the corresponding sampling trajectory and its various finite differences yield all second-order samplers used in practice. Furthermore, we establish a theoretical relationship between the optimal ODE-based sampling and the classic mean-shift (mode-seeking) algorithm, with which we can characterize the asymptotic behavior of diffusion models and identify the empirical score deviation.

Submitted 30 September, 2023; v1 submitted 31 May, 2023; originally announced May 2023.

Comments: 38 pages
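
For readers unfamiliar with the classic mean-shift (mode-seeking) algorithm mentioned in the abstract above, the following is a minimal sketch of it in its textbook form with a Gaussian kernel. It is not the paper's sampler or code: the function names, the bandwidth, and the toy two-cluster data are all assumptions made for illustration.

```python
import numpy as np

def mean_shift_step(x, data, bandwidth):
    """One mean-shift update: move x to the kernel-weighted mean of the data."""
    sq_dist = np.sum((data - x) ** 2, axis=1)
    weights = np.exp(-sq_dist / (2.0 * bandwidth ** 2))  # Gaussian kernel weights
    return weights @ data / weights.sum()

def mean_shift(x, data, bandwidth, n_iter=100, tol=1e-8):
    """Iterate the update until the point stops moving (it settles near a density mode)."""
    for _ in range(n_iter):
        x_new = mean_shift_step(x, data, bandwidth)
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Hypothetical toy data: two 1-D clusters centred near 0 and 5.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.0, 0.3, (50, 1)),
                       rng.normal(5.0, 0.3, (50, 1))])

# Starting closer to the right-hand cluster, the iterate drifts to its mode.
mode = mean_shift(np.array([4.0]), data, bandwidth=0.5)
```

Each step replaces the current point by a weighted average of the data, so the iterate climbs toward a nearby mode of the kernel density estimate.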

arXiv:2303.03520 [pdf, other]

The Effect of Alcohol Consumption on Brain Ageing: A New Causal Inference Framework for Incomplete and Massive Phenomic Data

Authors: Chixiang Chen, Shuo Chen, Zhenyao Ye, Xu Shi, Tianzhou Ma

Abstract: Although substance use, such as alcohol consumption, is known to be associated with cognitive decline during ageing, its direct influence on the central nervous system remains unclear. In this study, we aim to investigate the potential influence of alcohol intake frequency on accelerated brain ageing by estimating the mean potential brain-age gap (BAG) index, the difference between brain age and actual age, under different alcohol intake frequencies in a large UK Biobank (UKB) cohort with extensive phenomic data reflecting a comprehensive life-style profile. We face two major challenges: (1) a large number of phenomic variables as potential confounders and (2) a small proportion of participants with complete phenomic data. To address these challenges, we first develop a new ensemble learning framework to establish robust estimation of mean potential outcome in the presence of many confounders. We then construct a data integration step to borrow information from UKB participants with incomplete phenomic data to improve efficiency. Our analysis results reveal that daily intake or even a few times a week may have significant effects on accelerating brain ageing. Moreover, extensive numerical studies demonstrate the superiority of our method over competing methods, in terms of smaller estimation bias and variability.

Submitted 4 March, 2024; v1 submitted 6 March, 2023; originally announced March 2023.

Comments: Contact: [email protected]

arXiv:2303.03512 [pdf, other]

An Efficient Data Integration Scheme for Synthesizing Information from Multiple Secondary Datasets for the Parameter Inference of the Main Analysis

Authors: Chixiang Chen, Ming Wang, Shuo Chen

Abstract: Many observational studies and clinical trials collect various secondary outcomes that may be highly correlated with the primary endpoint. These secondary outcomes are often analyzed in secondary analyses separately from the main data analysis. However, these secondary outcomes can be used to improve the estimation precision in the main analysis. We propose a method called Multiple Information Borrowing (MinBo) that borrows information from secondary data (containing secondary outcomes and covariates) to improve the efficiency of the main analysis. The proposed method is robust against model misspecification of the secondary data. Both theoretical and case studies demonstrate that MinBo outperforms existing methods in terms of efficiency gain. We apply MinBo to data from the Atherosclerosis Risk in Communities study to assess risk factors for hypertension.

Submitted 6 March, 2023; originally announced March 2023.

Comments: Contact Email: [email protected]

arXiv:2303.03502 [pdf, other]

doi 10.1002/sim.9982

Analyzing Risk Factors for Post-Acute Recovery in Older Adults with Alzheimer's Disease and Related Dementia: A New Semi-Parametric Model for Large-Scale Medicare Claims

Authors: Biyi Shen, Haoyu Ren, Michelle Shardell, Jason Falvey, Chixiang Chen

Abstract: Nearly 300,000 older adults experience a hip fracture every year, the majority of which occur following a fall. Unfortunately, recovery after fall-related trauma such as hip fracture is poor, where older adults diagnosed with Alzheimer's Disease and Related Dementia (ADRD) spend a particularly long time in hospitals or rehabilitation facilities during the post-operative recuperation period. Because older adults value functional recovery and spending time at home versus facilities as key outcomes after hospitalization, identifying factors that influence days spent at home after hospitalization is imperative. While several individual-level factors have been identified, the characteristics of the treating hospital have recently been identified as contributors. However, few methodological rigorous approaches are available to help overcome potential sources of bias such as hospital-level unmeasured confounders, informative hospital size, and loss to follow-up due to death. This article develops a useful tool equipped with unsupervised learning to simultaneously handle statistical complexities that are often encountered in health services research, especially when using large administrative claims databases. The proposed estimator has a closed form, thus only requiring light computation load in a large-scale study. We further develop its asymptotic properties that can be used to make statistical inference in practice. Extensive simulation studies demonstrate superiority of the proposed estimator compared to existing estimators.

Submitted 1 February, 2024; v1 submitted 6 March, 2023; originally announced March 2023.

Comments: Published in Statistics in Medicine. Contact Emails: [email protected]

arXiv:2303.03497 [pdf, other]

Integrative data analysis where partial covariates have complex non-linear effects by using summary information from an external data

Authors: Jia Liang, Shuo Chen, Peter Kochunov, L Elliot Hong, Chixiang Chen

Abstract: A full parametric and linear specification may be insufficient to capture complicated patterns in studies exploring complex features, such as those investigating age-related changes in brain functional abilities. Alternatively, a partially linear model (PLM) consisting of both parametric and non-parametric elements may have a better fit. This model has been widely applied in economics, environmental science, and biomedical studies. In this paper, we introduce a novel statistical inference framework that equips PLM with high estimation efficiency by effectively synthesizing summary information from external data into the main analysis. Such an integrative scheme is versatile in assimilating various types of reduced models from the external study. The proposed method is shown to be theoretically valid and numerically convenient, and it ensures a high-efficiency gain compared to classic methods in PLM. Our method is further validated using two data applications by evaluating the risk factors of brain imaging measures and blood pressure.

Submitted 5 February, 2024; v1 submitted 6 March, 2023; originally announced March 2023.

Comments: Contact Email: chixiang.chen [at] som [dot] umaryland [dot]edu

arXiv:2302.01861 [pdf, other]

Covariance Matrix Estimation for High-Throughput Biomedical Data with Interconnected Communities

Authors: Yifan Yang, Chixiang Chen, Shuo Chen

Abstract: Estimating a covariance matrix is central to high-dimensional data analysis. Empirical analyses of high-dimensional biomedical data, including genomics, proteomics, microbiome, and neuroimaging, among others, consistently reveal strong modularity in the dependence patterns. In these analyses, intercorrelated high-dimensional biomedical features often form communities or modules that can be interconnected with others. While the interconnected community structure has been extensively studied in biomedical research (e.g., gene co-expression networks), its potential to assist in the estimation of covariance matrices remains largely unexplored. To address this gap, we propose a procedure that leverages the commonly observed interconnected community structure in high-dimensional biomedical data to estimate large covariance and precision matrices. We derive the uniformly minimum variance unbiased estimators for covariance and precision matrices in closed forms and provide theoretical results on their asymptotic properties. Our proposed method enhances the accuracy of covariance- and precision-matrix estimation and demonstrates superior performance compared to the competing methods in both simulations and real data analyses.

Submitted 15 November, 2023; v1 submitted 3 February, 2023; originally announced February 2023.

Comments: 24 pages, 3 figures

arXiv:2302.00239 [pdf, other]

Filtering Context Mitigates Scarcity and Selection Bias in Political Ideology Prediction

Authors: Chen Chen, Dylan Walker, Venkatesh Saligrama

Abstract: We propose a novel supervised learning approach for political ideology prediction (PIP) that is capable of predicting out-of-distribution inputs. This problem is motivated by the fact that manual data-labeling is expensive, while self-reported labels are often scarce and exhibit significant selection bias. We propose a novel statistical model that decomposes the document embeddings into a linear superposition of two vectors; a latent neutral \emph{context} vector independent of ideology, and a latent \emph{position} vector aligned with ideology. We train an end-to-end model that has intermediate contextual and positional vectors as outputs. At deployment time, our model predicts labels for input documents by exclusively leveraging the predicted positional vectors. On two benchmark datasets we show that our model is capable of outputting predictions even when trained with as little as 5\% biased data, and is significantly more accurate than the state-of-the-art. Through crowd-sourcing we validate the neutrality of contextual vectors, and show that context filtering results in ideological concentration, allowing for prediction on out-of-distribution examples.

Submitted 31 January, 2023; originally announced February 2023.

arXiv:2211.15849 [pdf, other]

Association between author metadata and acceptance: A feature-rich, matched observational study of a corpus of ICLR submissions between 2017-2022

Authors: Chang Chen, Jiayao Zhang, Dan Roth, Ting Ye, Bo Zhang

Abstract: Many recent studies have probed status bias in the peer-review process of academic journals and conferences. In this article, we investigated the association between author metadata and area chairs' final decisions (Accept/Reject) using our compiled database of 5,313 borderline submissions to the International Conference on Learning Representations (ICLR) from 2017 to 2022. We carefully defined elements in a cause-and-effect analysis, including the treatment and its timing, pre-treatment variables, potential outcomes and causal null hypothesis of interest, all in the context of study units being textual data and under Neyman and Rubin's potential outcomes (PO) framework. We found some weak evidence that author metadata was associated with articles' final decisions. We also found that, under an additional stability assumption, borderline articles from high-ranking institutions (top-30% or top-20%) were less favored by area chairs compared to their matched counterparts. The results were consistent in two different matched designs (odds ratio = 0.82 [95% CI: 0.67 to 1.00] in a first design and 0.83 [95% CI: 0.64 to 1.07] in a strengthened design). We discussed how to interpret these results in the context of multiple interactions between a study unit and different agents (reviewers and area chairs) in the peer-review system.

Submitted 28 November, 2022; originally announced November 2022.

arXiv:2211.15770 [pdf, other]

doi 10.3389/fams.2023.1153184

Accelerated Nonnegative Tensor Completion via Integer Programming

Authors: Wenhao Pan, Anil Aswani, Chen Chen

Abstract: The problem of tensor completion has applications in healthcare, computer vision, and other domains. However, past approaches to tensor completion have faced a tension in that they either have polynomial-time computation but require exponentially more samples than the information-theoretic rate, or they use fewer samples but require solving NP-hard problems for which there are no known practical algorithms. A recent approach, based on integer programming, resolves this tension for nonnegative tensor completion. It achieves the information-theoretic sample complexity rate and deploys the Blended Conditional Gradients algorithm, which requires a linear (in numerical tolerance) number of oracle steps to converge to the global optimum. The tradeoff in this approach is that, in the worst case, the oracle step requires solving an integer linear program. Despite this theoretical limitation, numerical experiments show that this algorithm can, on certain instances, scale up to 100 million entries while running on a personal computer. The goal of this paper is to further enhance this algorithm, with the intention to expand both the breadth and scale of instances that can be solved. We explore several variants that can maintain the same theoretical guarantees as the algorithm, but offer potentially faster computation. We consider different data structures, acceleration of gradient descent steps, and the use of the Blended Pairwise Conditional Gradients algorithm. We describe the original approach and these variants, and conduct numerical experiments in order to explore various tradeoffs in these algorithmic design choices.

Submitted 4 February, 2023; v1 submitted 28 November, 2022; originally announced November 2022.

Comments: 23 pages. Abstract accepted by Frontiers in Applied Mathematics and Statistics. Full manuscript submitted and under review

arXiv:2210.05797 [pdf, other]

Joint Modeling for Geometry and Functionality of Cerebral Cortical Surface Images

Authors: Jingjing Zou, Chi-Hua Chen, John A. D. Aston

Abstract: We propose a framework for jointly modeling the geometry and functionality in high dimensional functional surfaces. The proposed mixed effects model characterizes effects of subject-specific covariates and exogenous stimuli on functional surfaces while accounting for potential mutual-influence of their geometry and functionality. This is achieved through a computationally efficient estimation method that incorporates regularized estimation of the precision matrix of the random effects. We perform a thorough analysis of cerebral cortical surface structural MRI and task fMRI data from the Human Connectome Project and discover relationships between the geometric shapes of cortical surface and neuronal activation responding to task stimuli. Our findings highlight new modes of correspondence between cortical surface shape and functional activation relevant to emotion processing.

Submitted 9 February, 2023; v1 submitted 11 October, 2022; originally announced October 2022.

arXiv:2207.05945 [pdf, other]

Online Active Regression

Authors: Cheng Chen, Yi Li, Yiming Sun

Abstract: Active regression considers a linear regression problem where the learner receives a large number of data points but can only observe a small number of labels. Since online algorithms can deal with incremental training data and take advantage of low computational cost, we consider an online extension of the active regression problem: the learner receives data points one by one and immediately decides whether it should collect the corresponding labels. The goal is to efficiently maintain the regression of received data points with a small budget of label queries. We propose novel algorithms for this problem under $\ell_p$ loss where $p\in[1,2]$. To achieve a $(1+ε)$-approximate solution, our proposed algorithms only require $\tilde{\mathcal{O}}(ε^{-1} d \log(nκ))$ queries of labels, where $n$ is the number of data points and $κ$ is a quantity, called the condition number, of the data points. The numerical results verify our theoretical results and show that our methods have comparable performance with offline active regression algorithms.

Submitted 30 August, 2022; v1 submitted 12 July, 2022; originally announced July 2022.

Comments: A preliminary version appeared in the Proceedings of the 39th International Conference on Machine Learning (ICML 2022), PMLR 162, pp 3320--3335, 2022. v2: optimal dependence on $ε$ in query complexity

arXiv:2206.04993 [pdf, other]

The Symmetric Generalized Eigenvalue Problem as a Nash Equilibrium

Authors: Ian Gemp, Charlie Chen, Brian McWilliams

Abstract: The symmetric generalized eigenvalue problem (SGEP) is a fundamental concept in numerical linear algebra. It captures the solution of many classical machine learning problems such as canonical correlation analysis, independent components analysis, partial least squares, linear discriminant analysis, principal components and others. Despite this, most general solvers are prohibitively expensive when dealing with streaming data sets (i.e., minibatches) and research has instead concentrated on finding efficient solutions to specific problem instances. In this work, we develop a game-theoretic formulation of the top-$k$ SGEP whose Nash equilibrium is the set of generalized eigenvectors. We also present a parallelizable algorithm with guaranteed asymptotic convergence to the Nash. Current state-of-the-art methods require $O(d^2k)$ runtime complexity per iteration which is prohibitively expensive when the number of dimensions ($d$) is large. We show how to modify this parallel approach to achieve $O(dk)$ runtime complexity. Empirically we demonstrate that this resulting algorithm is able to solve a variety of SGEP problem instances including a large-scale analysis of neural network activations.

Submitted 25 April, 2023; v1 submitted 10 June, 2022; originally announced June 2022.

Comments: Published in ICLR 2023 (JAX code available as part of github.com/deepmind/eigengame)
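
As background for the entry above, the top-$k$ SGEP asks for the $k$ largest solutions of $A v = \lambda B v$ with $A$ symmetric and $B$ symmetric positive definite. The sketch below solves a small dense instance with a standard Cholesky reduction. It is a reference baseline under assumed toy inputs, not the game-theoretic algorithm of the paper; the function name and the random matrices are hypothetical.

```python
import numpy as np

def top_k_generalized_eigenvectors(A, B, k):
    """Top-k solutions of A v = lambda B v (A symmetric, B symmetric positive definite)."""
    L = np.linalg.cholesky(B)             # B = L L^T
    L_inv = np.linalg.inv(L)
    C = L_inv @ A @ L_inv.T               # equivalent standard problem C y = lambda y
    eigvals, Y = np.linalg.eigh(C)        # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k] # indices of the k largest eigenvalues
    V = L_inv.T @ Y[:, order]             # map back: v = L^{-T} y
    return eigvals[order], V

# Hypothetical small instance.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M + M.T                               # symmetric
N = rng.standard_normal((5, 5))
B = N @ N.T + 5.0 * np.eye(5)             # symmetric positive definite

vals, vecs = top_k_generalized_eigenvectors(A, B, k=2)
```

The returned vectors are $B$-orthonormal (`vecs.T @ B @ vecs` is the identity). This dense route costs on the order of $d^3$ operations and needs the full matrices, which is the kind of expense the abstract says streaming methods aim to avoid.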

arXiv:2206.03914 [pdf, other]

Spatio-temporal Downscaling Emulator for Regional Climate Models: a Comparative Study

Authors: Luis A. Barboza, Shu Wei Chou Chen, Marcela Alfaro Córdoba, Eric J. Alfaro, Hugo G. Hidalgo

Abstract: Regional Climate Models (RCM) describe the meso scale global atmospheric and oceanic dynamics and serve as dynamical downscaling models. In other words, RCMs use atmospheric and oceanic climate output from General Circulation Models (GCM) to develop a higher resolution climate output. They are computationally demanding and, depending on the application, require several orders of magnitude of computer time more than statistical climate downscaling. In this paper we describe how to use a spatio-temporal statistical model with varying coefficients (VC), as a downscaling emulator for a RCM using varying coefficients. In order to estimate the proposed model, two options are compared: INLA, and varycoef. We set up a simulation to compare the performance of both methods for building a statistical downscaling emulator for RCM, and then show that the emulator works properly for NARCCAP data. The results show that the model is able to estimate non-stationary marginal effects, which means that the downscaling output can vary over space. Furthermore, the model has flexibility to estimate the mean of any variable in space and time, and has good prediction results. INLA was the fastest method for all the cases, and the approximation with best accuracy to estimate the different parameters from the model and the posterior distribution of the response variable.

Submitted 13 March, 2023; v1 submitted 8 June, 2022; originally announced June 2022.

MSC Class: 62P12

arXiv:2206.02946 [pdf, other]

On the Convergence of Optimizing Persistent-Homology-Based Losses

Authors: Yikai Zhang, Jiachen Yao, Yusu Wang, Chao Chen

Abstract: Topological loss based on persistent homology has shown promise in various applications. A topological loss enforces the model to achieve certain desired topological property. Despite its empirical success, less is known about the optimization behavior of the loss. In fact, the topological loss involves combinatorial configurations that may oscillate during optimization. In this paper, we introduce a general purpose regularized topology-aware loss. We propose a novel regularization term and also modify existing topological loss. These contributions lead to a new loss function that not only enforces the model to have desired topological behavior, but also achieves satisfying convergence behavior. Our main theoretical result guarantees that the loss can be optimized efficiently, under mild assumptions.

Submitted 11 June, 2022; v1 submitted 6 June, 2022; originally announced June 2022.

arXiv:2205.12243 [pdf, other]

EBM Life Cycle: MCMC Strategies for Synthesis, Defense, and Density Modeling

Authors: Mitch Hill, Jonathan Mitchell, Chu Chen, Yuan Du, Mubarak Shah, Song-Chun Zhu

Abstract: This work presents strategies to learn an Energy-Based Model (EBM) according to the desired length of its MCMC sampling trajectories. MCMC trajectories of different lengths correspond to models with different purposes. Our experiments cover three different trajectory magnitudes and learning outcomes: 1) shortrun sampling for image generation; 2) midrun sampling for classifier-agnostic adversarial defense; and 3) longrun sampling for principled modeling of image probability densities. To achieve these outcomes, we introduce three novel methods of MCMC initialization for negative samples used in Maximum Likelihood (ML) learning. With standard network architectures and an unaltered ML objective, our MCMC initialization methods alone enable significant performance gains across the three applications that we investigate. Our results include state-of-the-art FID scores for unnormalized image densities on the CIFAR-10 and ImageNet datasets; state-of-the-art adversarial defense on CIFAR-10 among purification methods and the first EBM defense on ImageNet; and scalable techniques for learning valid probability densities. Code for this project can be found at https://github.com/point0bar1/ebm-life-cycle.

Submitted 24 May, 2022; originally announced May 2022.

arXiv:2204.09125 [pdf]

Mobility Analysis Workflow (MAW): An accessible, interoperable, and reproducible container system for processing raw mobile data

Authors: Xiangyang Guan, Cynthia Chen, Ian Ren, Ka Yee Yeung, Ling-Hong Hung, Wes J. Lloyd

Abstract: Mobility analysis, or understanding and modeling of people's mobility patterns in terms of when, where, and how people move from one place to another, is fundamentally important as such information is the basis for large-scale investment decisions on the nation's multi-modal transportation infrastructure. Recent rise of using passively generated mobile data from mobile devices have raised questions on using such data for capturing the mobility patterns of a population because: 1) there is a great variety of different kinds of mobile data and their respective properties are unknown; and 2) data pre-processing and analysis methods are often not explicitly reported. The high stakes involved with mobility analysis and issues associated with the passively generated mobile data call for mobility analysis (including data, methods and results) to be accessible to all, interoperable across different computing systems, reproducible and reusable by others. In this study, a container system named Mobility Analysis Workflow (MAW) that integrates data, methods and results, is developed. Built upon the containerization technology, MAW allows its users to easily create, configure, modify, execute and share their methods and results in the form of Docker containers. Tools for operationalizing MAW are also developed and made publicly available on GitHub. One use case of MAW is the comparative analysis for the impacts of different pre-processing and mobility analysis methods on inferred mobility patterns. This study finds that different pre-processing and analysis methods do have impacts on the resulting mobility patterns. The creation of MAW and a better understanding of the relationship between data, methods and resulting mobility patterns as facilitated by MAW represent an important first step toward promoting reproducibility and reusability in mobility analysis with passively-generated data.

Submitted 19 April, 2022; originally announced April 2022.

MSC Class: 91C20 ACM Class: J.6

arXiv:2203.14206 [pdf, other]

Denoising Likelihood Score Matching for Conditional Score-based Data Generation

Authors: Chen-Hao Chao, Wei-Fang Sun, Bo-Wun Cheng, Yi-Chen Lo, Chia-Che Chang, Yu-Lun Liu, Yu-Lin Chang, Chia-Ping Chen, Chun-Yi Lee

Abstract: Many existing conditional score-based data generation methods utilize Bayes' theorem to decompose the gradients of a log posterior density into a mixture of scores. These methods facilitate the training procedure of conditional score models, as a mixture of scores can be separately estimated using a score model and a classifier. However, our analysis indicates that the training objectives for the classifier in these methods may lead to a serious score mismatch issue, which corresponds to the situation that the estimated scores deviate from the true ones. Such an issue causes the samples to be misled by the deviated scores during the diffusion process, resulting in a degraded sampling quality. To resolve it, we formulate a novel training objective, called Denoising Likelihood Score Matching (DLSM) loss, for the classifier to match the gradients of the true log likelihood density. Our experimental evidence shows that the proposed method outperforms the previous methods on both Cifar-10 and Cifar-100 benchmarks noticeably in terms of several key evaluation metrics. We thus conclude that, by adopting DLSM, the conditional scores can be accurately modeled, and the effect of the score mismatch issue is alleviated.

Submitted 27 March, 2022; originally announced March 2022.

Comments: ICLR 2022

arXiv:2202.08695 [pdf, other]

Article's Scientific Prestige: measuring the impact of individual articles in the Web of Science

Authors: Ying Chen, Thorsten Koch, Nazgul Zakiyeva, Kailiang Liu, Zhitong Xu, Chun-houh Chen, Junji Nakano, Keisuke Honda

Abstract: We performed a citation analysis on the Web of Science publications consisting of more than 63 million articles and 1.45 billion citations on 254 subjects from 1981 to 2020. We proposed the Article's Scientific Prestige (ASP) metric and compared this metric to number of citations (#Cit) and journal grade in measuring the scientific impact of individual articles in the large-scale hierarchical and multi-disciplined citation network. In contrast to #Cit, ASP, that is computed based on the eigenvector centrality, considers both direct and indirect citations, and provides steady-state evaluation cross different disciplines. We found that ASP and #Cit are not aligned for most articles, with a growing mismatch amongst the less cited articles. While both metrics are reliable for evaluating the prestige of articles such as Nobel Prize winning articles, ASP tends to provide more persuasive rankings than #Cit when the articles are not highly cited. The journal grade, that is eventually determined by a few highly cited articles, is unable to properly reflect the scientific impact of individual articles. The number of references and coauthors are less relevant to scientific impact, but subjects do make a difference.

Submitted 17 February, 2022; originally announced February 2022.

arXiv:2202.01034 [pdf, other]

Diagnosing failures of fairness transfer across distribution shift in real-world medical settings

Authors: Jessica Schrouff, Natalie Harris, Oluwasanmi Koyejo, Ibrahim Alabdulmohsin, Eva Schnider, Krista Opsahl-Ong, Alex Brown, Subhrajit Roy, Diana Mincu, Christina Chen, Awa Dieng, Yuan Liu, Vivek Natarajan, Alan Karthikesalingam, Katherine Heller, Silvia Chiappa, Alexander D'Amour

Abstract: Diagnosing and mitigating changes in model fairness under distribution shift is an important component of the safe deployment of machine learning in healthcare settings. Importantly, the success of any mitigation strategy strongly depends on the structure of the shift. Despite this, there has been little discussion of how to empirically assess the structure of a distribution shift that one is encountering in practice. In this work, we adopt a causal framing to motivate conditional independence tests as a key tool for characterizing distribution shifts. Using our approach in two medical applications, we show that this knowledge can help diagnose failures of fairness transfer, including cases where real-world shifts are more complex than is often assumed in the literature. Based on these results, we discuss potential remedies at each step of the machine learning pipeline.

Submitted 10 February, 2023; v1 submitted 2 February, 2022; originally announced February 2022.

Journal ref: Advances in Neural Information Processing Systems 35 (NeurIPS 2022)

arXiv:2201.09644 [pdf, other]

Multiscale Generative Models: Improving Performance of a Generative Model Using Feedback from Other Dependent Generative Models

Authors: Changyu Chen, Avinandan Bose, Shih-Fen Cheng, Arunesh Sinha

Abstract: Realistic fine-grained multi-agent simulation of real-world complex systems is crucial for many downstream tasks such as reinforcement learning. Recent work has used generative models (GANs in particular) for providing high-fidelity simulation of real-world systems. However, such generative models are often monolithic and miss out on modeling the interaction in multi-agent systems. In this work, we take a first step towards building multiple interacting generative models (GANs) that reflects the interaction in real world. We build and analyze a hierarchical set-up where a higher-level GAN is conditioned on the output of multiple lower-level GANs. We present a technique of using feedback from the higher-level GAN to improve performance of lower-level GANs. We mathematically characterize the conditions under which our technique is impactful, including understanding the transfer learning nature of our set-up. We present three distinct experiments on synthetic data, time series data, and image domain, revealing the wide applicability of our technique.

Submitted 24 February, 2022; v1 submitted 24 January, 2022; originally announced January 2022.

arXiv:2112.09086 [pdf, ps, other]

A new locally linear embedding scheme in light of Hessian eigenmap

Authors: Liren Lin, Chih-Wei Chen

Abstract: We provide a new interpretation of Hessian locally linear embedding (HLLE), revealing that it is essentially a variant way to implement the same idea of locally linear embedding (LLE). Based on the new interpretation, a substantial simplification can be made, in which the idea of "Hessian" is replaced by rather arbitrary weights. Moreover, we show by numerical examples that HLLE may produce projection-like results when the dimension of the target space is larger than that of the data manifold, and hence one further modification concerning the manifold dimension is suggested. Combining all the observations, we finally achieve a new LLE-type method, which is called tangential LLE (TLLE). It is simpler and more robust than HLLE.

Submitted 16 December, 2021; originally announced December 2021.

Comments: 13 pages

MSC Class: 62-07

arXiv:2111.10178 [pdf, other]

Understanding Training-Data Leakage from Gradients in Neural Networks for Image Classification

Authors: Cangxiong Chen, Neill D. F. Campbell

Abstract: Federated learning of deep learning models for supervised tasks, e.g. image classification and segmentation, has found many applications: for example in human-in-the-loop tasks such as film post-production where it enables sharing of domain expertise of human artists in an efficient and effective fashion. In many such applications, we need to protect the training data from being leaked when gradients are shared in the training process due to IP or privacy concerns. Recent works have demonstrated that it is possible to reconstruct the training data from gradients for an image-classification model when its architecture is known. However, there is still an incomplete theoretical understanding of the efficacy and failure of such attacks. In this paper, we analyse the source of training-data leakage from gradients. We formulate the problem of training data reconstruction as solving an optimisation problem iteratively for each layer. The layer-wise objective function is primarily defined by weights and gradients from the current layer as well as the output from the reconstruction of the subsequent layer, but it might also involve a 'pull-back' constraint from the preceding layer. Training data can be reconstructed when we solve the problem backward from the output of the network through each layer. Based on this formulation, we are able to attribute the potential leakage of the training data in a deep network to its architecture. We also propose a metric to measure the level of security of a deep learning model against gradient-based attacks on the training data.

Submitted 19 November, 2021; originally announced November 2021.

arXiv:2111.08906 [pdf, other]

Using Sampling to Estimate and Improve Performance of Automated Scoring Systems with Guarantees

Authors: Yaman Kumar Singla, Sriram Krishna, Rajiv Ratn Shah, Changyou Chen

Abstract: Automated Scoring (AS), the natural language processing task of scoring essays and speeches in an educational testing setting, is growing in popularity and being deployed across contexts from government examinations to companies providing language proficiency services. However, existing systems either forgo human raters entirely, thus harming the reliability of the test, or score every response by both human and machine thereby increasing costs. We target the spectrum of possible solutions in between, making use of both humans and machines to provide a higher quality test while keeping costs reasonable to democratize access to AS. In this work, we propose a combination of the existing paradigms, sampling responses to be scored by humans intelligently. We propose reward sampling and observe significant gains in accuracy (19.80% increase on average) and quadratic weighted kappa (QWK) (25.60% on average) with a relatively small human budget (30% samples) using our proposed sampling. The accuracy increase observed using standard random and importance sampling baselines are 8.6% and 12.2% respectively. Furthermore, we demonstrate the system's model agnostic nature by measuring its performance on a variety of models currently deployed in an AS setting as well as pseudo models. Finally, we propose an algorithm to estimate the accuracy/QWK with statistical guarantees (Our code is available at https://git.io/J1IOy).

Submitted 17 November, 2021; originally announced November 2021.

arXiv:2111.07052

Distribution and Determinants of Correlation between PM2.5 and O3 in China Mainland: Dynamitic simil-Hu Lines

Authors: Chenru Chen, Miaoqing Xu, Shuyi Liu, Dehai Zhu, Jianyu Yang, Bingbo Gao, Ziyue Chen

Abstract: In recent years, China has made great efforts to control air pollution. During the governance process, it is found that fine particulate matter (PM2.5) and ozone (O3) change in the same trend among some areas and the opposite in others, which brings some difficulties to take measures in a planned way. Therefore, this study adopted multi-year and large-scale air quality data to explore the distribution of correlation between PM2.5 and O3, and proposed a concept called dynamic similar hu lines to replace the single fixed division in the previous research. Furthermore, this study discussed the causes of distribution patterns quantitatively with geographical detector and random forest. The causes included natural factors and anthropogenic factors. And these factors could be divided into three parts according to the characteristics of spatial distribution: broadly changing with longitude, changing with latitude, and having local characteristics. Overall, regions with relatively more densely population, higher GDP, lower altitude, higher humidity, higher atmospheric pressure, higher surface temperature, less sunshine hours and more accumulated precipitation often corresponds to positive correlation coefficient between PM2.5 and O3, no matter in which season. The parts with opposite conditions that mentioned above are essentially negative correlation coefficient. And what's more, humidity, global surface temperature, air temperature and accumulated precipitation are four decisive factors to form the distribution of correlation between PM2.5 and O3. In general, collaborative governance of atmospheric pollutants should consider particular time and space background and also be based on the local actual socio-economic situations, geography and geomorphology, climate and meteorology and other comprehensive factors.

Submitted 30 September, 2022; v1 submitted 13 November, 2021; originally announced November 2021.

Comments: Our research group have decided to withdraw this preprint

arXiv:2111.04580 [pdf, other]

Nonnegative Tensor Completion via Integer Optimization

Authors: Caleb Bugg, Chen Chen, Anil Aswani

Abstract: Unlike matrix completion, tensor completion does not have an algorithm that is known to achieve the information-theoretic sample complexity rate. This paper develops a new algorithm for the special case of completion for nonnegative tensors. We prove that our algorithm converges in a linear (in numerical tolerance) number of oracle steps, while achieving the information-theoretic rate. Our approach is to define a new norm for nonnegative tensors using the gauge of a particular 0-1 polytope; integer linear programming can, in turn, be used to solve linear separation problems over this polytope. We combine this insight with a variant of the Frank-Wolfe algorithm to construct our numerical algorithm, and we demonstrate its effectiveness and scalability through computational experiments using a laptop on tensors with up to one-hundred million entries.

Submitted 23 May, 2022; v1 submitted 8 November, 2021; originally announced November 2021.

arXiv:2110.09360 [pdf, other]

Prediction of liquid fuel properties using machine learning models with Gaussian processes and probabilistic conditional generative learning

Authors: Rodolfo S. M. Freitas, Ágatha P. F. Lima, Cheng Chen, Fernando A. Rochinha, Daniel Mira, Xi Jiang

Abstract: Accurate determination of fuel properties of complex mixtures over a wide range of pressure and temperature conditions is essential to utilizing alternative fuels. The present work aims to construct cheap-to-compute machine learning (ML) models to act as closure equations for predicting the physical properties of alternative fuels. Those models can be trained using the database from MD simulations and/or experimental measurements in a data-fusion-fidelity approach. Here, Gaussian Process (GP) and probabilistic generative models are adopted. GP is a popular non-parametric Bayesian approach to build surrogate models mainly due to its capacity to handle the aleatory and epistemic uncertainties. Generative models have shown the ability of deep neural networks employed with the same intent. In this work, ML analysis is focused on a particular property, the fuel density, but it can also be extended to other physicochemical properties. This study explores the versatility of the ML models to handle multi-fidelity data. The results show that ML models can predict accurately the fuel properties of a wide range of pressure and temperature conditions.

Submitted 18 October, 2021; originally announced October 2021.

Comments: 22 pages, 13 figures

arXiv:2109.10957 [pdf, other]

Real Robot Challenge: A Robotics Competition in the Cloud

Authors: Stefan Bauer, Felix Widmaier, Manuel Wüthrich, Annika Buchholz, Sebastian Stark, Anirudh Goyal, Thomas Steinbrenner, Joel Akpo, Shruti Joshi, Vincent Berenz, Vaibhav Agrawal, Niklas Funk, Julen Urain De Jesus, Jan Peters, Joe Watson, Claire Chen, Krishnan Srinivasan, Junwu Zhang, Jeffrey Zhang, Matthew R. Walter, Rishabh Madan, Charles Schaff, Takahiro Maeda, Takuma Yoneda, Denis Yarats, et al. (17 additional authors not shown)

Abstract: Dexterous manipulation remains an open problem in robotics. To coordinate efforts of the research community towards tackling this problem, we propose a shared benchmark. We designed and built robotic platforms that are hosted at MPI for Intelligent Systems and can be accessed remotely. Each platform consists of three robotic fingers that are capable of dexterous object manipulation. Users are able to control the platforms remotely by submitting code that is executed automatically, akin to a computational cluster. Using this setup, i) we host robotics competitions, where teams from anywhere in the world access our platforms to tackle challenging tasks ii) we publish the datasets collected during these competitions (consisting of hundreds of robot hours), and iii) we give researchers access to these platforms for their own projects.

Submitted 10 June, 2022; v1 submitted 22 September, 2021; originally announced September 2021.

arXiv:2108.07301 [pdf, other]

Understanding the factors driving the opioid epidemic using machine learning

Authors: Sachin Gavali, Chuming Chen, Julie Cowart, Xi Peng, Shanshan Ding, Cathy Wu, Tammy Anderson

Abstract: In recent years, the US has experienced an opioid epidemic with an unprecedented number of drugs overdose deaths. Research finds such overdose deaths are linked to neighborhood-level traits, thus providing opportunity to identify effective interventions. Typically, techniques such as Ordinary Least Squares (OLS) or Maximum Likelihood Estimation (MLE) are used to document neighborhood-level factors significant in explaining such adverse outcomes. These techniques are, however, less equipped to ascertain non-linear relationships between confounding factors. Hence, in this study we apply machine learning based techniques to identify opioid risks of neighborhoods in Delaware and explore the correlation of these factors using Shapley Additive explanations (SHAP). We discovered that the factors related to neighborhoods environment, followed by education and then crime, were highly correlated with higher opioid risk. We also explored the change in these correlations over the years to understand the changing dynamics of the epidemic. Furthermore, we discovered that, as the epidemic has shifted from legal (i.e., prescription opioids) to illegal (e.g.,heroin and fentanyl) drugs in recent years, the correlation of environment, crime and health related variables with the opioid risk has increased significantly while the correlation of economic and socio-demographic variables has decreased. The correlation of education related factors has been higher from the start and has increased slightly in recent years suggesting a need for increased awareness about the opioid epidemic.

Submitted 6 December, 2021; v1 submitted 16 August, 2021; originally announced August 2021.

Comments: Accepted to IEEE International Conference on Bioinformatics & Biomedicine 2021

arXiv:2104.13417 [pdf, other]

Towards Fair Federated Learning with Zero-Shot Data Augmentation

Authors: Weituo Hao, Mostafa El-Khamy, Jungwon Lee, Jianyi Zhang, Kevin J Liang, Changyou Chen, Lawrence Carin

Abstract: Federated learning has emerged as an important distributed learning paradigm, where a server aggregates a global model from many client-trained models while having no access to the client data. Although it is recognized that statistical heterogeneity of the client local data yields slower global model convergence, it is less commonly recognized that it also yields a biased federated global model with a high variance of accuracy across clients. In this work, we aim to provide federated learning schemes with improved fairness. To tackle this challenge, we propose a novel federated learning system that employs zero-shot data augmentation on under-represented data to mitigate statistical heterogeneity and encourage more uniform accuracy performance across clients in federated networks. We study two variants of this scheme, Fed-ZDAC (federated learning with zero-shot data augmentation at the clients) and Fed-ZDAS (federated learning with zero-shot data augmentation at the server). Empirical results on a suite of datasets demonstrate the effectiveness of our methods on simultaneously improving the test accuracy and fairness.

Submitted 27 April, 2021; originally announced April 2021.

Comments: Accepted by IEEE CVPR Workshop on Fair, Data Efficient And Trusted Computer Vision

arXiv:2103.11251 [pdf, other]

Interpretable Machine Learning: Fundamental Principles and 10 Grand Challenges

Authors: Cynthia Rudin, Chaofan Chen, Zhi Chen, Haiyang Huang, Lesia Semenova, Chudi Zhong

Abstract: Interpretability in machine learning (ML) is crucial for high stakes decisions and troubleshooting. In this work, we provide fundamental principles for interpretable ML, and dispel common misunderstandings that dilute the importance of this crucial topic. We also identify 10 technical challenge areas in interpretable machine learning and provide history and background on each problem. Some of these problems are classically important, and some are recent problems that have arisen in the last few years. These problems are: (1) Optimizing sparse logical models such as decision trees; (2) Optimization of scoring systems; (3) Placing constraints into generalized additive models to encourage sparsity and better interpretability; (4) Modern case-based reasoning, including neural networks and matching for causal inference; (5) Complete supervised disentanglement of neural networks; (6) Complete or even partial unsupervised disentanglement of neural networks; (7) Dimensionality reduction for data visualization; (8) Machine learning models that can incorporate physics and other generative or causal constraints; (9) Characterization of the "Rashomon set" of good models; and (10) Interpretable reinforcement learning. This survey is suitable as a starting point for statisticians and computer scientists interested in working in interpretable machine learning.

Submitted 9 July, 2021; v1 submitted 20 March, 2021; originally announced March 2021.

MSC Class: 68T01 ACM Class: I.2.6

Journal ref: Statistics Surveys, 2021

arXiv:2103.07756 [pdf, other]

Learning with Feature-Dependent Label Noise: A Progressive Approach

Authors: Yikai Zhang, Songzhu Zheng, Pengxiang Wu, Mayank Goswami, Chao Chen

Abstract: Label noise is frequently observed in real-world large-scale datasets. The noise is introduced due to a variety of reasons; it is heterogeneous and feature-dependent. Most existing approaches to handling noisy labels fall into two categories: they either assume an ideal feature-independent noise, or remain heuristic without theoretical guarantees. In this paper, we propose to target a new family of feature-dependent label noise, which is much more general than commonly used i.i.d. label noise and encompasses a broad spectrum of noise patterns. Focusing on this general noise family, we propose a progressive label correction algorithm that iteratively corrects labels and refines the model. We provide theoretical guarantees showing that for a wide variety of (unknown) noise patterns, a classifier trained with this strategy converges to be consistent with the Bayes classifier. In experiments, our method outperforms SOTA baselines and is robust to various noise types and levels.

Submitted 27 March, 2021; v1 submitted 13 March, 2021; originally announced March 2021.

Comments: ICLR 2021 (Spotlight)

arXiv:2103.02156 [pdf, other]

Ridge-penalized adaptive Mantel test and its application in imaging genetics

Authors: Dustin Pluta, Tong Shen, Gui Xue, Chuansheng Chen, Hernando Ombao, Zhaoxia Yu

Abstract: We propose a ridge-penalized adaptive Mantel test (AdaMant) for evaluating the association of two high-dimensional sets of features. By introducing a ridge penalty, AdaMant tests the association across many metrics simultaneously. We demonstrate how ridge penalization bridges Euclidean and Mahalanobis distances and their corresponding linear models from the perspective of association measurement and testing. This result is not only theoretically interesting but also has important implications in penalized hypothesis testing, especially in high dimensional settings such as imaging genetics. Applying the proposed method to an imaging genetic study of visual working memory in health adults, we identified interesting associations of brain connectivity (measured by EEG coherence) with selected genetic features.

Submitted 20 March, 2021; v1 submitted 2 March, 2021; originally announced March 2021.

arXiv:2102.05274 [pdf, ps, other]

Stability of SGD: Tightness Analysis and Improved Bounds

Authors: Yikai Zhang, Wenjia Zhang, Sammy Bald, Vamsi Pingali, Chao Chen, Mayank Goswami

Abstract: Stochastic Gradient Descent (SGD) based methods have been widely used for training large-scale machine learning models that also generalize well in practice. Several explanations have been offered for this generalization performance, a prominent one being algorithmic stability [18]. However, there are no known examples of smooth loss functions for which the analysis can be shown to be tight. Furthermore, apart from the properties of the loss function, data distribution has also been shown to be an important factor in generalization performance. This raises the question: is the stability analysis of [18] tight for smooth functions, and if not, for what kind of loss functions and data distributions can the stability analysis be improved? In this paper we first settle open questions regarding tightness of bounds in the data-independent setting: we show that for general datasets, the existing analysis for convex and strongly-convex loss functions is tight, but it can be improved for non-convex loss functions. Next, we give a novel and improved data-dependent bounds: we show stability upper bounds for a large class of convex regularized loss functions, with negligible regularization parameters, and improve existing data-dependent bounds in the non-convex setting. We hope that our results will initiate further efforts to better understand the data-dependent setting under non-convex loss functions, leading to an improved understanding of the generalization abilities of deep networks.

Submitted 10 February, 2021; originally announced February 2021.

ACM Class: I.2.6; G.1.6

arXiv:2102.04124 [pdf, other]

SONIC: SOcial Network with Influencers and Communities

Authors: Cathy Yi-Hsuan Chen, Wolfgang Karl Härdle, Yegor Klochkov

Abstract: The integration of social media characteristics into an econometric framework requires modeling a high dimensional dynamic network with dimensions of parameter typically much larger than the number of observations. To cope with this problem, we introduce SONIC, a new high-dimensional network model that assumes that (1) only few influencers drive the network dynamics; (2) the community structure of the network is characterized by homogeneity of response to specific influencers, implying their underlying similarity. An estimation procedure is proposed based on a greedy algorithm and LASSO regularization. Through theoretical study and simulations, we show that the matrix parameter can be estimated even when sample size is smaller than the size of the network. Using a novel dataset retrieved from one of leading social media platforms - StockTwits and quantifying their opinions via natural language processing, we model the opinions network dynamics among a select group of users and further detect the latent communities. With a sparsity regularization, we can identify important nodes in the network.

Submitted 8 February, 2021; originally announced February 2021.

arXiv:2101.02280 [pdf]

Independent Action Models and Prediction of Combination Treatment Effects for Response Rate, Duration of Response and Tumor Size Change in Oncology Drug Development

Authors: Linda Z. Sun, Cai, Wu, Xiaoyun, Li, Cong Chen, Emmett V. Schmidt

Abstract: An unprecedented number of new cancer targets are in development, and most are being developed in combination therapies. Early oncology development is strategically challenged in choosing the best combinations to move forward to late-stage development. The most common early endpoints to be assessed in such decision-making include objective response rate, duration of response and tumor size change. In this paper, using independent-drug-action and Bliss-drug-independence concepts as a foundation, we introduce simple models to predict combination therapy efficacy for duration of response and tumor size change. These models complement previous publications using the independent action models (Palmer 2017, Schmidt 2020) to predict progression-free survival and objective response rate and serve as new predictive models to understand drug combinations for early endpoints. The models can be applied to predict the combination treatment effect for early endpoints given monotherapy data, or to estimate the possible effect of one monotherapy in the combination if data are available from the combination therapy and the other monotherapy. Such quantitative work facilitates efficient oncology drug development.

Submitted 6 January, 2021; originally announced January 2021.

Showing 1–50 of 229 results for author: Chen, C