Search | arXiv e-print repository

Improving the Transferability of Adversarial Examples by Feature Augmentation

Authors: Donghua Wang, Wen Yao, Tingsong Jiang, Xiaohu Zheng, Junqi Wu, Xiaoqian Chen

Abstract: Despite the success of input transformation-based attacks on boosting adversarial transferability, the performance is unsatisfying due to the ignorance of the discrepancy across models. In this paper, we propose a simple but effective feature augmentation attack (FAUG) method, which improves adversarial transferability without introducing extra computation costs. Specifically, we inject the random… ▽ More Despite the success of input transformation-based attacks on boosting adversarial transferability, the performance is unsatisfying due to the ignorance of the discrepancy across models. In this paper, we propose a simple but effective feature augmentation attack (FAUG) method, which improves adversarial transferability without introducing extra computation costs. Specifically, we inject the random noise into the intermediate features of the model to enlarge the diversity of the attack gradient, thereby mitigating the risk of overfitting to the specific model and notably amplifying adversarial transferability. Moreover, our method can be combined with existing gradient attacks to augment their performance further. Extensive experiments conducted on the ImageNet dataset across CNN and transformer models corroborate the efficacy of our method, e.g., we achieve improvement of +26.22% and +5.57% on input transformation-based attacks and combination methods, respectively. △ Less

Submitted 9 July, 2024; originally announced July 2024.

Comments: 19 pages, 4 figures, 4 tables

arXiv:2407.06688 [pdf, other]

Universal Multi-view Black-box Attack against Object Detectors via Layout Optimization

Authors: Donghua Wang, Wen Yao, Tingsong Jiang, Chao Li, Xiaoqian Chen

Abstract: Object detectors have demonstrated vulnerability to adversarial examples crafted by small perturbations that can deceive the object detector. Existing adversarial attacks mainly focus on white-box attacks and are merely valid at a specific viewpoint, while the universal multi-view black-box attack is less explored, limiting their generalization in practice. In this paper, we propose a novel univer… ▽ More Object detectors have demonstrated vulnerability to adversarial examples crafted by small perturbations that can deceive the object detector. Existing adversarial attacks mainly focus on white-box attacks and are merely valid at a specific viewpoint, while the universal multi-view black-box attack is less explored, limiting their generalization in practice. In this paper, we propose a novel universal multi-view black-box attack against object detectors, which optimizes a universal adversarial UV texture constructed by multiple image stickers for a 3D object via the designed layout optimization algorithm. Specifically, we treat the placement of image stickers on the UV texture as a circle-based layout optimization problem, whose objective is to find the optimal circle layout filled with image stickers so that it can deceive the object detector under the multi-view scenario. To ensure reasonable placement of image stickers, two constraints are elaborately devised. To optimize the layout, we adopt the random search algorithm enhanced by the devised important-aware selection strategy to find the most appropriate image sticker for each circle from the image sticker pools. Extensive experiments conducted on four common object detectors suggested that the detection performance decreases by a large magnitude of 74.29% on average in multi-view scenarios. Additionally, a novel evaluation tool based on the photo-realistic simulator is designed to assess the texture-based attack fairly. △ Less

Submitted 9 July, 2024; originally announced July 2024.

Comments: 12 pages, 13 figures, 5 tables

arXiv:2407.06043 [pdf, other]

Test-time adaptation for geospatial point cloud semantic segmentation with distinct domain shifts

Authors: Puzuo Wang, Wei Yao, Jie Shao, Zhiyi He

Abstract: Domain adaptation (DA) techniques help deep learning models generalize across data shifts for point cloud semantic segmentation (PCSS). Test-time adaptation (TTA) allows direct adaptation of a pre-trained model to unlabeled data during inference stage without access to source data or additional training, avoiding privacy issues and large computational resources. We address TTA for geospatial PCSS… ▽ More Domain adaptation (DA) techniques help deep learning models generalize across data shifts for point cloud semantic segmentation (PCSS). Test-time adaptation (TTA) allows direct adaptation of a pre-trained model to unlabeled data during inference stage without access to source data or additional training, avoiding privacy issues and large computational resources. We address TTA for geospatial PCSS by introducing three domain shift paradigms: photogrammetric to airborne LiDAR, airborne to mobile LiDAR, and synthetic to mobile laser scanning. We propose a TTA method that progressively updates batch normalization (BN) statistics with each testing batch. Additionally, a self-supervised learning module optimizes learnable BN affine parameters. Information maximization and reliability-constrained pseudo-labeling improve prediction confidence and supply supervisory signals. Experimental results show our method improves classification accuracy by up to 20\% mIoU, outperforming other methods. For photogrammetric (SensatUrban) to airborne (Hessigheim 3D) adaptation at the inference stage, our method achieves 59.46\% mIoU and 85.97\% OA without retraining or fine-turning. △ Less

Submitted 8 July, 2024; originally announced July 2024.

arXiv:2407.02830 [pdf, other]

A Radiometric Correction based Optical Modeling Approach to Removing Reflection Noise in TLS Point Clouds of Urban Scenes

Authors: Li Fang, Tianyu Li, Yanghong Lin, Shudong Zhou, Wei Yao

Abstract: Point clouds are vital in computer vision tasks such as 3D reconstruction, autonomous driving, and robotics. However, TLS-acquired point clouds often contain virtual points from reflective surfaces, causing disruptions. This study presents a reflection noise elimination algorithm for TLS point clouds. Our innovative reflection plane detection algorithm, based on geometry-optical models and physica… ▽ More Point clouds are vital in computer vision tasks such as 3D reconstruction, autonomous driving, and robotics. However, TLS-acquired point clouds often contain virtual points from reflective surfaces, causing disruptions. This study presents a reflection noise elimination algorithm for TLS point clouds. Our innovative reflection plane detection algorithm, based on geometry-optical models and physical properties, identifies and categorizes reflection points per optical reflection theory. We've adapted the LSFH feature descriptor to retain reflection features, mitigating interference from symmetrical architectural structures. By incorporating the Hausdorff feature distance, the algorithm enhances resilience to ghosting and deformation, improving virtual point detection accuracy. Extensive experiments on the 3DRN benchmark dataset, featuring diverse urban environments with virtual TLS reflection noise, show our algorithm improves precision and recall rates for 3D points in reflective regions by 57.03\% and 31.80\%, respectively. Our method achieves a 9.17\% better outlier detection rate and 5.65\% higher accuracy than leading methods. Access the 3DRN dataset at (https://github.com/Tsuiky/3DRN). △ Less

Submitted 3 July, 2024; originally announced July 2024.

arXiv:2406.18518 [pdf, other]

APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets

Authors: Zuxin Liu, Thai Hoang, Jianguo Zhang, Ming Zhu, Tian Lan, Shirley Kokane, Juntao Tan, Weiran Yao, Zhiwei Liu, Yihao Feng, Rithesh Murthy, Liangwei Yang, Silvio Savarese, Juan Carlos Niebles, Huan Wang, Shelby Heinecke, Caiming Xiong

Abstract: The advancement of function-calling agent models requires diverse, reliable, and high-quality datasets. This paper presents APIGen, an automated data generation pipeline designed to synthesize verifiable high-quality datasets for function-calling applications. We leverage APIGen and collect 3,673 executable APIs across 21 different categories to generate diverse function-calling datasets in a scal… ▽ More The advancement of function-calling agent models requires diverse, reliable, and high-quality datasets. This paper presents APIGen, an automated data generation pipeline designed to synthesize verifiable high-quality datasets for function-calling applications. We leverage APIGen and collect 3,673 executable APIs across 21 different categories to generate diverse function-calling datasets in a scalable and structured manner. Each data in our dataset is verified through three hierarchical stages: format checking, actual function executions, and semantic verification, ensuring its reliability and correctness. We demonstrate that models trained with our curated datasets, even with only 7B parameters, can achieve state-of-the-art performance on the Berkeley Function-Calling Benchmark, outperforming multiple GPT-4 models. Moreover, our 1B model achieves exceptional performance, surpassing GPT-3.5-Turbo and Claude-3 Haiku. We release a dataset containing 60,000 high-quality entries, aiming to advance the field of function-calling agent domains. The dataset is available on Huggingface: https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k and the project homepage: https://apigen-pipeline.github.io/ △ Less

Submitted 26 June, 2024; originally announced June 2024.

arXiv:2406.12084 [pdf, other]

When Reasoning Meets Information Aggregation: A Case Study with Sports Narratives

Authors: Yebowen Hu, Kaiqiang Song, Sangwoo Cho, Xiaoyang Wang, Wenlin Yao, Hassan Foroosh, Dong Yu, Fei Liu

Abstract: Reasoning is most powerful when an LLM accurately aggregates relevant information. We examine the critical role of information aggregation in reasoning by requiring the LLM to analyze sports narratives. To succeed at this task, an LLM must infer points from actions, identify related entities, attribute points accurately to players and teams, and compile key statistics to draw conclusions. We condu… ▽ More Reasoning is most powerful when an LLM accurately aggregates relevant information. We examine the critical role of information aggregation in reasoning by requiring the LLM to analyze sports narratives. To succeed at this task, an LLM must infer points from actions, identify related entities, attribute points accurately to players and teams, and compile key statistics to draw conclusions. We conduct comprehensive experiments with real NBA basketball data and present SportsGen, a new method to synthesize game narratives. By synthesizing data, we can rigorously evaluate LLMs' reasoning capabilities under complex scenarios with varying narrative lengths and density of information. Our findings show that most models, including GPT-4o, often fail to accurately aggregate basketball scores due to frequent scoring patterns. Open-source models like Llama-3 further suffer from significant score hallucinations. Finally, the effectiveness of reasoning is influenced by narrative complexity, information density, and domain-specific terms, highlighting the challenges in analytical reasoning tasks. △ Less

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2406.11739 [pdf, other]

V3Det Challenge 2024 on Vast Vocabulary and Open Vocabulary Object Detection: Methods and Results

Authors: Jiaqi Wang, Yuhang Zang, Pan Zhang, Tao Chu, Yuhang Cao, Zeyi Sun, Ziyu Liu, Xiaoyi Dong, Tong Wu, Dahua Lin, Zeming Chen, Zhi Wang, Lingchen Meng, Wenhao Yao, Jianwei Yang, Sihong Wu, Zhineng Chen, Zuxuan Wu, Yu-Gang Jiang, Peixi Wu, Bosong Chai, Xuan Nie, Longquan Yan, Zeyu Wang, Qifan Zhou , et al. (9 additional authors not shown)

Abstract: Detecting objects in real-world scenes is a complex task due to various challenges, including the vast range of object categories, and potential encounters with previously unknown or unseen objects. The challenges necessitate the development of public benchmarks and challenges to advance the field of object detection. Inspired by the success of previous COCO and LVIS Challenges, we organize the V3… ▽ More Detecting objects in real-world scenes is a complex task due to various challenges, including the vast range of object categories, and potential encounters with previously unknown or unseen objects. The challenges necessitate the development of public benchmarks and challenges to advance the field of object detection. Inspired by the success of previous COCO and LVIS Challenges, we organize the V3Det Challenge 2024 in conjunction with the 4th Open World Vision Workshop: Visual Perception via Learning in an Open World (VPLOW) at CVPR 2024, Seattle, US. This challenge aims to push the boundaries of object detection research and encourage innovation in this field. The V3Det Challenge 2024 consists of two tracks: 1) Vast Vocabulary Object Detection: This track focuses on detecting objects from a large set of 13204 categories, testing the detection algorithm's ability to recognize and locate diverse objects. 2) Open Vocabulary Object Detection: This track goes a step further, requiring algorithms to detect objects from an open set of categories, including unknown objects. In the following sections, we will provide a comprehensive summary and analysis of the solutions submitted by participants. By analyzing the methods and solutions presented, we aim to inspire future research directions in vast vocabulary and open-vocabulary object detection, driving progress in this field. Challenge homepage: https://v3det.openxlab.org.cn/challenge △ Less

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2406.11592 [pdf, other]

ChildDiffusion: Unlocking the Potential of Generative AI and Controllable Augmentations for Child Facial Data using Stable Diffusion and Large Language Models

Authors: Muhammad Ali Farooq, Wang Yao, Peter Corcoran

Abstract: In this research work we have proposed high-level ChildDiffusion framework capable of generating photorealistic child facial samples and further embedding several intelligent augmentations on child facial data using short text prompts, detailed textual guidance from LLMs, and further image to image transformation using text guidance control conditioning thus providing an opportunity to curate full… ▽ More In this research work we have proposed high-level ChildDiffusion framework capable of generating photorealistic child facial samples and further embedding several intelligent augmentations on child facial data using short text prompts, detailed textual guidance from LLMs, and further image to image transformation using text guidance control conditioning thus providing an opportunity to curate fully synthetic large scale child datasets. The framework is validated by rendering high-quality child faces representing ethnicity data, micro expressions, face pose variations, eye blinking effects, facial accessories, different hair colours and styles, aging, multiple and different child gender subjects in a single frame. Addressing privacy concerns regarding child data acquisition requires a comprehensive approach that involves legal, ethical, and technological considerations. Keeping this in view this framework can be adapted to synthesise child facial data which can be effectively used for numerous downstream machine learning tasks. The proposed method circumvents common issues encountered in generative AI tools, such as temporal inconsistency and limited control over the rendered outputs. As an exemplary use case we have open-sourced child ethnicity data consisting of 2.5k child facial samples of five different classes which includes African, Asian, White, South Asian/ Indian, and Hispanic races by deploying the model in production inference phase. The rendered data undergoes rigorous qualitative as well as quantitative tests to cross validate its efficacy and further fine-tuning Yolo architecture for detecting and classifying child ethnicity as an exemplary downstream machine learning task. △ Less

Submitted 17 June, 2024; originally announced June 2024.

Comments: This work has been submitted to the IEEE Transactions Journal for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2406.10932 [pdf, other]

Imperceptible Rhythm Backdoor Attacks: Exploring Rhythm Transformation for Embedding Undetectable Vulnerabilities on Speech Recognition

Authors: Wenhan Yao, Jiangkun Yang, Yongqiang He, Jia Liu, Weiping Wen

Abstract: Speech recognition is an essential start ring of human-computer interaction, and recently, deep learning models have achieved excellent success in this task. However, when the model training and private data provider are always separated, some security threats that make deep neural networks (DNNs) abnormal deserve to be researched. In recent years, the typical backdoor attacks have been researched… ▽ More Speech recognition is an essential start ring of human-computer interaction, and recently, deep learning models have achieved excellent success in this task. However, when the model training and private data provider are always separated, some security threats that make deep neural networks (DNNs) abnormal deserve to be researched. In recent years, the typical backdoor attacks have been researched in speech recognition systems. The existing backdoor methods are based on data poisoning. The attacker adds some incorporated changes to benign speech spectrograms or changes the speech components, such as pitch and timbre. As a result, the poisoned data can be detected by human hearing or automatic deep algorithms. To improve the stealthiness of data poisoning, we propose a non-neural and fast algorithm called Random Spectrogram Rhythm Transformation (RSRT) in this paper. The algorithm combines four steps to generate stealthy poisoned utterances. From the perspective of rhythm component transformation, our proposed trigger stretches or squeezes the mel spectrograms and recovers them back to signals. The operation keeps timbre and content unchanged for good stealthiness. Our experiments are conducted on two kinds of speech recognition tasks, including testing the stealthiness of poisoned samples by speaker verification and automatic speech recognition. The results show that our method has excellent effectiveness and stealthiness. The rhythm trigger needs a low poisoning rate and gets a very high attack success rate. △ Less

Submitted 16 June, 2024; originally announced June 2024.

arXiv:2406.10252 [pdf, other]

AutoSurvey: Large Language Models Can Automatically Write Surveys

Authors: Yidong Wang, Qi Guo, Wenjin Yao, Hongbo Zhang, Xin Zhang, Zhen Wu, Meishan Zhang, Xinyu Dai, Min Zhang, Qingsong Wen, Wei Ye, Shikun Zhang, Yue Zhang

Abstract: This paper introduces AutoSurvey, a speedy and well-organized methodology for automating the creation of comprehensive literature surveys in rapidly evolving fields like artificial intelligence. Traditional survey paper creation faces challenges due to the vast volume and complexity of information, prompting the need for efficient survey methods. While large language models (LLMs) offer promise in… ▽ More This paper introduces AutoSurvey, a speedy and well-organized methodology for automating the creation of comprehensive literature surveys in rapidly evolving fields like artificial intelligence. Traditional survey paper creation faces challenges due to the vast volume and complexity of information, prompting the need for efficient survey methods. While large language models (LLMs) offer promise in automating this process, challenges such as context window limitations, parametric knowledge constraints, and the lack of evaluation benchmarks remain. AutoSurvey addresses these challenges through a systematic approach that involves initial retrieval and outline generation, subsection drafting by specialized LLMs, integration and refinement, and rigorous evaluation and iteration. Our contributions include a comprehensive solution to the survey problem, a reliable evaluation method, and experimental validation demonstrating AutoSurvey's effectiveness.We open our resources at \url{https://github.com/AutoSurveys/AutoSurvey}. △ Less

Submitted 17 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

arXiv:2406.06932 [pdf, other]

Synthetic Face Ageing: Evaluation, Analysis and Facilitation of Age-Robust Facial Recognition Algorithms

Authors: Wang Yao, Muhammad Ali Farooq, Joseph Lemley, Peter Corcoran

Abstract: The ability to accurately recognize an individual's face with respect to human aging factor holds significant importance for various private as well as government sectors such as customs and public security bureaus, passport office, and national database systems. Therefore, developing a robust age-invariant face recognition system is of crucial importance to address the challenges posed by ageing… ▽ More The ability to accurately recognize an individual's face with respect to human aging factor holds significant importance for various private as well as government sectors such as customs and public security bureaus, passport office, and national database systems. Therefore, developing a robust age-invariant face recognition system is of crucial importance to address the challenges posed by ageing and maintain the reliability and accuracy of facial recognition technology. In this research work, the focus is to explore the feasibility of utilizing synthetic ageing data to improve the robustness of face recognition models that can eventually help in recognizing people at broader age intervals. To achieve this, we first design set of experiments to evaluate state-of-the-art synthetic ageing methods. In the next stage we explore the effect of age intervals on a current deep learning-based face recognition algorithm by using synthetic ageing data as well as real ageing data to perform rigorous training and validation. Moreover, these synthetic age data have been used in facilitating face recognition algorithms. Experimental results show that the recognition rate of the model trained on synthetic ageing images is 3.33% higher than the results of the baseline model when tested on images with an age gap of 40 years, which prove the potential of synthetic age data which has been quantified to enhance the performance of age-invariant face recognition systems. △ Less

Submitted 10 June, 2024; originally announced June 2024.

Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2405.19444 [pdf, other]

MathChat: Benchmarking Mathematical Reasoning and Instruction Following in Multi-Turn Interactions

Authors: Zhenwen Liang, Dian Yu, Wenhao Yu, Wenlin Yao, Zhihan Zhang, Xiangliang Zhang, Dong Yu

Abstract: Large language models (LLMs) have demonstrated impressive capabilities in mathematical problem solving, particularly in single turn question answering formats. However, real world scenarios often involve mathematical question answering that requires multi turn or interactive information exchanges, and the performance of LLMs on these tasks is still underexplored. This paper introduces MathChat, a… ▽ More Large language models (LLMs) have demonstrated impressive capabilities in mathematical problem solving, particularly in single turn question answering formats. However, real world scenarios often involve mathematical question answering that requires multi turn or interactive information exchanges, and the performance of LLMs on these tasks is still underexplored. This paper introduces MathChat, a comprehensive benchmark specifically designed to evaluate LLMs across a broader spectrum of mathematical tasks. These tasks are structured to assess the models' abilities in multiturn interactions and open ended generation. We evaluate the performance of various SOTA LLMs on the MathChat benchmark, and we observe that while these models excel in single turn question answering, they significantly underperform in more complex scenarios that require sustained reasoning and dialogue understanding. To address the above limitations of existing LLMs when faced with multiturn and open ended tasks, we develop MathChat sync, a synthetic dialogue based math dataset for LLM finetuning, focusing on improving models' interaction and instruction following capabilities in conversations. Experimental results emphasize the need for training LLMs with diverse, conversational instruction tuning datasets like MathChatsync. We believe this work outlines one promising direction for improving the multiturn mathematical reasoning abilities of LLMs, thus pushing forward the development of LLMs that are more adept at interactive mathematical problem solving and real world applications. △ Less

Submitted 29 May, 2024; originally announced May 2024.

arXiv:2405.18777 [pdf, other]

SPABA: A Single-Loop and Probabilistic Stochastic Bilevel Algorithm Achieving Optimal Sample Complexity

Authors: Tianshu Chu, Dachuan Xu, Wei Yao, Jin Zhang

Abstract: While stochastic bilevel optimization methods have been extensively studied for addressing large-scale nested optimization problems in machine learning, it remains an open question whether the optimal complexity bounds for solving bilevel optimization are the same as those in single-level optimization. Our main result resolves this question: SPABA, an adaptation of the PAGE method for nonconvex op… ▽ More While stochastic bilevel optimization methods have been extensively studied for addressing large-scale nested optimization problems in machine learning, it remains an open question whether the optimal complexity bounds for solving bilevel optimization are the same as those in single-level optimization. Our main result resolves this question: SPABA, an adaptation of the PAGE method for nonconvex optimization in (Li et al., 2021) to the bilevel setting, can achieve optimal sample complexity in both the finite-sum and expectation settings. We show the optimality of SPABA by proving that there is no gap in complexity analysis between stochastic bilevel and single-level optimization when implementing PAGE. Notably, as indicated by the results of (Dagréou et al., 2022), there might exist a gap in complexity analysis when implementing other stochastic gradient estimators, like SGD and SAGA. In addition to SPABA, we propose several other single-loop stochastic bilevel algorithms, that either match or improve the state-of-the-art sample complexity results, leveraging our convergence rate and complexity analysis. Numerical experiments demonstrate the superior practical performance of the proposed methods. △ Less

Submitted 29 May, 2024; originally announced May 2024.

Comments: Accepted by ICML 2024

arXiv:2405.09927 [pdf, other]

Moreau Envelope for Nonconvex Bi-Level Optimization: A Single-loop and Hessian-free Solution Strategy

Authors: Risheng Liu, Zhu Liu, Wei Yao, Shangzhi Zeng, Jin Zhang

Abstract: This work focuses on addressing two major challenges in the context of large-scale nonconvex Bi-Level Optimization (BLO) problems, which are increasingly applied in machine learning due to their ability to model nested structures. These challenges involve ensuring computational efficiency and providing theoretical guarantees. While recent advances in scalable BLO algorithms have primarily relied o… ▽ More This work focuses on addressing two major challenges in the context of large-scale nonconvex Bi-Level Optimization (BLO) problems, which are increasingly applied in machine learning due to their ability to model nested structures. These challenges involve ensuring computational efficiency and providing theoretical guarantees. While recent advances in scalable BLO algorithms have primarily relied on lower-level convexity simplification, our work specifically tackles large-scale BLO problems involving nonconvexity in both the upper and lower levels. We simultaneously address computational and theoretical challenges by introducing an innovative single-loop gradient-based algorithm, utilizing the Moreau envelope-based reformulation, and providing non-asymptotic convergence analysis for general nonconvex BLO problems. Notably, our algorithm relies solely on first-order gradient information, enhancing its practicality and efficiency, especially for large-scale BLO learning tasks. We validate our approach's effectiveness through experiments on various synthetic problems, two typical hyper-parameter learning tasks, and a real-world neural architecture search application, collectively demonstrating its superior performance. △ Less

Submitted 16 May, 2024; originally announced May 2024.

Comments: Accepted by ICML 2024

arXiv:2405.08283 [pdf, other]

Vector Field-Guided Learning Predictive Control for Motion Planning of Mobile Robots with Uncertain Dynamics

Authors: Yang Lu, Weijia Yao, Yongqian Xiao, Xinglong Zhang, Xin Xu, Yaonan Wang, Dingbang Xiao

Abstract: In obstacle-dense scenarios, providing safe guidance for mobile robots is critical to improve the safe maneuvering capability. However, the guidance provided by standard guiding vector fields (GVFs) may limit the motion capability due to the improper curvature of the integral curve when traversing obstacles. On the other hand, robotic system dynamics are often time-varying, uncertain, and even unk… ▽ More In obstacle-dense scenarios, providing safe guidance for mobile robots is critical to improve the safe maneuvering capability. However, the guidance provided by standard guiding vector fields (GVFs) may limit the motion capability due to the improper curvature of the integral curve when traversing obstacles. On the other hand, robotic system dynamics are often time-varying, uncertain, and even unknown during the motion planning process. Therefore, many existing kinodynamic motion planning methods could not achieve satisfactory reliability in guaranteeing safety. To address these challenges, we propose a two-level Vector Field-guided Learning Predictive Control (VF-LPC) approach that improves safe maneuverability. The first level, the guiding level, generates safe desired trajectories using the designed kinodynamic GVF, enabling safe motion in obstacle-dense environments. The second level, the Integrated Motion Planning and Control (IMPC) level, first uses a deep Koopman operator to learn a nominal dynamics model offline and then updates the model uncertainties online using sparse Gaussian processes (GPs). The learned dynamics and a game-based safe barrier function are then incorporated into the LPC framework to generate near-optimal planning solutions. Extensive simulations and real-world experiments were conducted on quadrotor unmanned aerial vehicles and unmanned ground vehicles, demonstrating that VF-LPC enables robots to maneuver safely. △ Less

Submitted 17 July, 2024; v1 submitted 13 May, 2024; originally announced May 2024.

arXiv:2405.04861 [pdf, other]

Insights into Deep Learning Refactoring: Bridging the Gap Between Practices and Expectations

Authors: SiQi Wang, Xing Hu, Bei Wang, WenXin Yao, Xin Xia, XingYu Wang

Abstract: With the rapid development of deep learning, the implementation of intricate algorithms and substantial data processing have become standard elements of deep learning projects. As a result, the code has become progressively complex as the software evolves, which is difficult to maintain and understand. Existing studies have investigated the impact of refactoring on software quality within traditio… ▽ More With the rapid development of deep learning, the implementation of intricate algorithms and substantial data processing have become standard elements of deep learning projects. As a result, the code has become progressively complex as the software evolves, which is difficult to maintain and understand. Existing studies have investigated the impact of refactoring on software quality within traditional software. However, the insight of code refactoring in the context of deep learning is still unclear. This study endeavors to fill this knowledge gap by empirically examining the current state of code refactoring in deep learning realm, and practitioners' views on refactoring. We first manually analyzed the commit history of five popular and well-maintained deep learning projects (e.g., PyTorch). We mined 4,921 refactoring practices in historical commits and measured how different types and elements of refactoring operations are distributed and found that refactoring operation types' distribution in deep learning projects is different from it in traditional Java software. We then surveyed 159 practitioners about their views of code refactoring in deep learning projects and their expectations of current refactoring tools. The result of the survey showed that refactoring research and the development of related tools in the field of deep learning are crucial for improving project maintainability and code quality, and that current refactoring tools do not adequately meet the needs of practitioners. Lastly, we provided our perspective on the future advancement of refactoring tools and offered suggestions for developers' development practices. △ Less

Submitted 8 May, 2024; originally announced May 2024.

Comments: 24 pages, 18 figures

arXiv:2404.13692 [pdf, other]

A sustainable development perspective on urban-scale roof greening priorities and benefits

Authors: Jie Shao, Wei Yao, Lei Luo, Linzhou Zeng, Zhiyi He, Puzuo Wang, Huadong Guo

Abstract: Greenspaces are tightly linked to human well-being. Yet, rapid urbanization has exacerbated greenspace exposure inequality and declining human life quality. Roof greening has been recognized as an effective strategy to mitigate these negative impacts. Understanding priorities and benefits is crucial to promoting green roofs. Here, using geospatial big data, we conduct an urban-scale assessment of… ▽ More Greenspaces are tightly linked to human well-being. Yet, rapid urbanization has exacerbated greenspace exposure inequality and declining human life quality. Roof greening has been recognized as an effective strategy to mitigate these negative impacts. Understanding priorities and benefits is crucial to promoting green roofs. Here, using geospatial big data, we conduct an urban-scale assessment of roof greening at a single building level in Hong Kong from a sustainable development perspective. We identify that 85.3\% of buildings reveal potential and urgent demand for roof greening. We further find green roofs could increase greenspace exposure by \textasciitilde61\% and produce hundreds of millions (HK\$) in economic benefits annually but play a small role in urban heat mitigation (\textasciitilde0.15\degree{C}) and annual carbon emission offsets (\textasciitilde0.8\%). Our study offers a comprehensive assessment of roof greening, which could provide reference for sustainable development in cities worldwide, from data utilization to solutions and findings. △ Less

Submitted 21 April, 2024; originally announced April 2024.

arXiv:2404.06003 [pdf, other]

FreeEval: A Modular Framework for Trustworthy and Efficient Evaluation of Large Language Models

Authors: Zhuohao Yu, Chang Gao, Wenjin Yao, Yidong Wang, Zhengran Zeng, Wei Ye, Jindong Wang, Yue Zhang, Shikun Zhang

Abstract: The rapid development of large language model (LLM) evaluation methodologies and datasets has led to a profound challenge: integrating state-of-the-art evaluation techniques cost-effectively while ensuring reliability, reproducibility, and efficiency. Currently, there is a notable absence of a unified and adaptable framework that seamlessly integrates various evaluation approaches. Moreover, the r… ▽ More The rapid development of large language model (LLM) evaluation methodologies and datasets has led to a profound challenge: integrating state-of-the-art evaluation techniques cost-effectively while ensuring reliability, reproducibility, and efficiency. Currently, there is a notable absence of a unified and adaptable framework that seamlessly integrates various evaluation approaches. Moreover, the reliability of evaluation findings is often questionable due to potential data contamination, with the evaluation efficiency commonly overlooked when facing the substantial costs associated with LLM inference. In response to these challenges, we introduce FreeEval, a modular and scalable framework crafted to enable trustworthy and efficient automatic evaluations of LLMs. Firstly, FreeEval's unified abstractions simplify the integration and improve the transparency of diverse evaluation methodologies, encompassing dynamic evaluation that demand sophisticated LLM interactions. Secondly, the framework integrates meta-evaluation techniques like human evaluation and data contamination detection, which, along with dynamic evaluation modules in the platform, enhance the fairness of the evaluation outcomes. Lastly, FreeEval is designed with a high-performance infrastructure, including distributed computation and caching strategies, enabling extensive evaluations across multi-node, multi-GPU clusters for open-source and proprietary LLMs. △ Less

Submitted 9 April, 2024; originally announced April 2024.

Comments: We open-source all our code at: https://github.com/WisdomShell/FreeEval

arXiv:2403.08946 [pdf, other]

Usable XAI: 10 Strategies Towards Exploiting Explainability in the LLM Era

Authors: Xuansheng Wu, Haiyan Zhao, Yaochen Zhu, Yucheng Shi, Fan Yang, Tianming Liu, Xiaoming Zhai, Wenlin Yao, Jundong Li, Mengnan Du, Ninghao Liu

Abstract: Explainable AI (XAI) refers to techniques that provide human-understandable insights into the workings of AI models. Recently, the focus of XAI is being extended towards Large Language Models (LLMs) which are often criticized for their lack of transparency. This extension calls for a significant transformation in XAI methodologies because of two reasons. First, many existing XAI methods cannot be… ▽ More Explainable AI (XAI) refers to techniques that provide human-understandable insights into the workings of AI models. Recently, the focus of XAI is being extended towards Large Language Models (LLMs) which are often criticized for their lack of transparency. This extension calls for a significant transformation in XAI methodologies because of two reasons. First, many existing XAI methods cannot be directly applied to LLMs due to their complexity advanced capabilities. Second, as LLMs are increasingly deployed across diverse industry applications, the role of XAI shifts from merely opening the "black box" to actively enhancing the productivity and applicability of LLMs in real-world settings. Meanwhile, unlike traditional machine learning models that are passive recipients of XAI insights, the distinct abilities of LLMs can reciprocally enhance XAI. Therefore, in this paper, we introduce Usable XAI in the context of LLMs by analyzing (1) how XAI can benefit LLMs and AI systems, and (2) how LLMs can contribute to the advancement of XAI. We introduce 10 strategies, introducing the key techniques for each and discussing their associated challenges. We also provide case studies to demonstrate how to obtain and leverage explanations. The code used in this paper can be found at: https://github.com/JacksonWuxs/UsableXAI_LLM. △ Less

Submitted 13 March, 2024; originally announced March 2024.

Comments: 38 pages, 4 figures

arXiv:2403.06197 [pdf, other]

DrFuse: Learning Disentangled Representation for Clinical Multi-Modal Fusion with Missing Modality and Modal Inconsistency

Authors: Wenfang Yao, Kejing Yin, William K. Cheung, Jia Liu, Jing Qin

Abstract: The combination of electronic health records (EHR) and medical images is crucial for clinicians in making diagnoses and forecasting prognosis. Strategically fusing these two data modalities has great potential to improve the accuracy of machine learning models in clinical prediction tasks. However, the asynchronous and complementary nature of EHR and medical images presents unique challenges. Miss… ▽ More The combination of electronic health records (EHR) and medical images is crucial for clinicians in making diagnoses and forecasting prognosis. Strategically fusing these two data modalities has great potential to improve the accuracy of machine learning models in clinical prediction tasks. However, the asynchronous and complementary nature of EHR and medical images presents unique challenges. Missing modalities due to clinical and administrative factors are inevitable in practice, and the significance of each data modality varies depending on the patient and the prediction target, resulting in inconsistent predictions and suboptimal model performance. To address these challenges, we propose DrFuse to achieve effective clinical multi-modal fusion. It tackles the missing modality issue by disentangling the features shared across modalities and those unique within each modality. Furthermore, we address the modal inconsistency issue via a disease-wise attention layer that produces the patient- and disease-wise weighting for each modality to make the final prediction. We validate the proposed method using real-world large-scale datasets, MIMIC-IV and MIMIC-CXR. Experimental results show that the proposed method significantly outperforms the state-of-the-art models. Our implementation is publicly available at https://github.com/dorothy-yao/drfuse. △ Less

Submitted 10 March, 2024; originally announced March 2024.

Comments: Accepted by AAAI-24

arXiv:2403.02132 [pdf, other]

UB-FineNet: Urban Building Fine-grained Classification Network for Open-access Satellite Images

Authors: Zhiyi He, Wei Yao, Jie Shao, Puzuo Wang

Abstract: Fine classification of city-scale buildings from satellite remote sensing imagery is a crucial research area with significant implications for urban planning, infrastructure development, and population distribution analysis. However, the task faces big challenges due to low-resolution overhead images acquired from high altitude space-borne platforms and the long-tail sample distribution of fine-gr… ▽ More Fine classification of city-scale buildings from satellite remote sensing imagery is a crucial research area with significant implications for urban planning, infrastructure development, and population distribution analysis. However, the task faces big challenges due to low-resolution overhead images acquired from high altitude space-borne platforms and the long-tail sample distribution of fine-grained urban building categories, leading to severe class imbalance problem. To address these issues, we propose a deep network approach to fine-grained classification of urban buildings using open-access satellite images. A Denoising Diffusion Probabilistic Model (DDPM) based super-resolution method is first introduced to enhance the spatial resolution of satellite images, which benefits from domain-adaptive knowledge distillation. Then, a new fine-grained classification network with Category Information Balancing Module (CIBM) and Contrastive Supervision (CS) technique is proposed to mitigate the problem of class imbalance and improve the classification robustness and accuracy. Experiments on Hong Kong data set with 11 fine building types revealed promising classification results with a mean Top-1 accuracy of 60.45\%, which is on par with street-view image based approaches. Extensive ablation study shows that CIBM and CS improve Top-1 accuracy by 2.6\% and 3.5\% compared to the baseline method, respectively. And both modules can be easily inserted into other classification networks and similar enhancements have been achieved. Our research contributes to the field of urban analysis by providing a practical solution for fine classification of buildings in challenging mega city scenarios solely using open-access satellite images. The proposed method can serve as a valuable tool for urban planners, aiding in the understanding of economic, industrial, and population distribution. △ Less

Submitted 4 March, 2024; originally announced March 2024.

arXiv:2402.19465 [pdf, other]

Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models

Authors: Chen Qian, Jie Zhang, Wei Yao, Dongrui Liu, Zhenfei Yin, Yu Qiao, Yong Liu, Jing Shao

Abstract: Ensuring the trustworthiness of large language models (LLMs) is crucial. Most studies concentrate on fully pre-trained LLMs to better understand and improve LLMs' trustworthiness. In this paper, to reveal the untapped potential of pre-training, we pioneer the exploration of LLMs' trustworthiness during this period, focusing on five key dimensions: reliability, privacy, toxicity, fairness, and robu… ▽ More Ensuring the trustworthiness of large language models (LLMs) is crucial. Most studies concentrate on fully pre-trained LLMs to better understand and improve LLMs' trustworthiness. In this paper, to reveal the untapped potential of pre-training, we pioneer the exploration of LLMs' trustworthiness during this period, focusing on five key dimensions: reliability, privacy, toxicity, fairness, and robustness. To begin with, we apply linear probing to LLMs. The high probing accuracy suggests that \textit{LLMs in early pre-training can already distinguish concepts in each trustworthiness dimension}. Therefore, to further uncover the hidden possibilities of pre-training, we extract steering vectors from a LLM's pre-training checkpoints to enhance the LLM's trustworthiness. Finally, inspired by~\citet{choi2023understanding} that mutual information estimation is bounded by linear probing accuracy, we also probe LLMs with mutual information to investigate the dynamics of trustworthiness during pre-training. We are the first to observe a similar two-phase phenomenon: fitting and compression~\citep{shwartz2017opening}. This research provides an initial exploration of trustworthiness modeling during LLM pre-training, seeking to unveil new insights and spur further developments in the field. We will make our code publicly accessible at \url{https://github.com/ChnQ/TracingLLM}. △ Less

Submitted 29 February, 2024; originally announced February 2024.

arXiv:2402.17124 [pdf, other]

Fact-and-Reflection (FaR) Improves Confidence Calibration of Large Language Models

Authors: Xinran Zhao, Hongming Zhang, Xiaoman Pan, Wenlin Yao, Dong Yu, Tongshuang Wu, Jianshu Chen

Abstract: For a LLM to be trustworthy, its confidence level should be well-calibrated with its actual performance. While it is now common sense that LLM performances are greatly impacted by prompts, the confidence calibration in prompting LLMs has yet to be thoroughly explored. In this paper, we explore how different prompting strategies influence LLM confidence calibration and how it could be improved. We… ▽ More For a LLM to be trustworthy, its confidence level should be well-calibrated with its actual performance. While it is now common sense that LLM performances are greatly impacted by prompts, the confidence calibration in prompting LLMs has yet to be thoroughly explored. In this paper, we explore how different prompting strategies influence LLM confidence calibration and how it could be improved. We conduct extensive experiments on six prompting methods in the question-answering context and we observe that, while these methods help improve the expected LLM calibration, they also trigger LLMs to be over-confident when responding to some instances. Inspired by human cognition, we propose Fact-and-Reflection (FaR) prompting, which improves the LLM calibration in two steps. First, FaR elicits the known "facts" that are relevant to the input prompt from the LLM. And then it asks the model to "reflect" over them to generate the final answer. Experiments show that FaR prompting achieves significantly better calibration; it lowers the Expected Calibration Error by 23.5% on our multi-purpose QA tasks. Notably, FaR prompting even elicits the capability of verbally expressing concerns in less confident scenarios, which helps trigger retrieval augmentation for solving these harder instances. △ Less

Submitted 26 February, 2024; originally announced February 2024.

Comments: 17 pages, 10 figures

arXiv:2402.15538 [pdf, other]

AgentLite: A Lightweight Library for Building and Advancing Task-Oriented LLM Agent System

Authors: Zhiwei Liu, Weiran Yao, Jianguo Zhang, Liangwei Yang, Zuxin Liu, Juntao Tan, Prafulla K. Choubey, Tian Lan, Jason Wu, Huan Wang, Shelby Heinecke, Caiming Xiong, Silvio Savarese

Abstract: The booming success of LLMs initiates rapid development in LLM agents. Though the foundation of an LLM agent is the generative model, it is critical to devise the optimal reasoning strategies and agent architectures. Accordingly, LLM agent research advances from the simple chain-of-thought prompting to more complex ReAct and Reflection reasoning strategy; agent architecture also evolves from singl… ▽ More The booming success of LLMs initiates rapid development in LLM agents. Though the foundation of an LLM agent is the generative model, it is critical to devise the optimal reasoning strategies and agent architectures. Accordingly, LLM agent research advances from the simple chain-of-thought prompting to more complex ReAct and Reflection reasoning strategy; agent architecture also evolves from single agent generation to multi-agent conversation, as well as multi-LLM multi-agent group chat. However, with the existing intricate frameworks and libraries, creating and evaluating new reasoning strategies and agent architectures has become a complex challenge, which hinders research investigation into LLM agents. Thus, we open-source a new AI agent library, AgentLite, which simplifies this process by offering a lightweight, user-friendly platform for innovating LLM agent reasoning, architectures, and applications with ease. AgentLite is a task-oriented framework designed to enhance the ability of agents to break down tasks and facilitate the development of multi-agent systems. Furthermore, we introduce multiple practical applications developed with AgentLite to demonstrate its convenience and flexibility. Get started now at: \url{https://github.com/SalesforceAIResearch/AgentLite}. △ Less

Submitted 23 February, 2024; originally announced February 2024.

Comments: preprint. Library is available at https://github.com/SalesforceAIResearch/AgentLite

arXiv:2402.15506 [pdf, other]

AgentOhana: Design Unified Data and Training Pipeline for Effective Agent Learning

Authors: Jianguo Zhang, Tian Lan, Rithesh Murthy, Zhiwei Liu, Weiran Yao, Juntao Tan, Thai Hoang, Liangwei Yang, Yihao Feng, Zuxin Liu, Tulika Awalgaonkar, Juan Carlos Niebles, Silvio Savarese, Shelby Heinecke, Huan Wang, Caiming Xiong

Abstract: Autonomous agents powered by large language models (LLMs) have garnered significant research attention. However, fully harnessing the potential of LLMs for agent-based tasks presents inherent challenges due to the heterogeneous nature of diverse data sources featuring multi-turn trajectories. In this paper, we introduce \textbf{AgentOhana} as a comprehensive solution to address these challenges. \… ▽ More Autonomous agents powered by large language models (LLMs) have garnered significant research attention. However, fully harnessing the potential of LLMs for agent-based tasks presents inherent challenges due to the heterogeneous nature of diverse data sources featuring multi-turn trajectories. In this paper, we introduce \textbf{AgentOhana} as a comprehensive solution to address these challenges. \textit{AgentOhana} aggregates agent trajectories from distinct environments, spanning a wide array of scenarios. It meticulously standardizes and unifies these trajectories into a consistent format, streamlining the creation of a generic data loader optimized for agent training. Leveraging the data unification, our training pipeline maintains equilibrium across different data sources and preserves independent randomness across devices during dataset partitioning and model training. Additionally, we present \textbf{xLAM-v0.1}, a large action model tailored for AI agents, which demonstrates exceptional performance across various benchmarks. Begin the exploration at \url{https://github.com/SalesforceAIResearch/xLAM}. △ Less

Submitted 20 March, 2024; v1 submitted 23 February, 2024; originally announced February 2024.

Comments: Add GitHub repo link at \url{https://github.com/SalesforceAIResearch/xLAM} and HuggingFace model link at \url{https://huggingface.co/Salesforce/xLAM-v0.1-r}

arXiv:2402.15043 [pdf, other]

KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models

Authors: Zhuohao Yu, Chang Gao, Wenjin Yao, Yidong Wang, Wei Ye, Jindong Wang, Xing Xie, Yue Zhang, Shikun Zhang

Abstract: Automatic evaluation methods for large language models (LLMs) are hindered by data contamination, leading to inflated assessments of their effectiveness. Existing strategies, which aim to detect contaminated texts, focus on quantifying contamination status instead of accurately gauging model performance. In this paper, we introduce KIEval, a Knowledge-grounded Interactive Evaluation framework, whi… ▽ More Automatic evaluation methods for large language models (LLMs) are hindered by data contamination, leading to inflated assessments of their effectiveness. Existing strategies, which aim to detect contaminated texts, focus on quantifying contamination status instead of accurately gauging model performance. In this paper, we introduce KIEval, a Knowledge-grounded Interactive Evaluation framework, which incorporates an LLM-powered "interactor" role for the first time to accomplish a dynamic contamination-resilient evaluation. Starting with a question in a conventional LLM benchmark involving domain-specific knowledge, KIEval utilizes dynamically generated, multi-round, and knowledge-focused dialogues to determine whether a model's response is merely a recall of benchmark answers or demonstrates a deep comprehension to apply knowledge in more complex conversations. Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization. We also reveal that data contamination brings no contribution or even negative effect to models' real-world applicability and understanding, and existing contamination detection methods for LLMs can only identify contamination in pre-training but not during supervised fine-tuning. △ Less

Submitted 3 June, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

Comments: Accepted to ACL 2024 (main conference); 19 pages, 5 figures, 19 tables, code is available at: https://github.com/zhuohaoyu/KIEval

arXiv:2401.16164 [pdf, other]

Constrained Bi-Level Optimization: Proximal Lagrangian Value function Approach and Hessian-free Algorithm

Authors: Wei Yao, Chengming Yu, Shangzhi Zeng, Jin Zhang

Abstract: This paper presents a new approach and algorithm for solving a class of constrained Bi-Level Optimization (BLO) problems in which the lower-level problem involves constraints coupling both upper-level and lower-level variables. Such problems have recently gained significant attention due to their broad applicability in machine learning. However, conventional gradient-based methods unavoidably rely… ▽ More This paper presents a new approach and algorithm for solving a class of constrained Bi-Level Optimization (BLO) problems in which the lower-level problem involves constraints coupling both upper-level and lower-level variables. Such problems have recently gained significant attention due to their broad applicability in machine learning. However, conventional gradient-based methods unavoidably rely on computationally intensive calculations related to the Hessian matrix. To address this challenge, we begin by devising a smooth proximal Lagrangian value function to handle the constrained lower-level problem. Utilizing this construct, we introduce a single-level reformulation for constrained BLOs that transforms the original BLO problem into an equivalent optimization problem with smooth constraints. Enabled by this reformulation, we develop a Hessian-free gradient-based algorithm-termed proximal Lagrangian Value function-based Hessian-free Bi-level Algorithm (LV-HBA)-that is straightforward to implement in a single loop manner. Consequently, LV-HBA is especially well-suited for machine learning applications. Furthermore, we offer non-asymptotic convergence analysis for LV-HBA, eliminating the need for traditional strong convexity assumptions for the lower-level problem while also being capable of accommodating non-singleton scenarios. Empirical results substantiate the algorithm's superior practical performance. △ Less

Submitted 29 January, 2024; originally announced January 2024.

arXiv:2401.14535 [pdf, other]

CaRiNG: Learning Temporal Causal Representation under Non-Invertible Generation Process

Authors: Guangyi Chen, Yifan Shen, Zhenhao Chen, Xiangchen Song, Yuewen Sun, Weiran Yao, Xiao Liu, Kun Zhang

Abstract: Identifying the underlying time-delayed latent causal processes in sequential data is vital for grasping temporal dynamics and making downstream reasoning. While some recent methods can robustly identify these latent causal variables, they rely on strict assumptions about the invertible generation process from latent variables to observed data. However, these assumptions are often hard to satisfy… ▽ More Identifying the underlying time-delayed latent causal processes in sequential data is vital for grasping temporal dynamics and making downstream reasoning. While some recent methods can robustly identify these latent causal variables, they rely on strict assumptions about the invertible generation process from latent variables to observed data. However, these assumptions are often hard to satisfy in real-world applications containing information loss. For instance, the visual perception process translates a 3D space into 2D images, or the phenomenon of persistence of vision incorporates historical data into current perceptions. To address this challenge, we establish an identifiability theory that allows for the recovery of independent latent components even when they come from a nonlinear and non-invertible mix. Using this theory as a foundation, we propose a principled approach, CaRiNG, to learn the CAusal RepresentatIon of Non-invertible Generative temporal data with identifiability guarantees. Specifically, we utilize temporal context to recover lost latent information and apply the conditions in our theory to guide the training process. Through experiments conducted on synthetic datasets, we validate that our CaRiNG method reliably identifies the causal process, even when the generation process is non-invertible. Moreover, we demonstrate that our approach considerably improves temporal understanding and reasoning in practical applications. △ Less

Submitted 30 May, 2024; v1 submitted 25 January, 2024; originally announced January 2024.

Comments: To appear at ICML 2024, 24 pages

arXiv:2401.14013 [pdf, other]

doi 10.1109/TAC.2024.3355321

Coordinated Guiding Vector Field Design for Ordering-Flexible Multi-Robot Surface Navigation

Authors: Bin-Bin Hu, Hai-Tao Zhang, Weijia Yao, Zhiyong Sun, Ming Cao

Abstract: We design a distributed coordinated guiding vector field (CGVF) for a group of robots to achieve ordering-flexible motion coordination while maneuvering on a desired two-dimensional (2D) surface. The CGVF is characterized by three terms, i.e., a convergence term to drive the robots to converge to the desired surface, a propagation term to provide a traversing direction for maneuvering on the desir… ▽ More We design a distributed coordinated guiding vector field (CGVF) for a group of robots to achieve ordering-flexible motion coordination while maneuvering on a desired two-dimensional (2D) surface. The CGVF is characterized by three terms, i.e., a convergence term to drive the robots to converge to the desired surface, a propagation term to provide a traversing direction for maneuvering on the desired surface, and a coordinated term to achieve the surface motion coordination with an arbitrary ordering of the robotic group. By setting the surface parameters as additional virtual coordinates, the proposed approach eliminates the potential singularity of the CGVF and enables both the global convergence to the desired surface and the maneuvering on the surface from all possible initial conditions. The ordering-flexible surface motion coordination is realized by each robot to share with its neighbors only two virtual coordinates, i.e. that of a given target and that of its own, which reduces the communication and computation cost in multi-robot surface navigation. Finally, the effectiveness of the CGVF is substantiated by extensive numerical simulations. △ Less

Submitted 25 January, 2024; originally announced January 2024.

Comments: Published on IEEE Transactions on Automatic Control, 2024

arXiv:2401.13919 [pdf, other]

WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models

Authors: Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, Dong Yu

Abstract: The rapid advancement of large language models (LLMs) has led to a new era marked by the development of autonomous applications in real-world scenarios, which drives innovation in creating advanced web agents. Existing web agents typically only handle one input modality and are evaluated only in simplified web simulators or static web snapshots, greatly limiting their applicability in real-world s… ▽ More The rapid advancement of large language models (LLMs) has led to a new era marked by the development of autonomous applications in real-world scenarios, which drives innovation in creating advanced web agents. Existing web agents typically only handle one input modality and are evaluated only in simplified web simulators or static web snapshots, greatly limiting their applicability in real-world scenarios. To bridge this gap, we introduce WebVoyager, an innovative Large Multimodal Model (LMM) powered web agent that can complete user instructions end-to-end by interacting with real-world websites. Moreover, we establish a new benchmark by compiling real-world tasks from 15 popular websites and introduce an automatic evaluation protocol leveraging multimodal understanding abilities of GPT-4V to evaluate open-ended web agents. We show that WebVoyager achieves a 59.1% task success rate on our benchmark, significantly surpassing the performance of both GPT-4 (All Tools) and the WebVoyager (text-only) setups, underscoring the exceptional capability of WebVoyager. The proposed automatic evaluation metric achieves 85.3% agreement with human judgment, indicating its effectiveness in providing reliable and accurate assessments of web agents. △ Less

Submitted 6 June, 2024; v1 submitted 24 January, 2024; originally announced January 2024.

Comments: Accepted to ACL 2024 (main). Code and data is released at https://github.com/MinorJerry/WebVoyager

arXiv:2401.12870 [pdf, other]

Unlocking the Potential: Multi-task Deep Learning for Spaceborne Quantitative Monitoring of Fugitive Methane Plumes

Authors: Guoxin Si, Shiliang Fu, Wei Yao

Abstract: As global warming intensifies, increased attention is being paid to monitoring fugitive methane emissions and detecting gas plumes from landfills. We have divided methane emission monitoring into three subtasks: methane concentration inversion, plume segmentation, and emission rate estimation. Traditional algorithms face certain limitations: methane concentration inversion typically employs the ma… ▽ More As global warming intensifies, increased attention is being paid to monitoring fugitive methane emissions and detecting gas plumes from landfills. We have divided methane emission monitoring into three subtasks: methane concentration inversion, plume segmentation, and emission rate estimation. Traditional algorithms face certain limitations: methane concentration inversion typically employs the matched filter, which is sensitive to the global spectrum distribution and prone to significant noise. There is scant research on plume segmentation, with many studies depending on manual segmentation, which can be subjective. The estimation of methane emission rate frequently uses the IME algorithm, which necessitates meteorological measurement data. Utilizing the WENT landfill site in Hong Kong along with PRISMA hyperspectral satellite imagery, we introduce a novel deep learning-based framework for quantitative methane emission monitoring from remote sensing images that is grounded in physical simulation. We create simulated methane plumes using large eddy simulation (LES) and various concentration maps of fugitive emissions using the radiative transfer equation (RTE), while applying augmentation techniques to construct a simulated PRISMA dataset. We train a U-Net network for methane concentration inversion, a Mask R-CNN network for methane plume segmentation, and a ResNet-50 network for methane emission rate estimation. All three deep networks yield higher validation accuracy compared to traditional algorithms. Furthermore, we combine the first two subtasks and the last two subtasks to design multi-task learning models, MTL-01 and MTL-02, both of which outperform single-task models in terms of accuracy. Our research exemplifies the application of multi-task deep learning to quantitative methane monitoring and can be generalized to a wide array of methane monitoring tasks. △ Less

Submitted 15 July, 2024; v1 submitted 23 January, 2024; originally announced January 2024.

arXiv:2401.10495 [pdf, ps, other]

Causal Layering via Conditional Entropy

Authors: Itai Feigenbaum, Devansh Arpit, Huan Wang, Shelby Heinecke, Juan Carlos Niebles, Weiran Yao, Caiming Xiong, Silvio Savarese

Abstract: Causal discovery aims to recover information about an unobserved causal graph from the observable data it generates. Layerings are orderings of the variables which place causes before effects. In this paper, we provide ways to recover layerings of a graph by accessing the data via a conditional entropy oracle, when distributions are discrete. Our algorithms work by repeatedly removing sources or s… ▽ More Causal discovery aims to recover information about an unobserved causal graph from the observable data it generates. Layerings are orderings of the variables which place causes before effects. In this paper, we provide ways to recover layerings of a graph by accessing the data via a conditional entropy oracle, when distributions are discrete. Our algorithms work by repeatedly removing sources or sinks from the graph. Under appropriate assumptions and conditioning, we can separate the sources or sinks from the remainder of the nodes by comparing their conditional entropy to the unconditional entropy of their noise. Our algorithms are provably correct and run in worst-case quadratic time. The main assumptions are faithfulness and injective noise, and either known noise entropies or weakly monotonically increasing noise entropies along directed paths. In addition, we require one of either a very mild extension of faithfulness, or strictly monotonically increasing noise entropies, or expanding noise injectivity to include an additional single argument in the structural functions. △ Less

Submitted 19 January, 2024; originally announced January 2024.

arXiv:2401.08734 [pdf, other]

Bag of Tricks to Boost Adversarial Transferability

Authors: Zeliang Zhang, Wei Yao, Xiaosen Wang

Abstract: Deep neural networks are widely known to be vulnerable to adversarial examples. However, vanilla adversarial examples generated under the white-box setting often exhibit low transferability across different models. Since adversarial transferability poses more severe threats to practical applications, various approaches have been proposed for better transferability, including gradient-based, input… ▽ More Deep neural networks are widely known to be vulnerable to adversarial examples. However, vanilla adversarial examples generated under the white-box setting often exhibit low transferability across different models. Since adversarial transferability poses more severe threats to practical applications, various approaches have been proposed for better transferability, including gradient-based, input transformation-based, and model-related attacks, \etc. In this work, we find that several tiny changes in the existing adversarial attacks can significantly affect the attack performance, \eg, the number of iterations and step size. Based on careful studies of existing adversarial attacks, we propose a bag of tricks to enhance adversarial transferability, including momentum initialization, scheduled step size, dual example, spectral-based input transformation, and several ensemble strategies. Extensive experiments on the ImageNet dataset validate the high effectiveness of our proposed tricks and show that combining them can further boost adversarial transferability. Our work provides practical insights and techniques to enhance adversarial transferability, and offers guidance to improve the attack performance on the real-world application through simple adjustments. △ Less

Submitted 19 July, 2024; v1 submitted 16 January, 2024; originally announced January 2024.

arXiv:2401.07526 [pdf, other]

Editing Arbitrary Propositions in LLMs without Subject Labels

Authors: Itai Feigenbaum, Devansh Arpit, Huan Wang, Shelby Heinecke, Juan Carlos Niebles, Weiran Yao, Caiming Xiong, Silvio Savarese

Abstract: Large Language Model (LLM) editing modifies factual information in LLMs. Locate-and-Edit (L\&E) methods accomplish this by finding where relevant information is stored within the neural network, and editing the weights at that location. The goal of editing is to modify the response of an LLM to a proposition independently of its phrasing, while not modifying its response to other related propositi… ▽ More Large Language Model (LLM) editing modifies factual information in LLMs. Locate-and-Edit (L\&E) methods accomplish this by finding where relevant information is stored within the neural network, and editing the weights at that location. The goal of editing is to modify the response of an LLM to a proposition independently of its phrasing, while not modifying its response to other related propositions. Existing methods are limited to binary propositions, which represent straightforward binary relations between a subject and an object. Furthermore, existing methods rely on semantic subject labels, which may not be available or even be well-defined in practice. In this paper, we show that both of these issues can be effectively skirted with a simple and fast localization method called Gradient Tracing (GT). This localization method allows editing arbitrary propositions instead of just binary ones, and does so without the need for subject labels. As propositions always have a truth value, our experiments prompt an LLM as a boolean classifier, and edit its T/F response to propositions. Our method applies GT for location tracing, and then edit the model at that location using a mild variant of Rank-One Model Editing (ROME). On datasets of binary propositions derived from the CounterFact dataset, we show that our method -- without access to subject labels -- performs close to state-of-the-art L\&E methods which has access subject labels. We then introduce a new dataset, Factual Accuracy Classification Test (FACT), which includes non-binary propositions and for which subject labels are not generally applicable, and therefore is beyond the scope of existing L\&E methods. Nevertheless, we show that with our method editing is possible on FACT. △ Less

Submitted 15 January, 2024; originally announced January 2024.

arXiv:2401.05159 [pdf, other]

Derm-T2IM: Harnessing Synthetic Skin Lesion Data via Stable Diffusion Models for Enhanced Skin Disease Classification using ViT and CNN

Authors: Muhammad Ali Farooq, Wang Yao, Michael Schukat, Mark A Little, Peter Corcoran

Abstract: This study explores the utilization of Dermatoscopic synthetic data generated through stable diffusion models as a strategy for enhancing the robustness of machine learning model training. Synthetic data generation plays a pivotal role in mitigating challenges associated with limited labeled datasets, thereby facilitating more effective model training. In this context, we aim to incorporate enhanc… ▽ More This study explores the utilization of Dermatoscopic synthetic data generated through stable diffusion models as a strategy for enhancing the robustness of machine learning model training. Synthetic data generation plays a pivotal role in mitigating challenges associated with limited labeled datasets, thereby facilitating more effective model training. In this context, we aim to incorporate enhanced data transformation techniques by extending the recent success of few-shot learning and a small amount of data representation in text-to-image latent diffusion models. The optimally tuned model is further used for rendering high-quality skin lesion synthetic data with diverse and realistic characteristics, providing a valuable supplement and diversity to the existing training data. We investigate the impact of incorporating newly generated synthetic data into the training pipeline of state-of-art machine learning models, assessing its effectiveness in enhancing model performance and generalization to unseen real-world data. Our experimental results demonstrate the efficacy of the synthetic data generated through stable diffusion models helps in improving the robustness and adaptability of end-to-end CNN and vision transformer models on two different real-world skin lesion datasets. △ Less

Submitted 10 January, 2024; originally announced January 2024.

Comments: Paper is submitted in EMBC 2024 Conference

arXiv:2401.03601 [pdf, other]

InFoBench: Evaluating Instruction Following Ability in Large Language Models

Authors: Yiwei Qin, Kaiqiang Song, Yebowen Hu, Wenlin Yao, Sangwoo Cho, Xiaoyang Wang, Xuansheng Wu, Fei Liu, Pengfei Liu, Dong Yu

Abstract: This paper introduces the Decomposed Requirements Following Ratio (DRFR), a new metric for evaluating Large Language Models' (LLMs) ability to follow instructions. Addressing a gap in current methodologies, DRFR breaks down complex instructions into simpler criteria, facilitating a detailed analysis of LLMs' compliance with various aspects of tasks. Alongside this metric, we present InFoBench, a b… ▽ More This paper introduces the Decomposed Requirements Following Ratio (DRFR), a new metric for evaluating Large Language Models' (LLMs) ability to follow instructions. Addressing a gap in current methodologies, DRFR breaks down complex instructions into simpler criteria, facilitating a detailed analysis of LLMs' compliance with various aspects of tasks. Alongside this metric, we present InFoBench, a benchmark comprising 500 diverse instructions and 2,250 decomposed questions across multiple constraint categories. Our experiments compare DRFR with traditional scoring methods and explore annotation sources, including human experts, crowd-sourced workers, and GPT-4. The findings demonstrate DRFR's higher reliability and the effectiveness of using GPT-4 as a cost-efficient annotator. The evaluation of several advanced LLMs using this framework reveals their strengths and areas needing improvement, particularly in complex instruction-following. This study contributes a novel metric and benchmark, offering insights for future LLM development and evaluation. △ Less

Submitted 7 January, 2024; originally announced January 2024.

arXiv:2401.01730 [pdf, other]

STAF: 3D Human Mesh Recovery from Video with Spatio-Temporal Alignment Fusion

Authors: Wei Yao, Hongwen Zhang, Yunlian Sun, Jinhui Tang

Abstract: The recovery of 3D human mesh from monocular images has significantly been developed in recent years. However, existing models usually ignore spatial and temporal information, which might lead to mesh and image misalignment and temporal discontinuity. For this reason, we propose a novel Spatio-Temporal Alignment Fusion (STAF) model. As a video-based model, it leverages coherence clues from human m… ▽ More The recovery of 3D human mesh from monocular images has significantly been developed in recent years. However, existing models usually ignore spatial and temporal information, which might lead to mesh and image misalignment and temporal discontinuity. For this reason, we propose a novel Spatio-Temporal Alignment Fusion (STAF) model. As a video-based model, it leverages coherence clues from human motion by an attention-based Temporal Coherence Fusion Module (TCFM). As for spatial mesh-alignment evidence, we extract fine-grained local information through predicted mesh projection on the feature maps. Based on the spatial features, we further introduce a multi-stage adjacent Spatial Alignment Fusion Module (SAFM) to enhance the feature representation of the target frame. In addition to the above, we propose an Average Pooling Module (APM) to allow the model to focus on the entire input sequence rather than just the target frame. This method can remarkably improve the smoothness of recovery results from video. Extensive experiments on 3DPW, MPII3D, and H36M demonstrate the superiority of STAF. We achieve a state-of-the-art trade-off between precision and smoothness. Our code and more video results are on the project page https://yw0208.github.io/staf/ △ Less

Submitted 3 January, 2024; originally announced January 2024.

Comments: Project Page: https://yw0208.github.io/staf/

arXiv:2312.11336 [pdf, other]

DRDT: Dynamic Reflection with Divergent Thinking for LLM-based Sequential Recommendation

Authors: Yu Wang, Zhiwei Liu, Jianguo Zhang, Weiran Yao, Shelby Heinecke, Philip S. Yu

Abstract: The rise of Large Language Models (LLMs) has sparked interest in their application to sequential recommendation tasks as they can provide supportive item information. However, due to the inherent complexities of sequential recommendation, such as sequential patterns across datasets, noise within sequences, and the temporal evolution of user preferences, existing LLM reasoning strategies, such as i… ▽ More The rise of Large Language Models (LLMs) has sparked interest in their application to sequential recommendation tasks as they can provide supportive item information. However, due to the inherent complexities of sequential recommendation, such as sequential patterns across datasets, noise within sequences, and the temporal evolution of user preferences, existing LLM reasoning strategies, such as in-context learning and chain-of-thought are not fully effective. To address these challenges, we introduce a novel reasoning principle: Dynamic Reflection with Divergent Thinking within a retriever-reranker framework. Our approach starts with a collaborative in-context demonstration retriever, which collects sequences exhibiting collaborative behaviors as in-context examples. Following this, we abstract high-level user preferences across multiple aspects, providing a more nuanced understanding of user interests and circumventing the noise within the raw sequences. The cornerstone of our methodology is dynamic reflection, a process that emulates human learning through probing, critiquing, and reflecting, using user feedback to tailor the analysis more effectively to the target user in a temporal manner. We evaluate our approach on three datasets using six pre-trained LLMs. The superior performance observed across these models demonstrates the efficacy of our reasoning strategy, notably achieved without the need to fine-tune the LLMs. With our principle, we managed to outperform GPT-Turbo-3.5 on three datasets using 7b models e.g., Vicuna-7b and Openchat-7b on NDCG@10. This research not only highlights the potential of LLMs in enhancing sequential recommendation systems but also underscores the importance of developing tailored reasoning strategies to fully harness their capabilities. △ Less

Submitted 18 December, 2023; originally announced December 2023.

arXiv:2312.02772 [pdf, other]

FG-MDM: Towards Zero-Shot Human Motion Generation via Fine-Grained Descriptions

Authors: Xu Shi, Wei Yao, Chuanchen Luo, Junran Peng, Hongwen Zhang, Yunlian Sun

Abstract: Recently, significant progress has been made in text-based motion generation, enabling the generation of diverse and high-quality human motions that conform to textual descriptions. However, generating motions beyond the distribution of original datasets remains challenging, i.e., zero-shot generation. By adopting a divide-and-conquer strategy, we propose a new framework named Fine-Grained Human M… ▽ More Recently, significant progress has been made in text-based motion generation, enabling the generation of diverse and high-quality human motions that conform to textual descriptions. However, generating motions beyond the distribution of original datasets remains challenging, i.e., zero-shot generation. By adopting a divide-and-conquer strategy, we propose a new framework named Fine-Grained Human Motion Diffusion Model (FG-MDM) for zero-shot human motion generation. Specifically, we first parse previous vague textual annotations into fine-grained descriptions of different body parts by leveraging a large language model. We then use these fine-grained descriptions to guide a transformer-based diffusion model, which further adopts a design of part tokens. FG-MDM can generate human motions beyond the scope of original datasets owing to descriptions that are closer to motion essence. Our experimental results demonstrate the superiority of FG-MDM over previous methods in zero-shot settings. We will release our fine-grained textual annotations for HumanML3D and KIT. △ Less

Submitted 23 April, 2024; v1 submitted 5 December, 2023; originally announced December 2023.

Comments: Project Page: https://sx0207.github.io/fg-mdm/

arXiv:2311.18340 [pdf, other]

Neuromorphic Incremental on-chip Learning with Hebbian Weight Consolidation

Authors: Zifan Ning, Chaojin Chen, Xiang Cheng, Wangzi Yao, Tielin Zhang, Bo Xu

Abstract: As next-generation implantable brain-machine interfaces become pervasive on edge device, incrementally learning new tasks in bio-plasticity ways is urgently demanded for Neuromorphic chips. Due to the inherent characteristics of its structure, spiking neural networks are naturally well-suited for BMI-chips. Here we propose Hebbian Weight Consolidation, as well as an on-chip learning framework. HWC… ▽ More As next-generation implantable brain-machine interfaces become pervasive on edge device, incrementally learning new tasks in bio-plasticity ways is urgently demanded for Neuromorphic chips. Due to the inherent characteristics of its structure, spiking neural networks are naturally well-suited for BMI-chips. Here we propose Hebbian Weight Consolidation, as well as an on-chip learning framework. HWC selectively masks synapse modifications for previous tasks, retaining them to store new knowledge from subsequent tasks while preserving the old knowledge. Leveraging the bio-plasticity of dendritic spines, the intrinsic self-organizing nature of Hebbian Weight Consolidation aligns naturally with the incremental learning paradigm, facilitating robust learning outcomes. By reading out spikes layer by layer and performing back-propagation on the external micro-controller unit, MLoC can efficiently accomplish on-chip learning. Experiments show that our HWC algorithm up to 23.19% outperforms lower bound that without incremental learning algorithm, particularly in more challenging monkey behavior decoding scenarios. Taking into account on-chip computing on Synsense Speck 2e chip, our proposed algorithm exhibits an improvement of 11.06%. This study demonstrates the feasibility of employing incremental learning for high-performance neural signal decoding in next-generation brain-machine interfaces. △ Less

Submitted 30 November, 2023; originally announced November 2023.

Comments: 12 pages, 6 figures

arXiv:2311.17460 [pdf, other]

W-HMR: Human Mesh Recovery in World Space with Weak-supervised Camera Calibration and Orientation Correction

Authors: Wei Yao, Hongwen Zhang, Yunlian Sun, Jinhui Tang

Abstract: For a long time, in reconstructing 3D human bodies from monocular images, most methods opted to simplify the task by minimizing the influence of the camera. Using a coarse focal length setting results in the reconstructed bodies not aligning well with distorted images. Ignoring camera rotation leads to an unrealistic reconstructed body pose in world space. Consequently, the application scenarios o… ▽ More For a long time, in reconstructing 3D human bodies from monocular images, most methods opted to simplify the task by minimizing the influence of the camera. Using a coarse focal length setting results in the reconstructed bodies not aligning well with distorted images. Ignoring camera rotation leads to an unrealistic reconstructed body pose in world space. Consequently, the application scenarios of existing methods are confined to controlled environments. When confronted with complex and diverse in-the-wild images, they struggle to achieve accurate and reasonable reconstruction in world space. To address the above issues, we propose W-HMR, which decouples global body recovery into camera calibration, local body recovery, and global body orientation correction. We design the first weak-supervised camera calibration method for body distortion, eliminating dependence on focal length labels and achieving finer mesh-image alignment. We propose a novel orientation correction module to allow the reconstructed human body to remain normal in world space. Decoupling body orientation and body pose enables our model to consider the accuracy in camera coordinate and the reasonableness in world coordinate simultaneously, expanding the range of applications. As a result, W-HMR achieves high-quality reconstruction in dual coordinate systems, particularly in challenging scenes. Codes and demos have been released on the project page https://yw0208.github.io/w-hmr/. △ Less

Submitted 24 March, 2024; v1 submitted 29 November, 2023; originally announced November 2023.

Comments: Project Page: https://yw0208.github.io/w-hmr/

arXiv:2311.14899 [pdf, other]

HyperDID: Hyperspectral Intrinsic Image Decomposition with Deep Feature Embedding

Authors: Zhiqiang Gong, Xian Zhou, Wen Yao, Xiaohu Zheng, Ping Zhong

Abstract: The dissection of hyperspectral images into intrinsic components through hyperspectral intrinsic image decomposition (HIID) enhances the interpretability of hyperspectral data, providing a foundation for more accurate classification outcomes. However, the classification performance of HIID is constrained by the model's representational ability. To address this limitation, this study rethinks hyper… ▽ More The dissection of hyperspectral images into intrinsic components through hyperspectral intrinsic image decomposition (HIID) enhances the interpretability of hyperspectral data, providing a foundation for more accurate classification outcomes. However, the classification performance of HIID is constrained by the model's representational ability. To address this limitation, this study rethinks hyperspectral intrinsic image decomposition for classification tasks by introducing deep feature embedding. The proposed framework, HyperDID, incorporates the Environmental Feature Module (EFM) and Categorical Feature Module (CFM) to extract intrinsic features. Additionally, a Feature Discrimination Module (FDM) is introduced to separate environment-related and category-related features. Experimental results across three commonly used datasets validate the effectiveness of HyperDID in improving hyperspectral image classification performance. This novel approach holds promise for advancing the capabilities of hyperspectral image analysis by leveraging deep feature embedding principles. The implementation of the proposed method could be accessed soon at https://github.com/shendu-sw/HyperDID for the sake of reproducibility. △ Less

Submitted 24 November, 2023; originally announced November 2023.

Comments: Submitted to IEEE TGRS

arXiv:2311.11182 [pdf, other]

Exponentially Convergent Algorithms for Supervised Matrix Factorization

Authors: Joowon Lee, Hanbaek Lyu, Weixin Yao

Abstract: Supervised matrix factorization (SMF) is a classical machine learning method that simultaneously seeks feature extraction and classification tasks, which are not necessarily a priori aligned objectives. Our goal is to use SMF to learn low-rank latent factors that offer interpretable, data-reconstructive, and class-discriminative features, addressing challenges posed by high-dimensional data. Train… ▽ More Supervised matrix factorization (SMF) is a classical machine learning method that simultaneously seeks feature extraction and classification tasks, which are not necessarily a priori aligned objectives. Our goal is to use SMF to learn low-rank latent factors that offer interpretable, data-reconstructive, and class-discriminative features, addressing challenges posed by high-dimensional data. Training SMF model involves solving a nonconvex and possibly constrained optimization with at least three blocks of parameters. Known algorithms are either heuristic or provide weak convergence guarantees for special cases. In this paper, we provide a novel framework that 'lifts' SMF as a low-rank matrix estimation problem in a combined factor space and propose an efficient algorithm that provably converges exponentially fast to a global minimizer of the objective with arbitrary initialization under mild assumptions. Our framework applies to a wide range of SMF-type problems for multi-class classification with auxiliary features. To showcase an application, we demonstrate that our algorithm successfully identified well-known cancer-associated gene groups for various cancers. △ Less

Submitted 18 November, 2023; originally announced November 2023.

Comments: 33 pages, 3 figures. arXiv admin note: substantial text overlap with arXiv:2206.06774

Journal ref: Neural Information Processing Systems 2023

arXiv:2311.10774 [pdf, other]

MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning

Authors: Fuxiao Liu, Xiaoyang Wang, Wenlin Yao, Jianshu Chen, Kaiqiang Song, Sangwoo Cho, Yaser Yacoob, Dong Yu

Abstract: With the rapid development of large language models (LLMs) and their integration into large multimodal models (LMMs), there has been impressive progress in zero-shot completion of user-oriented vision-language tasks. However, a gap remains in the domain of chart image understanding due to the distinct abstract components in charts. To address this, we introduce a large-scale MultiModal Chart Instr… ▽ More With the rapid development of large language models (LLMs) and their integration into large multimodal models (LMMs), there has been impressive progress in zero-shot completion of user-oriented vision-language tasks. However, a gap remains in the domain of chart image understanding due to the distinct abstract components in charts. To address this, we introduce a large-scale MultiModal Chart Instruction (\textbf{MMC-Instruction}) dataset comprising 600k instances supporting diverse tasks and chart types. Leveraging this data, we develop MultiModal Chart Assistant (\textbf{MMCA}), an LMM that achieves state-of-the-art performance on existing chart QA benchmarks. Recognizing the need for a comprehensive evaluation of LMM chart understanding, we also propose a MultiModal Chart Benchmark (\textbf{MMC-Benchmark}), a comprehensive human-annotated benchmark with nine distinct tasks evaluating reasoning capabilities over charts. Extensive experiments on MMC-Benchmark reveal the limitations of existing LMMs on correctly interpreting charts, even for the most recent GPT-4V model. Our work provides an instruction-tuning methodology and benchmark to advance multimodal understanding of charts. Code and data are available at https://github.com/FuxiaoLiu/MMC. △ Less

Submitted 15 April, 2024; v1 submitted 15 November, 2023; originally announced November 2023.

Comments: Accepted to NAACL 2024

arXiv:2311.06307 [pdf]

Synthetic Speaking Children -- Why We Need Them and How to Make Them

Authors: Muhammad Ali Farooq, Dan Bigioi, Rishabh Jain, Wang Yao, Mariam Yiwere, Peter Corcoran

Abstract: Contemporary Human Computer Interaction (HCI) research relies primarily on neural network models for machine vision and speech understanding of a system user. Such models require extensively annotated training datasets for optimal performance and when building interfaces for users from a vulnerable population such as young children, GDPR introduces significant complexities in data collection, mana… ▽ More Contemporary Human Computer Interaction (HCI) research relies primarily on neural network models for machine vision and speech understanding of a system user. Such models require extensively annotated training datasets for optimal performance and when building interfaces for users from a vulnerable population such as young children, GDPR introduces significant complexities in data collection, management, and processing. Motivated by the training needs of an Edge AI smart toy platform this research explores the latest advances in generative neural technologies and provides a working proof of concept of a controllable data generation pipeline for speech driven facial training data at scale. In this context, we demonstrate how StyleGAN2 can be finetuned to create a gender balanced dataset of children's faces. This dataset includes a variety of controllable factors such as facial expressions, age variations, facial poses, and even speech-driven animations with realistic lip synchronization. By combining generative text to speech models for child voice synthesis and a 3D landmark based talking heads pipeline, we can generate highly realistic, entirely synthetic, talking child video clips. These video clips can provide valuable, and controllable, synthetic training data for neural network models, bridging the gap when real data is scarce or restricted due to privacy regulations. △ Less

Submitted 8 November, 2023; originally announced November 2023.

Comments: Presented at SpeD 23

arXiv:2311.05374 [pdf, other]

TencentLLMEval: A Hierarchical Evaluation of Real-World Capabilities for Human-Aligned LLMs

Authors: Shuyi Xie, Wenlin Yao, Yong Dai, Shaobo Wang, Donlin Zhou, Lifeng Jin, Xinhua Feng, Pengzhi Wei, Yujie Lin, Zhichao Hu, Dong Yu, Zhengyou Zhang, Jing Nie, Yuhong Liu

Abstract: Large language models (LLMs) have shown impressive capabilities across various natural language tasks. However, evaluating their alignment with human preferences remains a challenge. To this end, we propose a comprehensive human evaluation framework to assess LLMs' proficiency in following instructions on diverse real-world tasks. We construct a hierarchical task tree encompassing 7 major areas co… ▽ More Large language models (LLMs) have shown impressive capabilities across various natural language tasks. However, evaluating their alignment with human preferences remains a challenge. To this end, we propose a comprehensive human evaluation framework to assess LLMs' proficiency in following instructions on diverse real-world tasks. We construct a hierarchical task tree encompassing 7 major areas covering over 200 categories and over 800 tasks, which covers diverse capabilities such as question answering, reasoning, multiturn dialogue, and text generation, to evaluate LLMs in a comprehensive and in-depth manner. We also design detailed evaluation standards and processes to facilitate consistent, unbiased judgments from human evaluators. A test set of over 3,000 instances is released, spanning different difficulty levels and knowledge domains. Our work provides a standardized methodology to evaluate human alignment in LLMs for both English and Chinese. We also analyze the feasibility of automating parts of evaluation with a strong LLM (GPT-4). Our framework supports a thorough assessment of LLMs as they are integrated into real-world applications. We have made publicly available the task tree, TencentLLMEval dataset, and evaluation methodology which have been demonstrated as effective in assessing the performance of Tencent Hunyuan LLMs. By doing so, we aim to facilitate the benchmarking of advances in the development of safe and human-aligned LLMs. △ Less

Submitted 9 November, 2023; originally announced November 2023.

arXiv:2311.01696 [pdf, other]

Universal Perturbation-based Secret Key-Controlled Data Hiding

Authors: Donghua Wang, Wen Yao, Tingsong Jiang, Xiaoqian Chen

Abstract: Deep neural networks (DNNs) are demonstrated to be vulnerable to universal perturbation, a single quasi-perceptible perturbation that can deceive the DNN on most images. However, the previous works are focused on using universal perturbation to perform adversarial attacks, while the potential usability of universal perturbation as data carriers in data hiding is less explored, especially for the k… ▽ More Deep neural networks (DNNs) are demonstrated to be vulnerable to universal perturbation, a single quasi-perceptible perturbation that can deceive the DNN on most images. However, the previous works are focused on using universal perturbation to perform adversarial attacks, while the potential usability of universal perturbation as data carriers in data hiding is less explored, especially for the key-controlled data hiding method. In this paper, we propose a novel universal perturbation-based secret key-controlled data-hiding method, realizing data hiding with a single universal perturbation and data decoding with the secret key-controlled decoder. Specifically, we optimize a single universal perturbation, which serves as a data carrier that can hide multiple secret images and be added to most cover images. Then, we devise a secret key-controlled decoder to extract different secret images from the single container image constructed by the universal perturbation by using different secret keys. Moreover, a suppress loss function is proposed to prevent the secret image from leakage. Furthermore, we adopt a robust module to boost the decoder's capability against corruption. Finally, A co-joint optimization strategy is proposed to find the optimal universal perturbation and decoder. Extensive experiments are conducted on different datasets to demonstrate the effectiveness of the proposed method. Additionally, the physical test performed on platforms (e.g., WeChat and Twitter) verifies the usability of the proposed method in practice. △ Less

Submitted 2 November, 2023; originally announced November 2023.

Comments: 18 pages, 8 tables, 10 figures

arXiv:2311.00976 [pdf, other]

doi 10.1109/TRO.2023.3266994

Spontaneous-Ordering Platoon Control for Multirobot Path Navigation Using Guiding Vector Fields

Authors: Bin-Bin Hu, Hai-Tao Zhang, Weijia Yao, Jianing Ding, Ming Cao

Abstract: In this paper, we propose a distributed guiding-vector-field (DGVF) algorithm for a team of robots to form a spontaneous-ordering platoon moving along a predefined desired path in the n-dimensional Euclidean space. Particularly, by adding a path parameter as an additional virtual coordinate to each robot, the DGVF algorithm can eliminate the singular points where the vector fields vanish, and gove… ▽ More In this paper, we propose a distributed guiding-vector-field (DGVF) algorithm for a team of robots to form a spontaneous-ordering platoon moving along a predefined desired path in the n-dimensional Euclidean space. Particularly, by adding a path parameter as an additional virtual coordinate to each robot, the DGVF algorithm can eliminate the singular points where the vector fields vanish, and govern robots to approach a closed and even self-intersecting desired path. Then, the interactions among neighboring robots and a virtual target robot through their virtual coordinates enable the realization of the desired platoon; in particular, relative parametric displacements can be achieved with arbitrary ordering sequences. Rigorous analysis is provided to guarantee the global convergence to the spontaneous-ordering platoon on the common desired path from any initial positions. 2D experiments using three HUSTER-0.3 unmanned surface vessels (USVs) are conducted to validate the practical effectiveness of the proposed DGVF algorithm, and 3D numerical simulations are presented to demonstrate its effectiveness and robustness when tackling higher-dimensional multi-robot path-navigation missions and some robots breakdown. △ Less

Submitted 1 November, 2023; originally announced November 2023.

Journal ref: IEEE Transaction on Robotics, 2023

arXiv:2310.18615 [pdf, other]

Temporally Disentangled Representation Learning under Unknown Nonstationarity

Authors: Xiangchen Song, Weiran Yao, Yewen Fan, Xinshuai Dong, Guangyi Chen, Juan Carlos Niebles, Eric Xing, Kun Zhang

Abstract: In unsupervised causal representation learning for sequential data with time-delayed latent causal influences, strong identifiability results for the disentanglement of causally-related latent variables have been established in stationary settings by leveraging temporal structure. However, in nonstationary setting, existing work only partially addressed the problem by either utilizing observed aux… ▽ More In unsupervised causal representation learning for sequential data with time-delayed latent causal influences, strong identifiability results for the disentanglement of causally-related latent variables have been established in stationary settings by leveraging temporal structure. However, in nonstationary setting, existing work only partially addressed the problem by either utilizing observed auxiliary variables (e.g., class labels and/or domain indexes) as side information or assuming simplified latent causal dynamics. Both constrain the method to a limited range of scenarios. In this study, we further explored the Markov Assumption under time-delayed causally related process in nonstationary setting and showed that under mild conditions, the independent latent components can be recovered from their nonlinear mixture up to a permutation and a component-wise transformation, without the observation of auxiliary variables. We then introduce NCTRL, a principled estimation framework, to reconstruct time-delayed latent causal variables and identify their relations from measured sequential data only. Empirical evaluations demonstrated the reliable identification of time-delayed latent causal influences, with our methodology substantially outperforming existing baselines that fail to exploit the nonstationarity adequately and then, consequently, cannot distinguish distribution shifts. △ Less

Submitted 28 October, 2023; originally announced October 2023.

Comments: NeurIPS 2023

arXiv:2310.18550 [pdf, other]

MultiScale Spectral-Spatial Convolutional Transformer for Hyperspectral Image Classification

Authors: Zhiqiang Gong, Xian Zhou, Wen Yao

Abstract: Due to the powerful ability in capturing the global information, Transformer has become an alternative architecture of CNNs for hyperspectral image classification. However, general Transformer mainly considers the global spectral information while ignores the multiscale spatial information of the hyperspectral image. In this paper, we propose a multiscale spectral-spatial convolutional Transformer… ▽ More Due to the powerful ability in capturing the global information, Transformer has become an alternative architecture of CNNs for hyperspectral image classification. However, general Transformer mainly considers the global spectral information while ignores the multiscale spatial information of the hyperspectral image. In this paper, we propose a multiscale spectral-spatial convolutional Transformer (MultiscaleFormer) for hyperspectral image classification. First, the developed method utilizes multiscale spatial patches as tokens to formulate the spatial Transformer and generates multiscale spatial representation of each band in each pixel. Second, the spatial representation of all the bands in a given pixel are utilized as tokens to formulate the spectral Transformer and generate the multiscale spectral-spatial representation of each pixel. Besides, a modified spectral-spatial CAF module is constructed in the MultiFormer to fuse cross-layer spectral and spatial information. Therefore, the proposed MultiFormer can capture the multiscale spectral-spatial information and provide better performance than most of other architectures for hyperspectral image classification. Experiments are conducted over commonly used real-world datasets and the comparison results show the superiority of the proposed method. △ Less

Submitted 27 October, 2023; originally announced October 2023.

Comments: submitted to IEEE GRSL

Showing 1–50 of 189 results for author: Yao, W