Search | arXiv e-print repository

doi 10.1145/3640457.3688113

Not All Videos Become Outdated: Short-Video Recommendation by Learning to Deconfound Release Interval Bias

Authors: Lulu Dong, Guoxiu He, Aixin Sun

Abstract: Short-video recommender systems often exhibit a biased preference for recently released videos. However, not all videos become outdated; certain classic videos can still attract users' attention. Such bias along the temporal dimension can be further aggravated by the matching model between users and videos, because the model learns from preexisting interactions. From real data, we observe that different videos have varying sensitivities to recency in attracting users' attention. Our analysis, based on a causal graph modeling short-video recommendation, suggests that the release interval serves as a confounder, establishing a backdoor path between users and videos. To address this confounding effect, we propose a model-agnostic causal architecture called Learning to Deconfound the Release Interval Bias (LDRI). LDRI enables joint learning of the matching model and the video recency sensitivity perceptron. In the inference stage, we apply a backdoor adjustment, effectively blocking the backdoor path by intervening on each video. Extensive experiments on two benchmarks demonstrate that LDRI consistently outperforms backbone models and exhibits superior performance against state-of-the-art models. Additional comprehensive analyses confirm the deconfounding capability of LDRI.

Submitted 30 August, 2024; originally announced August 2024.

Journal ref: RecSys 2024

arXiv:2408.09113 [pdf, other]

Planning of Off-Grid Renewable Power to Ammonia Systems with Heterogeneous Flexibility: A Multistakeholder Equilibrium Perspective

Authors: Yangjun Zeng, Yiwei Qiu, Jie Zhu, Shi Chen, Tianlei Zang, Buxiang Zhou, Ge He, Xu Ji

Abstract: Off-grid renewable power to ammonia (ReP2A) systems present a promising pathway toward carbon neutrality in both the energy and chemical industries. However, due to chemical safety requirements, the limited flexibility of ammonia synthesis poses a challenge when attempting to align with the variable hydrogen flow produced from renewable power. This necessitates the optimal sizing of equipment capacity for effective and coordinated production across the system. Additionally, an ReP2A system may involve multiple stakeholders with varying degrees of operational flexibility, complicating the planning problem. This paper first examines the multistakeholder sizing equilibrium (MSSE) of the ReP2A system. First, we propose an MSSE model that accounts for individual planning decisions and the competing economic interests of the stakeholders of power generation, hydrogen production, and ammonia synthesis. We then construct an equivalent optimization problem based on Karush-Kuhn-Tucker (KKT) conditions to determine the equilibrium. Following this, we decompose the problem in the temporal dimension and solve it via multicut generalized Benders decomposition (GBD) to address long-term balancing issues. Case studies based on a realistic project reveal that the equilibrium does not naturally balance the interests of all stakeholders due to their heterogeneous characteristics. Our findings suggest that benefit transfer agreements ensure mutual benefits and the successful implementation of ReP2A projects.

Submitted 17 August, 2024; originally announced August 2024.

arXiv:2408.08506 [pdf, other]

Ex3: Automatic Novel Writing by Extracting, Excelsior and Expanding

Authors: Huang Lei, Jiaming Guo, Guanhua He, Xishan Zhang, Rui Zhang, Shaohui Peng, Shaoli Liu, Tianshi Chen

Abstract: Generating long-term texts such as novels using artificial intelligence has always been a challenge. A common approach is to use large language models (LLMs) to construct a hierarchical framework that first plans and then writes. Despite the fact that the generated novels reach a sufficient length, they exhibit poor logical coherence and appeal in their plots and deficiencies in character and event depiction, ultimately compromising the overall narrative quality. In this paper, we propose a method named Extracting Excelsior and Expanding. Ex3 initially extracts structure information from raw novel data. By combining this structure information with the novel data, an instruction-following dataset is meticulously crafted. This dataset is then utilized to fine-tune the LLM, aiming for excelsior generation performance. In the final stage, a tree-like expansion method is deployed to facilitate the generation of arbitrarily long novels. Evaluation against previous methods showcases Ex3's ability to produce higher-quality long-form novels.

Submitted 15 August, 2024; originally announced August 2024.

arXiv:2408.01607 [pdf]

Deep Learning Meets OBIA: Tasks, Challenges, Strategies, and Perspectives

Authors: Lei Ma, Ziyun Yan, Mengmeng Li, Tao Liu, Liqin Tan, Xuan Wang, Weiqiang He, Ruikun Wang, Guangjun He, Heng Lu, Thomas Blaschke

Abstract: Deep learning has gained significant attention in remote sensing, especially in pixel- or patch-level applications. Despite initial attempts to integrate deep learning into object-based image analysis (OBIA), its full potential remains largely unexplored. In this article, as OBIA usage becomes more widespread, we conducted a comprehensive review and expansion of its task subdomains, with or without the integration of deep learning. Furthermore, we have identified and summarized five prevailing strategies to address the challenge of deep learning's limitations in directly processing unstructured object data within OBIA, and this review also recommends some important future research directions. Our goal with these endeavors is to inspire more exploration in this fascinating yet overlooked area and facilitate the integration of deep learning into OBIA processing workflows.

Submitted 2 August, 2024; originally announced August 2024.

arXiv:2407.19453 [pdf, other]

FIND: Fine-tuning Initial Noise Distribution with Policy Optimization for Diffusion Models

Authors: Changgu Chen, Libing Yang, Xiaoyan Yang, Lianggangxu Chen, Gaoqi He, Changbo Wang, Yang Li

Abstract: In recent years, large-scale pre-trained diffusion models have demonstrated their outstanding capabilities in image and video generation tasks. However, existing models tend to produce visual objects commonly found in the training dataset, which diverges from user input prompts. The underlying reason behind the inaccurate generated results lies in the model's difficulty in sampling from specific intervals of the initial noise distribution corresponding to the prompt. Moreover, it is challenging to directly optimize the initial distribution, given that the diffusion process involves multiple denoising steps. In this paper, we introduce a Fine-tuning Initial Noise Distribution (FIND) framework with policy optimization, which unleashes the powerful potential of pre-trained diffusion networks by directly optimizing the initial distribution to align the generated contents with user-input prompts. To this end, we first reformulate the diffusion denoising procedure as a one-step Markov decision process and employ policy optimization to directly optimize the initial distribution. In addition, a dynamic reward calibration module is proposed to ensure training stability during optimization. Furthermore, we introduce a ratio clipping algorithm to utilize historical data for network training and prevent the optimized distribution from deviating too far from the original policy to restrain excessive optimization magnitudes. Extensive experiments demonstrate the effectiveness of our method in both text-to-image and text-to-video tasks, surpassing SOTA methods in achieving consistency between prompts and the generated content. Our method is 10 times faster than the SOTA approach. Our homepage is available at https://github.com/vpx-ecnu/FIND-website.

Submitted 28 July, 2024; originally announced July 2024.

arXiv:2407.13205 [pdf, ps, other]

Transformer-based Single-Cell Language Model: A Survey

Authors: Wei Lan, Guohang He, Mingyang Liu, Qingfeng Chen, Junyue Cao, Wei Peng

Abstract: Transformers have achieved significant success in natural language processing owing to their outstanding parallel processing capabilities and highly flexible attention mechanism. In addition, a growing number of transformer-based studies have been proposed to model single-cell data. In this review, we attempt to systematically summarize the single-cell language models and applications based on transformers. First, we provide a detailed introduction to the structure and principles of transformers. Then, we review the single-cell language models and large language models for single-cell data analysis. Moreover, we explore the datasets and applications of single-cell language models in downstream tasks such as batch correction, cell clustering, cell type annotation, gene regulatory network inference and perturbation response. Further, we discuss the challenges of single-cell language models and provide promising research directions. We hope this review will serve as an up-to-date reference for researchers interested in the direction of single-cell language models.

Submitted 18 July, 2024; originally announced July 2024.

arXiv:2407.12678 [pdf, other]

Promptable Counterfactual Diffusion Model for Unified Brain Tumor Segmentation and Generation with MRIs

Authors: Yiqing Shen, Guannan He, Mathias Unberath

Abstract: Brain tumor analysis in Magnetic Resonance Imaging (MRI) is crucial for accurate diagnosis and treatment planning. However, the task remains challenging due to the complexity and variability of tumor appearances, as well as the scarcity of labeled data. Traditional approaches often address tumor segmentation and image generation separately, limiting their effectiveness in capturing the intricate relationships between healthy and pathological tissue structures. We introduce a novel promptable counterfactual diffusion model as a unified solution for brain tumor segmentation and generation in MRI. The key innovation lies in our mask-level prompting mechanism at the sampling stage, which enables guided generation and manipulation of specific healthy or unhealthy regions in MRI images. Specifically, the model's architecture allows for bidirectional inference, which can segment tumors in existing images and generate realistic tumor structures in healthy brain scans. Furthermore, we present a two-step approach for tumor generation and position transfer, showcasing the model's versatility in synthesizing realistic tumor structures. Experiments on the BRATS2021 dataset demonstrate that our method outperforms traditional counterfactual diffusion approaches, achieving a mean IoU of 0.653 and a mean Dice score of 0.785 for tumor segmentation, compared with 0.344 and 0.475 for the conventional counterfactual diffusion model. Our work contributes to improving brain tumor detection and segmentation accuracy, with potential implications for data augmentation and clinical decision support in neuro-oncology. The code is available at https://github.com/arcadelab/counterfactual_diffusion.

Submitted 17 July, 2024; originally announced July 2024.

arXiv:2407.11292 [pdf]

LoRA-PT: Low-Rank Adapting UNETR for Hippocampus Segmentation Using Principal Tensor Singular Values and Vectors

Authors: Guanghua He, Wangang Cheng, Hancan Zhu, Gaohang Yu

Abstract: The hippocampus is a crucial brain structure associated with various psychiatric disorders, and its automatic and precise segmentation is essential for studying these diseases. In recent years, deep learning-based methods have made significant progress in hippocampus segmentation. However, training deep neural network models requires substantial computational resources and time, as well as a large amount of labeled training data, which is often difficult to obtain in medical image segmentation. To address this issue, we propose a new parameter-efficient fine-tuning method called LoRA-PT. This method transfers the UNETR model pre-trained on the BraTS2021 dataset to the hippocampus segmentation task. Specifically, the LoRA-PT method categorizes the parameter matrix of the transformer structure into three sizes, forming three 3D tensors. Through tensor singular value decomposition, these tensors are decomposed to generate low-rank tensors with the principal singular values and singular vectors, while the remaining singular values and vectors form the residual tensor. During fine-tuning, we only update the low-rank tensors, i.e., the principal tensor singular values and vectors, while keeping the residual tensor unchanged. We validated the proposed method on three public hippocampus datasets. Experimental results show that LoRA-PT outperforms existing parameter-efficient fine-tuning methods in segmentation accuracy while significantly reducing the number of parameter updates. Our code is available at https://github.com/WangangCheng/LoRA-PT/tree/LoRA-PT.

Submitted 18 July, 2024; v1 submitted 15 July, 2024; originally announced July 2024.

arXiv:2407.05587 [pdf, other]

Flying Calligrapher: Contact-Aware Motion and Force Planning and Control for Aerial Manipulation

Authors: Xiaofeng Guo, Guanqi He, Jiahe Xu, Mohammadreza Mousaei, Junyi Geng, Sebastian Scherer, Guanya Shi

Abstract: Aerial manipulation has gained interest in completing high-altitude tasks that are challenging for human workers, such as contact inspection and defect detection. Previous research has focused on maintaining static contact points or forces. This letter addresses a more general and dynamic task: simultaneously tracking time-varying contact force in the surface normal direction and motion trajectories on tangential surfaces. We propose a pipeline that includes a contact-aware trajectory planner to generate dynamically feasible trajectories, and a hybrid motion-force controller to track such trajectories. We demonstrate the approach in an aerial calligraphy task using a novel sponge pen design as the end-effector, whose stroke width is proportional to the contact force. Additionally, we develop a touchscreen interface for flexible user input. Experiments show our method can effectively draw diverse letters, achieving an IoU of 0.59 and an end-effector position (force) tracking RMSE of 2.9 cm (0.7 N). Website: https://xiaofeng-guo.github.io/flying-calligrapher/

Submitted 7 July, 2024; originally announced July 2024.

Comments: 8 pages, 9 figures, 1 table

arXiv:2407.04202 [pdf, other]

Reverse Engineering the Fly Brain Using FlyCircuit Database

Authors: Yu-Tai Ching, Chin-Ping Cho, Fu-Kai Tang, Yi-Chiun Chang, Chang-Chieh Cheng, Guan-Wei He, Ann-Shyn Chang, Chaochun Chuang

Abstract: A method for reverse engineering a fly brain using the FlyCircuit database is presented. This method was designed based on the assumption that similar neurons could serve identical functions. We thus cluster the neurons based on the similarity between neurons. The procedure is to partition the neurons in the database into groups, and then assemble the groups into potential modules. Some of the modules obtained correspond to known neuropils, including the Medulla. The same clustering algorithm was applied to analyze the Medulla's structure. Another possible application of the clustering result is to study the brain-wide neuron connectome by looking at the connectivity between groups of neurons.

Submitted 4 July, 2024; originally announced July 2024.

arXiv:2406.16622 [pdf, other]

Simultaneous Generation of Quantum Frequency Combs across Distinct Modal Families in a Single $Si_3 N_4$ Whispering Gallery Mode Resonator

Authors: Bo Ji, Nianqin Li, Guangqiang He

Abstract: Quantum frequency combs (QFCs) are versatile resources for multi-mode entanglement, such as cluster states, crucial for quantum communication and computation. On-chip whispering gallery mode resonators (WGMRs) can generate these states at ultra-low threshold power. In this paper, we demonstrate the simultaneous generation of three QFCs using a single on-chip $Si_3N_4$ WGMR across distinct modal families. We designed a micro-ring resonator with a radius of 240 $μm$, capable of supporting four modal families within the 130 to 260 $THz$ frequency range for consistency regulation. Our results indicate that, by carefully designing the structure of $Si_3 N_4$ WGMRs, it is possible to generate quantum entangled frequency combs across distinct modal families simultaneously using monochromatic pump light. This is achieved by modulating the pump mode profiles with a spatial light modulator (SLM). Our approach offers a simple and low-cost method to achieve higher-density entanglement integration on-chip.

Submitted 24 June, 2024; originally announced June 2024.

Comments: 14 pages, 8 figures

arXiv:2406.11857 [pdf, other]

AI Royalties -- an IP Framework to Compensate Artists & IP Holders for AI-Generated Content

Authors: Pablo Ducru, Jonathan Raiman, Ronaldo Lemos, Clay Garner, George He, Hanna Balcha, Gabriel Souto, Sergio Branco, Celina Bottino

Abstract: This article investigates how AI-generated content can disrupt central revenue streams of the creative industries, in particular the collection of dividends from intellectual property (IP) rights. It reviews the IP and copyright questions related to the input and output of generative AI systems. A systematic method is proposed to assess whether AI-generated outputs, especially images, infringe previous copyrights, using a similarity metric (CLIP) between images against historical copyright rulings. An examination (economic and technical feasibility) of previously proposed compensation frameworks reveals their financial implications for creatives and IP holders. Lastly, we propose a novel IP framework for compensation of artists and IP holders based on their published "licensed AIs" as a new medium and asset from which to collect AI royalties.

Submitted 5 April, 2024; originally announced June 2024.

Comments: 7 pages, 2 figures, submitted to AAAI

arXiv:2406.06626 [pdf, other]

Benchmarking Neural Decoding Backbones towards Enhanced On-edge iBCI Applications

Authors: Zhou Zhou, Guohang He, Zheng Zhang, Luziwei Leng, Qinghai Guo, Jianxing Liao, Xuan Song, Ran Cheng

Abstract: Traditional invasive Brain-Computer Interfaces (iBCIs) typically depend on neural decoding processes conducted on workstations within laboratory settings, which prevents their everyday usage. Implementing these decoding processes on edge devices, such as wearables, introduces considerable challenges related to computational demands, processing speed, and maintaining accuracy. This study seeks to identify an optimal neural decoding backbone that boasts robust performance and swift inference capabilities suitable for edge deployment. We executed a series of neural decoding experiments involving nonhuman primates engaged in random reaching tasks, evaluating four prospective models, Gated Recurrent Unit (GRU), Transformer, Receptance Weighted Key Value (RWKV), and Selective State Space model (Mamba), across several metrics: single-session decoding, multi-session decoding, new session fine-tuning, inference speed, calibration speed, and scalability. The findings indicate that although the GRU model delivers sufficient accuracy, the RWKV and Mamba models are preferable due to their superior inference and calibration speeds. Additionally, RWKV and Mamba comply with the scaling law, demonstrating improved performance with larger data sets and increased model sizes, whereas GRU shows less pronounced scalability, and the Transformer model requires computational resources that scale prohibitively. This paper presents a thorough comparative analysis of the four models in various scenarios. The results are pivotal in pinpointing an optimal backbone that can handle increasing data volumes and is viable for edge implementation. This analysis provides essential insights for ongoing research and practical applications in the field.

Submitted 7 June, 2024; originally announced June 2024.

arXiv:2405.15885 [pdf, other]

Diffusion Bridge Implicit Models

Authors: Kaiwen Zheng, Guande He, Jianfei Chen, Fan Bao, Jun Zhu

Abstract: Denoising diffusion bridge models (DDBMs) are a powerful variant of diffusion models for interpolating between two arbitrary paired distributions given as endpoints. Despite their promising performance in tasks like image translation, DDBMs require a computationally intensive sampling process that involves the simulation of a (stochastic) differential equation through hundreds of network evaluations. In this work, we present diffusion bridge implicit models (DBIMs) for accelerated sampling of diffusion bridges without extra training. We generalize DDBMs via a class of non-Markovian diffusion bridges defined on the discretized timesteps concerning sampling, which share the same training objective as DDBMs. These generalized diffusion bridges give rise to generative processes ranging from stochastic to deterministic (i.e., an implicit probabilistic model) while being up to 25$\times$ faster than the vanilla sampler of DDBMs. Moreover, the deterministic sampling procedure yielded by DBIMs enables faithful encoding and reconstruction by a booting noise used in the initial sampling step, and allows us to perform semantically meaningful interpolation in image translation tasks by regarding the booting noise as the latent variable.

Submitted 24 May, 2024; originally announced May 2024.

arXiv:2405.06083 [pdf, other]

Exploring Entanglement Spectrum and Phase Diagram in multi-electron Quantum Dot Chains

Authors: Guanjie He, Xin Wang

Abstract: We investigate the entanglement properties in semiconductor quantum dot systems modeled by the extended Hubbard model, focusing on the impact of potential energy variations and electron interactions within a four-site quantum dot spin chain. Our study explores local and pairwise entanglement across configurations with electron counts N=4 and N=6, under different potential energy settings. By adjusting the potential energy in specific dots and examining the entanglement across various interaction regimes, we identify significant variations in the ground states of quantum dots. Our results reveal that local potential modifications lead to notable redistributions of electron configurations, significantly affecting the entanglement properties. These changes are depicted in phase diagrams that show entanglement dependencies on interaction strengths and potential energy adjustments, highlighting complex entanglement dynamics and phase transitions triggered by inter-dot interactions.

Submitted 9 May, 2024; originally announced May 2024.

Comments: 14 pages, 9 figures

arXiv:2405.04233 [pdf, other]

Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models

Authors: Fan Bao, Chendong Xiang, Gang Yue, Guande He, Hongzhou Zhu, Kaiwen Zheng, Min Zhao, Shilong Liu, Yaole Wang, Jun Zhu

Abstract: We introduce Vidu, a high-performance text-to-video generator that is capable of producing 1080p videos up to 16 seconds in a single generation. Vidu is a diffusion model with U-ViT as its backbone, which unlocks the scalability and the capability for handling long videos. Vidu exhibits strong coherence and dynamism, and is capable of generating both realistic and imaginative videos, as well as understanding some professional photography techniques, on par with Sora -- the most powerful reported text-to-video generator. Finally, we perform initial experiments on other controllable video generation, including canny-to-video generation, video prediction and subject-driven generation, which demonstrate promising results.

Submitted 7 May, 2024; originally announced May 2024.

Comments: Project page at https://www.shengshu-ai.com/vidu

arXiv:2404.15493 [pdf, other]

Temperature dependent spin-phonon coupling of boron-vacancy centers in hexagonal boron nitride

Authors: Zhongyuan Liu, Ruotian Gong, Benchen Huang, Yu Jin, Xinyi Du, Guanghui He, Eli Janzen, Li Yang, Erik Henriksen, James Edgar, Giulia Galli, Chong Zu

Abstract: The negatively charged boron-vacancy center ($\mathrm{V}_{\mathrm{B}}^-$) in hexagonal boron nitride (hBN) has recently emerged as a highly promising quantum sensor. Compared to the nitrogen-vacancy (NV) center in diamond, the change with temperature of the spin transition energy of $\mathrm{V}_{\mathrm{B}}^-$ is more than an order of magnitude larger, making it a potential nanoscale thermometer with superior sensitivity. However, the underlying mechanism of the observed large temperature dependence remains an open question. In this work, using isotopically purified $\mathrm{h}{}^{10}\mathrm{B}{}^{15}\mathrm{N}$, we systematically characterize the zero-field splitting, hyperfine interaction, and spin relaxation time of $\mathrm{V}_{\mathrm{B}}^-$ from 10 to 350 K. We carry out first-principles calculations of the $\mathrm{V}_{\mathrm{B}}^-$ spin-phonon interaction and show that a second-order effect from finite-temperature phonon excitations is responsible for the observed changes in experiments. By fitting our experimental results to a physically motivated model, we extract the dominant phonon mode, which agrees well with our simulations. Finally, we investigate the dynamic nuclear spin polarization process at cryogenic temperatures. Our results provide key insights into $\mathrm{V}_{\mathrm{B}}^-$ centers and their utilization as nanoscale thermometers and phonon sensors.

Submitted 23 April, 2024; originally announced April 2024.

Comments: 7 pages, 4 figures and 1 table in main. 9 pages and 5 figures in supplementary

arXiv:2404.13640 [pdf, other]

Beyond Alignment: Blind Video Face Restoration via Parsing-Guided Temporal-Coherent Transformer

Authors: Kepeng Xu, Li Xu, Gang He, Wenxin Yu, Yunsong Li

Abstract: Multiple complex degradations are coupled in low-quality video faces in the real world. Therefore, blind video face restoration is a highly challenging ill-posed problem, requiring not only hallucinating high-fidelity details but also enhancing temporal coherence across diverse pose variations. Restoring each frame independently in a naive manner inevitably introduces temporal incoherence and artifacts from pose changes and keypoint localization errors. To address this, we propose the first blind video face restoration approach with a novel parsing-guided temporal-coherent transformer (PGTFormer) without pre-alignment. PGTFormer leverages semantic parsing guidance to select optimal face priors for generating temporally coherent artifact-free results. Specifically, we pre-train a temporal-spatial vector quantized auto-encoder on high-quality video face datasets to extract expressive context-rich priors. Then, the temporal parse-guided codebook predictor (TPCP) restores faces in different poses based on face parsing context cues without performing face pre-alignment. This strategy reduces artifacts and mitigates jitter caused by cumulative errors from face pre-alignment. Finally, the temporal fidelity regulator (TFR) enhances fidelity through temporal feature interaction and improves video temporal consistency. Extensive experiments on face videos show that our method outperforms previous face restoration baselines. The code will be released at https://github.com/kepengxu/PGTFormer.

Submitted 21 April, 2024; originally announced April 2024.

Comments: 9 pages

arXiv:2404.07577 [pdf, other]

Generating Comprehensive Lithium Battery Charging Data with Generative AI

Authors: Lidang Jiang, Changyan Hu, Sibei Ji, Hang Zhao, Junxiong Chen, Ge He

Abstract: In optimizing performance and extending the lifespan of lithium batteries, accurate state prediction is pivotal. Traditional regression and classification methods have achieved some success in battery state prediction. However, the efficacy of these data-driven approaches heavily relies on the availability and quality of public datasets. Additionally, generating electrochemical data predominantly through battery experiments is a lengthy and costly process, making it challenging to acquire high-quality electrochemical data. This difficulty, coupled with data incompleteness, significantly impacts prediction accuracy. Addressing these challenges, this study introduces the End of Life (EOL) and Equivalent Cycle Life (ECL) as conditions for generative AI models. By integrating an embedding layer into the CVAE model, we developed the Refined Conditional Variational Autoencoder (RCVAE). Through preprocessing data into a quasi-video format, our study achieves an integrated synthesis of electrochemical data, including voltage, current, temperature, and charging capacity, which is then processed by the RCVAE model. Coupled with customized training and inference algorithms, this model can generate specific electrochemical data for EOL and ECL under supervised conditions. This method provides users with a comprehensive electrochemical dataset, pioneering a new research domain for the artificial synthesis of lithium battery data. Furthermore, based on the detailed synthetic data, various battery state indicators can be calculated, offering new perspectives and possibilities for lithium battery performance prediction.

Submitted 11 April, 2024; originally announced April 2024.

arXiv:2404.03079 [pdf, other]

vPALs: Towards Verified Performance-aware Learning System For Resource Management

Authors: Guoliang He, Gingfung Yeung, Sheriffo Ceesay, Adam Barker

Abstract: Accurately predicting task performance at runtime in a cluster is advantageous for a resource management system to determine whether a task should be migrated due to performance degradation caused by interference. This is beneficial for both cluster operators and service owners. However, deploying performance prediction systems with learning methods requires sophisticated safeguard mechanisms due to the inherent stochastic and black-box natures of these models, such as Deep Neural Networks (DNNs). Vanilla Neural Networks (NNs) can be vulnerable to out-of-distribution data samples that can lead to sub-optimal decisions. To take a step towards a safe learning system in performance prediction, we propose vPALs, which leverage well-correlated system metrics and verification to produce safe performance prediction at runtime, providing an extra layer of safety when integrating learning techniques into cluster resource management systems. Our experiments show that vPALs can outperform vanilla NNs across our benchmark workload.

Submitted 3 April, 2024; originally announced April 2024.

Comments: presented at Deployable AI Workshop at AAAI-2024

arXiv:2404.00672 [pdf, other]

A General and Efficient Training for Transformer via Token Expansion

Authors: Wenxuan Huang, Yunhang Shen, Jiao Xie, Baochang Zhang, Gaoqi He, Ke Li, Xing Sun, Shaohui Lin

Abstract: The remarkable performance of Vision Transformers (ViTs) typically requires an extremely large training cost. Existing methods have attempted to accelerate the training of ViTs, yet typically disregard method universality with accuracy dropping. Meanwhile, they break the training consistency of the original transformers, including the consistency of hyper-parameters, architecture, and strategy, which prevents them from being widely applied to different Transformer networks. In this paper, we propose a novel token growth scheme Token Expansion (termed ToE) to achieve consistent training acceleration for ViTs. We introduce an "initialization-expansion-merging" pipeline to maintain the integrity of the intermediate feature distribution of original transformers, preventing the loss of crucial learnable information in the training process. ToE can not only be seamlessly integrated into the training and fine-tuning process of transformers (e.g., DeiT and LV-ViT), but is also effective for efficient training frameworks (e.g., EfficientTrain), without twisting the original training hyper-parameters or architecture, and without introducing additional training strategies. Extensive experiments demonstrate that ToE achieves about 1.3x faster training of ViTs in a lossless manner, or even with performance gains over the full-token training baselines. Code is available at https://github.com/Osilly/TokenExpansion.

Submitted 31 March, 2024; originally announced April 2024.

Comments: Accepted to CVPR 2024. Code is available at https://github.com/Osilly/TokenExpansion

arXiv:2403.17842 [pdf, other]

Experimental Realization of Discrete Time Quasi-Crystals

Authors: Guanghui He, Bingtian Ye, Ruotian Gong, Changyu Yao, Zhongyuan Liu, Kater W. Murch, Norman Y. Yao, Chong Zu

Abstract: Floquet (periodically driven) systems can give rise to unique non-equilibrium phases of matter without equilibrium analogs. The most prominent example is the realization of discrete time crystals. An intriguing question emerges: what other novel phases can manifest when the constraint of time periodicity is relaxed? In this study, we explore quantum systems subjected to a quasi-periodic drive. Leveraging a strongly interacting spin ensemble in diamond, we identify the emergence of long-lived discrete time quasi-crystals. Unlike conventional time crystals, time quasi-crystals exhibit robust sub-harmonic responses at multiple incommensurate frequencies. Furthermore, we show that the multi-frequency nature of the quasi-periodic drive allows for the formation of diverse patterns associated with different discrete time quasi-crystalline phases. Our findings demonstrate the existence of non-equilibrium phases in quasi-Floquet settings, significantly broadening the catalog of novel phenomena in driven many-body quantum systems.

Submitted 26 March, 2024; originally announced March 2024.

Comments: 7+5 pages, 4+5 figures

arXiv:2403.16863 [pdf, other]

SIP: Autotuning GPU Native Schedules via Stochastic Instruction Perturbation

Authors: Guoliang He, Eiko Yoneki

Abstract: Large language models (LLMs) have become a significant workload since their appearance. However, they are also computationally expensive as they have billions of parameters and are trained with massive amounts of data. Thus, recent works have developed dedicated CUDA kernels for LLM training and inference instead of relying on compiler-generated ones, so that hardware resources are as fully utilized as possible. In this work, we explore the possibility of GPU native instruction optimization to further push the CUDA kernels to extreme performance. Contrary to prior works, we adopt an automatic optimization approach by defining a search space of possible GPU native instruction schedules, and then we apply stochastic search to perform optimization. Experiments show that SIP can further improve CUDA kernel throughput by automatically discovering better GPU native instruction schedules, and the optimized schedules are tested with 10 million test samples.

Submitted 25 March, 2024; originally announced March 2024.

Comments: EuroMLSys 24, April 22, 2024, Athens, Greece

arXiv:2403.12707 [pdf, other]

Selective Domain-Invariant Feature for Generalizable Deepfake Detection

Authors: Yingxin Lai, Guoqing Yang, Yifan He, Zhiming Luo, Shaozi Li

Abstract: With diverse presentation forgery methods emerging continually, detecting the authenticity of images has drawn growing attention. Although existing methods have achieved impressive accuracy in training dataset detection, they still perform poorly in the unseen domain and suffer from forgery of irrelevant information such as background and identity, affecting generalizability. To solve this problem, we proposed a novel framework Selective Domain-Invariant Feature (SDIF), which reduces the sensitivity to face forgery by fusing content features and styles. Specifically, we first use a Farthest-Point Sampling (FPS) training strategy to construct a task-relevant style sample representation space for fusing with content features. Then, we propose a dynamic feature extraction module to generate features with diverse styles to improve the performance and effectiveness of the feature extractor. Finally, a domain separation strategy is used to retain domain-related features to help distinguish between real and fake faces. Both qualitative and quantitative results in existing benchmarks and proposals demonstrate the effectiveness of our approach.

Submitted 19 March, 2024; originally announced March 2024.

Comments: Accepted by ICASSP 2024

arXiv:2403.10913 [pdf, other]

DEFA: Efficient Deformable Attention Acceleration via Pruning-Assisted Grid-Sampling and Multi-Scale Parallel Processing

Authors: Yansong Xu, Dongxu Lyu, Zhenyu Li, Zilong Wang, Yuzhou Chen, Gang Wang, Zhican Wang, Haomin Li, Guanghui He

Abstract: Multi-scale deformable attention (MSDeformAttn) has emerged as a key mechanism in various vision tasks, demonstrating explicit superiority attributed to multi-scale grid-sampling. However, this newly introduced operator incurs irregular data access and enormous memory requirement, leading to severe PE underutilization. Meanwhile, existing approaches for attention acceleration cannot be directly applied to MSDeformAttn due to lack of support for this distinct procedure. Therefore, we propose a dedicated algorithm-architecture co-design dubbed DEFA, the first-of-its-kind method for MSDeformAttn acceleration. At the algorithm level, DEFA adopts frequency-weighted pruning and probability-aware pruning for feature maps and sampling points respectively, alleviating the memory footprint by over 80%. At the architecture level, it explores the multi-scale parallelism to boost the throughput significantly and further reduces the memory access via fine-grained layer fusion and feature map reusing. Extensively evaluated on representative benchmarks, DEFA achieves 10.1-31.9x speedup and 20.3-37.7x energy efficiency boost compared to powerful GPUs. It also rivals the related accelerators by 2.2-3.7x energy efficiency improvement while providing pioneering support for MSDeformAttn.

Submitted 16 March, 2024; originally announced March 2024.

Comments: Accepted to DAC 2024

arXiv:2402.16051 [pdf, other]

Two-body hadronic weak decays of bottomed hadrons

Authors: Ying Zhang, Guangzhao He, Quanxing Ye, Da-Cheng Yan, Jun Hua, Qian Wang

Abstract: The structure of light diquarks plays a crucial role in the formation of exotic hadrons beyond the conventional quark model, especially in their line shapes of bottomed hadron decays. We study the two-body hadronic weak decays of bottomed baryons and bottomed mesons to probe the light diquark structure and pin down the quark-quark correlations in the diquark picture. We find that the light diquark does not favor a compact structure. For instance, the isoscalar diquark $[ud]$ in $\Lambda_{b}^{0}$ can be easily split and rearranged to form $\Sigma_{c}^{(*)}\bar{D}^{(*)}$ via the color-suppressed transition. This provides a hint that the hidden charm pentaquark states produced in $\Lambda^0_b$ decays could be the $\Sigma_{c}^{(*)}\bar{D}^{(*)}$ hadronic molecular candidates. This quantitative study resolves the apparent conflicts between the production mechanism and molecular nature of these $P_c$ states observed in experiment.

Submitted 25 February, 2024; originally announced February 2024.

Comments: Accepted by Chinese Physics Letters

arXiv:2402.15368 [pdf, other]

Safe Task Planning for Language-Instructed Multi-Robot Systems using Conformal Prediction

Authors: Jun Wang, Guocheng He, Yiannis Kantaros

Abstract: This paper addresses task planning problems for language-instructed robot teams. Tasks are expressed in natural language (NL), requiring the robots to apply their capabilities at various locations and semantic objects. Several recent works have addressed similar planning problems by leveraging pre-trained Large Language Models (LLMs) to design effective multi-robot plans. However, these approaches lack mission completion guarantees. To address this challenge, we introduce a new decentralized LLM-based planner, called S-ATLAS for Safe plAnning for Teams of Language-instructed AgentS, that is capable of achieving user-defined mission success rates. This is accomplished by leveraging conformal prediction (CP), a distribution-free uncertainty quantification tool in black-box models. CP allows the proposed multi-robot planner to reason about its inherent uncertainty in a decentralized fashion, enabling robots to make individual decisions when they are sufficiently certain and seek help otherwise. We show, both theoretically and empirically, that the proposed planner can achieve user-specified task success rates while minimizing the overall number of help requests. We provide comparative experiments against related works showing that our method is significantly more computationally efficient and achieves lower help rates. The advantage of our algorithm over baselines becomes more pronounced with increasing robot team size.

Submitted 24 June, 2024; v1 submitted 23 February, 2024; originally announced February 2024.

arXiv:2402.14282 [pdf, other]

Extention of Bagging MARS with Group LASSO for Heterogeneous Treatment Effect Estimation

Authors: Guanwenqing He, Ke Wan, Kazushi Maruo, Toshio Shimokawa

Abstract: In recent years, large-scale clinical data such as patient surveys and medical records have played an increasing role in medical data science. These data, collectively referred to as "real-world data" (RWD), are expected to be widely used in large-scale observational studies of specific diseases, in personalized or precision medicine, and in identifying responders to drugs or treatments. Applying RWD to estimate heterogeneous treatment effects (HTE) has already become a trending topic. HTE has the potential to considerably impact the development of precision medicine by helping doctors make more informed, precise treatment decisions and provide more personalized medical care. The statistical models used to estimate HTE are called treatment effect models. Powers et al. proposed several treatment effect models for observational studies and pointed out that bagging causal MARS (BCM) performs outstandingly compared to other models. While BCM has excellent performance, it still has room for improvement. In this paper, we propose a new treatment effect model, called the shrinkage causal bagging MARS method, to improve their shared-basis conditional mean regression framework as follows: first, we estimate basis functions using the transformed outcome; then, we apply the group LASSO method to optimize the model and estimate parameters. In addition, we focus on pursuing better model interpretability to improve ethical acceptance. We designed simulations to verify the performance of our proposed method, which is superior in mean square error and bias in most simulation settings. We also applied it to the real data set ACTG 175 to verify its usability, where our results are supported by previous studies.

Submitted 21 February, 2024; originally announced February 2024.

Comments: 19 pages, 9 figures

arXiv:2402.08874 [pdf, other]

Tree-Based Hard Attention with Self-Motivation for Large Language Models

Authors: Chenxi Lin, Jiayu Ren, Guoxiu He, Zhuoren Jiang, Haiyan Yu, Xiaomin Zhu

Abstract: While large language models (LLMs) excel at understanding and generating plain text, they are not specifically tailored to handle hierarchical text structures. Extracting the task-desired property from their natural language responses typically necessitates additional processing steps. In fact, selectively comprehending the hierarchical structure of large-scale text is pivotal to understanding its substance. Aligning LLMs more closely with the classification or regression values of specific task through prompting also remains challenging. To this end, we propose a novel framework called Tree-Based Hard Attention with Self-Motivation for Large Language Models (TEAROOM). TEAROOM incorporates a tree-based hard attention mechanism for LLMs to process hierarchically structured text inputs. By leveraging prompting, it enables a frozen LLM to selectively focus on relevant leaves in relation to the root, generating a tailored symbolic representation of their relationship. Moreover, TEAROOM comprises a self-motivation strategy for another LLM equipped with a trainable adapter and a linear layer. The selected symbolic outcomes are integrated into another prompt, along with the predictive value of the task. We iteratively feed output values back into the prompt, enabling the trainable LLM to progressively approximate the golden truth. TEAROOM outperforms existing state-of-the-art methods in experimental evaluations across three benchmark datasets, showing its effectiveness in estimating task-specific properties. Through comprehensive experiments and analysis, we have validated the ability of TEAROOM to gradually approach the underlying golden truth through multiple inferences.

Submitted 13 February, 2024; originally announced February 2024.

arXiv:2402.05369 [pdf, other]

Noise Contrastive Alignment of Language Models with Explicit Rewards

Authors: Huayu Chen, Guande He, Lifan Yuan, Ganqu Cui, Hang Su, Jun Zhu

Abstract: User intentions are typically formalized as evaluation rewards to be maximized when fine-tuning language models (LMs). Existing alignment methods, such as Direct Preference Optimization (DPO), are mainly tailored for pairwise preference data where rewards are implicitly defined rather than explicitly given. In this paper, we introduce a general framework for LM alignment, leveraging Noise Contrastive Estimation (NCE) to bridge the gap in handling reward datasets explicitly annotated with scalar evaluations. Our framework comprises two parallel algorithms, NCA and InfoNCA, both enabling the direct extraction of an LM policy from reward data as well as preference data. Notably, we show that the DPO loss is a special case of our proposed InfoNCA objective under pairwise preference settings, thereby integrating and extending current alignment theories. By comparing NCA and InfoNCA, we demonstrate that the well-observed decreasing-likelihood trend of DPO/InfoNCA is caused by their focus on adjusting relative likelihood across different responses. In contrast, NCA optimizes the absolute likelihood for each response, thereby effectively preventing the chosen likelihood from decreasing. We evaluate our methods in both reward and preference settings with Mistral-8*7B and 7B models. Experiments suggest that InfoNCA/NCA surpasses various preference baselines when reward datasets are available. We also find NCA significantly outperforms DPO in complex reasoning tasks like math and coding.

Submitted 3 July, 2024; v1 submitted 7 February, 2024; originally announced February 2024.

arXiv:2402.03708 [pdf, other]

SISP: A Benchmark Dataset for Fine-grained Ship Instance Segmentation in Panchromatic Satellite Images

Authors: Pengming Feng, Mingjie Xie, Hongning Liu, Xuanjia Zhao, Guangjun He, Xueliang Zhang, Jian Guan

Abstract: Fine-grained ship instance segmentation in satellite images holds considerable significance for monitoring maritime activities at sea. However, existing datasets often suffer from the scarcity of fine-grained information or pixel-wise localization annotations, as well as the insufficient image diversity and variations, thus limiting the research of this task. To this end, we propose a benchmark dataset for fine-grained Ship Instance Segmentation in Panchromatic satellite images, namely SISP, which contains 56,693 well-annotated ship instances with four fine-grained categories across 10,000 sliced images, and all the images are collected from SuperView-1 satellite with the resolution of 0.5m. Targets in the proposed SISP dataset have characteristics that are consistent with real satellite scenes, such as high class imbalance, various scenes, large variations in target densities and scales, and high inter-class similarity and intra-class diversity, all of which make the SISP dataset more suitable for real-world applications. In addition, we introduce a Dynamic Feature Refinement-assist Instance segmentation network, namely DFRInst, as the benchmark method for ship instance segmentation in satellite images, which can fortify the explicit representation of crucial features, thus improving the performance of ship instance segmentation. Experiments and analysis are performed on the proposed SISP dataset to evaluate the benchmark method and several state-of-the-art methods to establish baselines for facilitating future research. The proposed dataset and source codes will be available at: https://github.com/Justlovesmile/SISP.

Submitted 6 February, 2024; originally announced February 2024.

Comments: 14 pages, 9 figures

arXiv:2402.02170 [pdf, ps, other]

Gravitational losses for the binary systems induced by the next-to-leading spin-orbit coupling effects

Authors: Hao Zhang, Wei Gao, Guansheng He, Siming Liu, Huanyu Jia, Wenbin Lin

Abstract: Compact binary systems lose orbital energy and momentum through gravitational radiation. Based on the mass and mass-current multipole moments of the binary system with the spin vector defined by Bohé et al. [Class. Quantum Grav. 30, 075017 (2013)], we calculate the loss rates of energy, angular momentum, and linear momentum induced by the next-to-leading spin-orbit effects. For the case of a circular orbit, the formulations for these losses are given in terms of the orbital frequency.

Submitted 3 February, 2024; originally announced February 2024.

Comments: 18 pages

arXiv:2402.01548 [pdf, other]

Gravitational lensing of massive particles by a black-bounce-Schwarzschild black hole

Authors: Guansheng He, Yi Xie, Chunhua Jiang, Wenbin Lin

Abstract: We investigate in detail the weak-field gravitational lensing of a relativistic neutral massive particle induced by a regular black-bounce-Schwarzschild black hole proposed by Simpson and Visser. Starting with the calculation of the gravitational deflection of the massive particle up to the third post-Minkowskian order, the Virbhadra-Ellis lens equation is solved perturbatively beyond the weak-deflection limit to achieve the expressions for the lensing observables of the primary and secondary images of a point-like particle source. The main observables contain not only the positions, the flux magnifications, and the gravitational time delays of the individual images, but also the positional relations, the magnification relations (including the total magnification), the magnification centroid, and the differential time delay. We then discuss the velocity-induced effects originated from the deviation of the particle's initial velocity from the speed of light on the black-bounce-Schwarzschild lensing observables of the images of a point-like light source, and the effects induced by the bounce parameter of the spacetime on the measurable image properties of Schwarzschild lensing of the massive particle. As an application of the results, we model the supermassive black hole in the Galactic Center (i.e., Sgr A$^{\ast}$) as the lens, and focus on evaluating the possibilities to detect the new velocity-induced and bounce-induced effects on the practical lensing observables and analyzing the dependence of these effects on the parameters.

Submitted 21 July, 2024; v1 submitted 2 February, 2024; originally announced February 2024.

Comments: Accepted for publication in PRD

arXiv:2401.17583 [pdf, other]

Agile But Safe: Learning Collision-Free High-Speed Legged Locomotion

Authors: Tairan He, Chong Zhang, Wenli Xiao, Guanqi He, Changliu Liu, Guanya Shi

Abstract: Legged robots navigating cluttered environments must be jointly agile for efficient task execution and safe to avoid collisions with obstacles or humans. Existing studies either develop conservative controllers (< 1.0 m/s) to ensure safety, or focus on agility without considering potentially fatal collisions. This paper introduces Agile But Safe (ABS), a learning-based control framework that enables agile and collision-free locomotion for quadrupedal robots. ABS involves an agile policy to execute agile motor skills amidst obstacles and a recovery policy to prevent failures, collaboratively achieving high-speed and collision-free navigation. The policy switch in ABS is governed by a learned control-theoretic reach-avoid value network, which also guides the recovery policy as an objective function, thereby safeguarding the robot in a closed loop. The training process involves the learning of the agile policy, the reach-avoid value network, the recovery policy, and an exteroception representation network, all in simulation. These trained modules can be directly deployed in the real world with onboard sensing and computation, leading to high-speed and collision-free navigation in confined indoor and outdoor spaces with both static and dynamic obstacles.

Submitted 21 May, 2024; v1 submitted 30 January, 2024; originally announced January 2024.

Comments: Published at RSS 2024, Project website: https://agile-but-safe.github.io/

arXiv:2401.16102 [pdf, other]

doi 10.1016/j.energy.2024.132840

Flexible Parallel Neural Network Architecture Model for Early Prediction of Lithium Battery Life

Authors: Lidang Jiang, Zhuoxiang Li, Changyan Hu, Qingsong Huang, Ge He

Abstract: The early prediction of battery life (EPBL) is vital for enhancing the efficiency and extending the lifespan of lithium batteries. Traditional models with fixed architectures often encounter underfitting or overfitting issues due to the diverse data distributions in different EPBL tasks. An interpretable deep learning model of flexible parallel neural network (FPNN) is proposed, which includes an InceptionBlock, a 3D convolutional neural network (CNN), a 2D CNN, and a dual-stream network. The proposed model effectively extracts electrochemical features from video-like formatted data using the 3D CNN and achieves advanced multi-scale feature abstraction through the InceptionBlock. The FPNN can adaptively adjust the number of InceptionBlocks to flexibly handle tasks of varying complexity in EPBL. The test on the MIT dataset shows that the FPNN model achieves outstanding predictive accuracy in EPBL tasks, with MAPEs of 2.47%, 1.29%, 1.08%, and 0.88% when the input cyclic data volumes are 10, 20, 30, and 40, respectively. The interpretability of the FPNN is mainly reflected in its flexible unit structure and parameter selection: its diverse branching structure enables the model to capture features at different scales, thus allowing the machine to learn informative features. The approach presented herein provides an accurate, adaptable, and comprehensible solution for early life prediction of lithium batteries, opening new possibilities in the field of battery health monitoring.

Submitted 29 January, 2024; originally announced January 2024.

arXiv:2401.14781 [pdf, other]

Simulated TEM imaging of a heavily irradiated metal

Authors: D. R. Mason, M. Boleininger, J. Haley, E. Prestat, G. He, F. Hofmann, S. L. Dudarev

Abstract: We recast the Howie-Whelan equations for generating simulated transmission electron microscope (TEM) images, replacing the dependence on local atomic displacements with atomic positions only. This allows very rapid computation of simulated TEM images for arbitrarily complex atomistic configurations of lattice defects and dislocations in the dynamical two-beam approximation. Large-scale massively-overlapping cascade simulations performed with molecular dynamics are used to generate representative high-dose nanoscale irradiation damage in tungsten at room temperature, and we compare the simulated TEM images to experimental TEM images with similar irradiation and imaging conditions. The simulated TEM shows 'white-dot' damage in weak-beam dark-field imaging conditions, in line with our experimental observations and as expected from previous studies, and in bright-field conditions a dislocation network is observed. In this work we can also compare the images to the nanoscale lattice defects in the original atomic structures, and find that at high dose the white spots are not only created by small dislocation loops, but rather arise from nanoscale fluctuations in strains around curved sections of dislocation lines.

Submitted 26 January, 2024; originally announced January 2024.

arXiv:2401.14734 [pdf, other]

Anomalous electron-phonon coupling in kagome ferromagnetic Weyl semimetal Co$_3$Sn$_2$S$_2$

Authors: G. He, M. Kute, Z. C. Xu, L. Peis, R. Stumberger, A. Baum, D. Jost, E. M. Been, B. Moritz, Y. G. Shi, T. P. Devereaux, R. Hackl

Abstract: We present results of a Raman scattering study of the Kagome ferromagnet Co$_3$Sn$_2$S$_2$, with a focus on electronic and phononic excitations and their interplay. In addition, the electronic band structure is analyzed theoretically, enabling a semi-quantitative explanation of the spectra. A prominent feature in the electronic spectra is a redistribution of spectral weight from low to high energies starting at the Curie temperature Tc. The Raman intensity is suppressed below approximately 1000 cm$^{-1}$ and increases above to a peak at 2000 cm$^{-1}$ in all symmetries. Two Raman active phonon modes are identified in A$_{1g}$ and E$_g$ symmetry. The A$_{1g}$ phonon couples strongly to the electronic continuum as indicated by the asymmetric Fano-type line shape. The asymmetry depends non-monotonically on temperature and is maximal close to the magnetic transition. In the limit $T\to 0$ the phonon is nearly symmetric. The evolution of the coupling strength and the electronic continuum as a function of temperature is attributed to a band splitting induced by the ferromagnetic phase transition which substantially reduces the DOS towards $T=0$. The $3d_{z^2}$ electrons of the Co atoms in the crystal field modulated by the A$_{1g}$ phonon are implied to be a critical component contributing to the strong electron-phonon coupling of that phonon. These results allow a comprehensive understanding of the bulk band structure evolution as a function of temperature in Co$_3$Sn$_2$S$_2$, offering key insights for further studies of the driving force behind the long-range magnetic order and novel topological states in this compound.

Submitted 26 January, 2024; originally announced January 2024.

Comments: 9 pages, 4 figures

arXiv:2401.10150 [pdf, other]

Motion-Zero: Zero-Shot Moving Object Control Framework for Diffusion-Based Video Generation

Authors: Changgu Chen, Junwei Shu, Lianggangxu Chen, Gaoqi He, Changbo Wang, Yang Li

Abstract: Recent large-scale pre-trained diffusion models have demonstrated a powerful generative ability to produce high-quality videos from detailed text descriptions. However, exerting control over the motion of objects in videos generated by any video diffusion model is a challenging problem. In this paper, we propose a novel zero-shot moving object trajectory control framework, Motion-Zero, to enable a bounding-box-trajectories-controlled text-to-video diffusion model. To this end, an initial noise prior module is designed to provide a position-based prior to improve the stability of the appearance of the moving object and the accuracy of position. In addition, based on the attention map of the U-net, spatial constraints are directly applied to the denoising process of diffusion models, which further ensures the positional and spatial consistency of moving objects during the inference. Furthermore, temporal consistency is guaranteed with a proposed shift temporal attention mechanism. Our method can be flexibly applied to various state-of-the-art video diffusion models without any training process. Extensive experiments demonstrate our proposed method can control the motion trajectories of objects and generate high-quality videos.

Submitted 21 January, 2024; v1 submitted 18 January, 2024; originally announced January 2024.

Comments: Preprint

arXiv:2401.07369 [pdf, other]

CoVO-MPC: Theoretical Analysis of Sampling-based MPC and Optimal Covariance Design

Authors: Zeji Yi, Chaoyi Pan, Guanqi He, Guannan Qu, Guanya Shi

Abstract: Sampling-based Model Predictive Control (MPC) has been a practical and effective approach in many domains, notably model-based reinforcement learning, thanks to its flexibility and parallelizability. Despite its appealing empirical performance, the theoretical understanding, particularly in terms of convergence analysis and hyperparameter tuning, remains absent. In this paper, we characterize the convergence property of a widely used sampling-based MPC method, Model Predictive Path Integral Control (MPPI). We show that MPPI enjoys at least linear convergence rates when the optimization is quadratic, which covers time-varying LQR systems. We then extend to more general nonlinear systems. Our theoretical analysis directly leads to a novel sampling-based MPC algorithm, CoVariance-Optimal MPC (CoVo-MPC) that optimally schedules the sampling covariance to optimize the convergence rate. Empirically, CoVo-MPC significantly outperforms standard MPPI by 43-54% in both simulations and real-world quadrotor agile control tasks. Videos and Appendices are available at https://lecar-lab.github.io/CoVO-MPC/.

Submitted 14 January, 2024; originally announced January 2024.

Comments: 32 pages, 4 figures

arXiv:2401.06418 [pdf, other]

Manipulating multiple optical parametric processes in photonic topological insulators

Authors: Zhen Jiang, Bo Ji, Yanghe Chen, Chun Jiang, Guangqiang He

Abstract: Topological quantum optics, an emerging area of study, holds the potential to bring about substantial enhancements for integrated quantum devices. Here we propose integrated topological quantum devices performing various functions including optical parametric amplification, frequency division, and frequency-entangled biphoton generation. We show two distinct edge modes corresponding to different frequency ranges in both sandwich kagome and honeycomb topological designs that emulate the quantum valley Hall effect. These two topological edge modes enable two types of optical parametric processes through four-wave mixing, specifically inter-band and intra-band cases. The devices emulating photonic valley-Hall insulators allow the frequency division of two transverse modes, and furthermore, enable the separation of two quantum functionalities: optical parametric amplification and frequency-entangled biphoton state generation. More importantly, the parametric processes are inherently topologically protected, showing robustness against sharp bends and disorder. Our proposal significantly widens the possibilities for robust, multifunctional topological quantum devices on-chip, which may find applications in quantum information processing.

Submitted 12 January, 2024; originally announced January 2024.

Comments: 18 pages, 12 figures

arXiv:2401.02797 [pdf, other]

PeFoMed: Parameter Efficient Fine-tuning of Multimodal Large Language Models for Medical Imaging

Authors: Gang Liu, Jinlong He, Pengfei Li, Genrong He, Zhaolin Chen, Shenjun Zhong

Abstract: Multimodal large language models (MLLMs) represent an evolutionary expansion in the capabilities of traditional large language models, enabling them to tackle challenges that surpass the scope of purely text-based applications. They leverage the knowledge previously encoded within these language models, thereby enhancing their applicability and functionality in the realm of multimodal contexts. Recent works investigate the adaptation of MLLMs as a universal solution to address medical multi-modal problems as a generative task. In this paper, we propose a parameter efficient framework for fine-tuning MLLMs, specifically validated on medical visual question answering (Med-VQA) and medical report generation (MRG) tasks, using public benchmark datasets. We also introduce an evaluation metric using the 5-point Likert scale and its weighted average value to measure the quality of the generated reports for MRG tasks, where the scale ratings are labelled by both humans manually and the GPT-4 model. We further assess the consistency of performance metrics across traditional measures, GPT-4, and human ratings for both VQA and MRG tasks. The results indicate that semantic similarity assessments using GPT-4 align closely with human annotators and provide greater stability, yet they reveal a discrepancy when compared to conventional lexical similarity measurements. This questions the reliability of lexical similarity metrics for evaluating the performance of generative models in Med-VQA and report generation tasks. Besides, our fine-tuned model significantly outperforms GPT-4v. This indicates that without additional fine-tuning, multi-modal models like GPT-4v do not perform effectively on medical imaging tasks. The code will be available here: https://github.com/jinlHe/PeFoMed.

Submitted 16 April, 2024; v1 submitted 5 January, 2024; originally announced January 2024.

Comments: 12 pages, 8 figures, 12 tables

arXiv:2312.09556 [pdf, other]

Optical Ranging Using Coherent Kerr Soliton Dual-microcombs with Extended Ambiguity Distance

Authors: Yuechen Yang, Yang Shen, Kailu Zhou, Chenhua Hu, Yuanzhuo Ding, Tinghao Jiang, Wei Li, Yudong Li, Liangsen Feng, Tengfei Wu, Guangqiang He

Abstract: Optical ranging is a key technology in metrology. Optical frequency combs have been shown to provide several advantages in light ranging, offering high precision with a high acquisition rate. However, the performance of traditional ranging systems based on microcombs is limited by the short ambiguity distance and non-real-time processing. Here, we show that a dual-comb ranging system using coherent Kerr soliton microcombs and an optical switch realizes an extended ambiguity distance and provides a route to real-time processing. The ambiguity distance is extended to 3.28 m from about 1.5 mm and the uncertainty reaches about $1.05 \times 10^{-7}$, while the system is compatible with low-bandwidth detectors. Combining coherent microcomb ranging systems with a special FPGA could enable comb-based real-time ranging systems for several applications such as industrial process monitoring.

Submitted 15 December, 2023; originally announced December 2023.

Comments: 9 pages, 5 figures

arXiv:2312.08200 [pdf, other]

SPD-DDPM: Denoising Diffusion Probabilistic Models in the Symmetric Positive Definite Space

Authors: Yunchen Li, Zhou Yu, Gaoqi He, Yunhang Shen, Ke Li, Xing Sun, Shaohui Lin

Abstract: Symmetric positive definite (SPD) matrices have shown important value and applications in statistics and machine learning, such as fMRI analysis and traffic prediction. Previous works on SPD matrices mostly focus on discriminative models, where predictions are made directly on $E(X|y)$, where $y$ is a vector and $X$ is an SPD matrix. However, these methods struggle to handle large-scale data, as they need to access and process the whole data. In this paper, inspired by the denoising diffusion probabilistic model (DDPM), we propose a novel generative model, termed SPD-DDPM, by introducing Gaussian distribution in the SPD space to estimate $E(X|y)$. Moreover, our model is able to estimate $p(X)$ unconditionally and flexibly without giving $y$. On the one hand, the model conditionally learns $p(X|y)$ and utilizes the mean of samples to obtain $E(X|y)$ as a prediction. On the other hand, the model unconditionally learns the probability distribution of the data $p(X)$ and generates samples that conform to this distribution. Furthermore, we propose a new SPD net which is much deeper than the previous networks and allows for the inclusion of conditional factors. Experiment results on toy data and real taxi data demonstrate that our models effectively fit the data distribution both unconditionally and conditionally and provide accurate predictions.

Submitted 13 December, 2023; originally announced December 2023.

Comments: AAAI2024

arXiv:2312.06575 [pdf, other]

doi 10.1145/3610543.3626173

EasyVolcap: Accelerating Neural Volumetric Video Research

Authors: Zhen Xu, Tao Xie, Sida Peng, Haotong Lin, Qing Shuai, Zhiyuan Yu, Guangzhao He, Jiaming Sun, Hujun Bao, Xiaowei Zhou

Abstract: Volumetric video is a technology that digitally records dynamic events such as artistic performances, sporting events, and remote conversations. When acquired, such volumography can be viewed from any viewpoint and timestamp on flat screens, 3D displays, or VR headsets, enabling immersive viewing experiences and more flexible content creation in a variety of applications such as sports broadcasting, video conferencing, gaming, and movie productions. With the recent advances and fast-growing interest in neural scene representations for volumetric video, there is an urgent need for a unified open-source library to streamline the process of volumetric video capturing, reconstruction, and rendering for both researchers and non-professional users to develop various algorithms and applications of this emerging technology. In this paper, we present EasyVolcap, a Python & PyTorch library for accelerating neural volumetric video research with the goal of unifying the process of multi-view data processing, 4D scene reconstruction, and efficient dynamic volumetric video rendering. Our source code is available at https://github.com/zju3dv/EasyVolcap.

Submitted 11 December, 2023; originally announced December 2023.

Comments: SIGGRAPH Asia 2023 Technical Communications. Source code: https://github.com/zju3dv/EasyVolcap

arXiv:2312.06550 [pdf, other]

LLM360: Towards Fully Transparent Open-Source LLMs

Authors: Zhengzhong Liu, Aurick Qiao, Willie Neiswanger, Hongyi Wang, Bowen Tan, Tianhua Tao, Junbo Li, Yuqi Wang, Suqi Sun, Omkar Pangarkar, Richard Fan, Yi Gu, Victor Miller, Yonghao Zhuang, Guowei He, Haonan Li, Fajri Koto, Liping Tang, Nikhil Ranjan, Zhiqiang Shen, Xuguang Ren, Roberto Iriondo, Cun Mu, Zhiting Hu, Mark Schulze, et al. (3 additional authors not shown)

Abstract: The recent surge in open-source Large Language Models (LLMs), such as LLaMA, Falcon, and Mistral, provides diverse options for AI practitioners and researchers. However, most LLMs have only released partial artifacts, such as the final model weights or inference code, and technical reports increasingly limit their scope to high-level design choices and surface statistics. These choices hinder progress in the field by degrading transparency into the training of LLMs and forcing teams to rediscover many details in the training process. We present LLM360, an initiative to fully open-source LLMs, which advocates for all training code and data, model checkpoints, and intermediate results to be made available to the community. The goal of LLM360 is to support open and collaborative AI research by making the end-to-end LLM training process transparent and reproducible by everyone. As a first step of LLM360, we release two 7B parameter LLMs pre-trained from scratch, Amber and CrystalCoder, including their training code, data, intermediate checkpoints, and analyses (at https://www.llm360.ai). We are committed to continually pushing the boundaries of LLMs through this open-source effort. More large-scale and stronger models are underway and will be released in the future.

Submitted 11 December, 2023; originally announced December 2023.

arXiv:2312.03491 [pdf, other]

Schrodinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis

Authors: Zehua Chen, Guande He, Kaiwen Zheng, Xu Tan, Jun Zhu

Abstract: In text-to-speech (TTS) synthesis, diffusion models have achieved promising generation quality. However, because of the pre-defined data-to-noise diffusion process, their prior distribution is restricted to a noisy representation, which provides little information of the generation target. In this work, we present a novel TTS system, Bridge-TTS, making the first attempt to substitute the noisy Gaussian prior in established diffusion-based TTS methods with a clean and deterministic one, which provides strong structural information of the target. Specifically, we leverage the latent representation obtained from text input as our prior, and build a fully tractable Schrodinger bridge between it and the ground-truth mel-spectrogram, leading to a data-to-data process. Moreover, the tractability and flexibility of our formulation allow us to empirically study the design spaces such as noise schedules, as well as to develop stochastic and deterministic samplers. Experimental results on the LJ-Speech dataset illustrate the effectiveness of our method in terms of both synthesis quality and sampling efficiency, significantly outperforming our diffusion counterpart Grad-TTS in 50-step/1000-step synthesis and strong fast TTS models in few-step scenarios. Project page: https://bridge-tts.github.io/

Submitted 6 December, 2023; originally announced December 2023.

arXiv:2311.11302 [pdf, other]

doi 10.1109/TGRS.2023.3327780

Exchanging Dual Encoder-Decoder: A New Strategy for Change Detection with Semantic Guidance and Spatial Localization

Authors: Sijie Zhao, Xueliang Zhang, Pengfeng Xiao, Guangjun He

Abstract: Change detection is a critical task in earth observation applications. Recently, deep learning-based methods have shown promising performance and have been quickly adopted in change detection. However, the widely used multiple encoder and single decoder (MESD) as well as dual encoder-decoder (DED) architectures still struggle to handle change detection effectively. The former has problems of bitemporal feature interference in the feature-level fusion, while the latter is inapplicable to intraclass change detection and multiview building change detection. To solve these problems, we propose a new strategy with an exchanging dual encoder-decoder structure for binary change detection with semantic guidance and spatial localization. The proposed strategy solves the problems of bitemporal feature interference in MESD by fusing bitemporal features in the decision level and the inapplicability in DED by determining changed areas using bitemporal semantic features. We build a binary change detection model based on this strategy, and then validate and compare it with 18 state-of-the-art change detection methods on six datasets in three scenarios, including intraclass change detection datasets (CDD, SYSU), single-view building change detection datasets (WHU, LEVIR-CD, LEVIR-CD+) and a multiview building change detection dataset (NJDS). The experimental results demonstrate that our model achieves superior performance with high efficiency and outperforms all benchmark methods with F1-scores of 97.77%, 83.07%, 94.86%, 92.33%, 91.39%, 74.35% on CDD, SYSU, WHU, LEVIR-CD, LEVIR-CD+, and NJDS datasets, respectively. The code of this work will be available at https://github.com/NJU-LHRS/official-SGSLN.

Submitted 19 November, 2023; originally announced November 2023.

Journal ref: IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1-16, 2023, Art no. 4508016

arXiv:2311.09262 [pdf, other]

Disentangling the Potential Impacts of Papers into Diffusion, Conformity, and Contribution Values

Authors: Zhikai Xue, Guoxiu He, Zhuoren Jiang, Sichen Gu, Yangyang Kang, Star Zhao, Wei Lu

Abstract: The potential impact of an academic paper is determined by various factors, including its popularity and contribution. Existing models usually estimate original citation counts based on static graphs and fail to differentiate values from nuanced perspectives. In this study, we propose a novel graph neural network to Disentangle the Potential impacts of Papers into Diffusion, Conformity, and Contribution values (called DPPDCC). Given a target paper, DPPDCC encodes temporal and structural features within the constructed dynamic heterogeneous graph. Particularly, to capture the knowledge flow, we emphasize the importance of comparative and co-cited/citing information between papers and aggregate snapshots evolutionarily. To unravel popularity, we contrast augmented graphs to extract the essence of diffusion and predict the accumulated citation binning to model conformity. We further apply orthogonal constraints to encourage distinct modeling of each perspective and preserve the inherent value of contribution. To evaluate models' generalization for papers published at various times, we reformulate the problem by partitioning data based on specific time points to mirror real-world conditions. Extensive experimental results on three datasets demonstrate that DPPDCC significantly outperforms baselines for previously, freshly, and immediately published papers. Further analyses confirm its robust capabilities. We will make our datasets and codes publicly available.

Submitted 21 May, 2024; v1 submitted 15 November, 2023; originally announced November 2023.

Comments: Update and correct some references. This paper is still in progress

arXiv:2310.15629

On-chip topological transport of optical frequency combs in silicon-based valley photonic crystals

Authors: Zhen Jiang, Hongwei Wang, Yuechen Yang, Yang Shen, Bo Ji, Yanghe Chen, Yong Zhang, Lu Sun, Zheng Wang, Chun Jiang, Yikai Su, Guangqiang He

Abstract: The generation and control of optical frequency combs in integrated photonic systems enables complex, high-controllable, and large-scale devices. In parallel, harnessing topological physics in multipartite systems has allowed them with compelling features such as robustness against fabrication imperfections. Here we experimentally demonstrate on-chip topological transport for optical frequency combs at telecommunication wavelengths, both in classical and nonclassical domains. We access both the quantum frequency combs and dissipative Kerr soliton combs with a micro-resonator. The quantum frequency comb, that is, a coherent superposition of multiple frequency modes, is proven to be a frequency-entangled qudit state. We also show that dissipative Kerr soliton combs are highly coherent and mode-locked due to the collective coherence or self-organization of solitons. Moreover, the valley kink states allow both quantum frequency combs and dissipative Kerr soliton combs with robustness against sharp bends. Our topologically protected optical frequency combs could enable the inherent robustness in integrated complex photonic systems.

Submitted 24 October, 2023; originally announced October 2023.

Comments: 20 pages, 12 figures

arXiv:2310.14718

Rethinking Scale Imbalance in Semi-supervised Object Detection for Aerial Images

Authors: Ruixiang Zhang, Chang Xu, Fang Xu, Wen Yang, Guangjun He, Huai Yu, Gui-Song Xia

Abstract: This paper focuses on the scale imbalance problem of semi-supervised object detection (SSOD) in aerial images. Compared to natural images, objects in aerial images show smaller sizes and larger quantities per image, increasing the difficulty of manual annotation. Meanwhile, the advanced SSOD technique can train superior detectors by leveraging limited labeled data and massive unlabeled data, saving annotation costs. However, as an understudied task in aerial images, SSOD suffers from a drastic performance drop when facing a large proportion of small objects. By analyzing the predictions between small and large objects, we identify three imbalance issues caused by the scale bias, i.e., pseudo-label imbalance, label assignment imbalance, and negative learning imbalance. To tackle these issues, we propose a novel Scale-discriminative Semi-Supervised Object Detection (S^3OD) learning pipeline for aerial images. In our S^3OD, three key components, Size-aware Adaptive Thresholding (SAT), Size-rebalanced Label Assignment (SLA), and Teacher-guided Negative Learning (TNL), are proposed to warrant scale unbiased learning. Specifically, SAT adaptively selects appropriate thresholds to filter pseudo-labels for objects at different scales. SLA balances positive samples of objects at different scales through resampling and reweighting. TNL alleviates the imbalance in negative samples by leveraging information generated by a teacher model. Extensive experiments conducted on the DOTA-v1.5 benchmark demonstrate the superiority of our proposed methods over state-of-the-art competitors. Codes will be released soon.

Submitted 23 October, 2023; originally announced October 2023.

Showing 1–50 of 304 results for author: He, G