Skip to main content

Showing 1–50 of 694 results for author: Zhang, S

Searching in archive eess. Search in all archives.
.
  1. arXiv:2408.17431  [pdf, other

    eess.AS cs.AI

    Advancing Multi-talker ASR Performance with Large Language Models

    Authors: Mohan Shi, Zengrui Jin, Yaoxun Xu, Yong Xu, Shi-Xiong Zhang, Kun Wei, Yiwen Shao, Chunlei Zhang, Dong Yu

    Abstract: Recognizing overlapping speech from multiple speakers in conversational scenarios is one of the most challenging problem for automatic speech recognition (ASR). Serialized output training (SOT) is a classic method to address multi-talker ASR, with the idea of concatenating transcriptions from multiple speakers according to the emission times of their speech for training. However, SOT-style transcr… ▽ More

    Submitted 30 August, 2024; originally announced August 2024.

    Comments: 8 pages, accepted by IEEE SLT 2024

  2. arXiv:2408.15252  [pdf, other

    eess.SP cs.AI

    Generative AI on SpectrumNet: An Open Benchmark of Multiband 3D Radio Maps

    Authors: Shuhang Zhang, Shuai Jiang, Wanjie Lin, Zheng Fang, Kangjun Liu, Hongliang Zhang, Ke Chen

    Abstract: Radio map is an efficient demonstration for visually displaying the wireless signal coverage within a certain region. It has been considered to be increasingly helpful for the future sixth generation (6G) of wireless networks, as wireless nodes are becoming more crowded and complicated. However, the construction of high resolution radio map is very challenging due to the sparse sampling in practic… ▽ More

    Submitted 9 August, 2024; originally announced August 2024.

    Comments: 30 pages, 15 figures

  3. arXiv:2408.13782  [pdf

    eess.IV cs.CV physics.optics

    Batch-FPM: Random batch-update multi-parameter physical Fourier ptychography neural network

    Authors: Ruiqing Sun, Delong Yang, Yiyan Su, Shaohui Zhang, Qun Hao

    Abstract: Fourier Ptychographic Microscopy (FPM) is a computational imaging technique that enables high-resolution imaging over a large field of view. However, its application in the biomedical field has been limited due to the long image reconstruction time and poor noise robustness. In this paper, we propose a fast and robust FPM reconstruction method based on physical neural networks with batch update st… ▽ More

    Submitted 25 August, 2024; originally announced August 2024.

  4. arXiv:2408.07325  [pdf, other

    eess.IV cs.GR

    RoCoSDF: Row-Column Scanned Neural Signed Distance Fields for Freehand 3D Ultrasound Imaging Shape Reconstruction

    Authors: Hongbo Chen, Yuchong Gao, Shuhang Zhang, Jiangjie Wu, Yuexin Ma, Rui Zheng

    Abstract: The reconstruction of high-quality shape geometry is crucial for developing freehand 3D ultrasound imaging. However, the shape reconstruction of multi-view ultrasound data remains challenging due to the elevation distortion caused by thick transducer probes. In this paper, we present a novel learning-based framework RoCoSDF, which can effectively generate an implicit surface through continuous sha… ▽ More

    Submitted 14 August, 2024; originally announced August 2024.

    Comments: Accepted by MICCAI 2024

  5. arXiv:2408.05319  [pdf, ps, other

    eess.SY

    Learning-based Parameterized Barrier Function for Safety-Critical Control of Unknown Systems

    Authors: Sihua Zhang, Di-Hua Zhai, Xiaobing Dai, Tzu-yuan Huang, Yuanqing Xia, Sandra Hirche

    Abstract: With the increasing complexity of real-world systems and varying environmental uncertainties, it is difficult to build an accurate dynamic model, which poses challenges especially for safety-critical control. In this paper, a learning-based control policy is proposed to ensure the safety of systems with unknown disturbances through control barrier functions (CBFs). First, the disturbance is predic… ▽ More

    Submitted 9 August, 2024; originally announced August 2024.

  6. arXiv:2408.05117  [pdf, other

    eess.IV cs.AI cs.CV

    Beyond the Eye: A Relational Model for Early Dementia Detection Using Retinal OCTA Images

    Authors: Shouyue Liu, Jinkui Hao, Yonghuai Liu, Huazhu Fu, Xinyu Guo, Shuting Zhang, Yitian Zhao

    Abstract: Early detection of dementia, such as Alzheimer's disease (AD) or mild cognitive impairment (MCI), is essential to enable timely intervention and potential treatment. Accurate detection of AD/MCI is challenging due to the high complexity, cost, and often invasive nature of current diagnostic techniques, which limit their suitability for large-scale population screening. Given the shared embryologic… ▽ More

    Submitted 9 August, 2024; originally announced August 2024.

  7. arXiv:2408.04927  [pdf, other

    cs.NI eess.SP

    Large Models for Aerial Edges: An Edge-Cloud Model Evolution and Communication Paradigm

    Authors: Shuhang Zhang, Qingyu Liu, Ke Chen, Boya Di, Hongliang Zhang, Wenhan Yang, Dusit Niyato, Zhu Han, H. Vincent Poor

    Abstract: The future sixth-generation (6G) of wireless networks is expected to surpass its predecessors by offering ubiquitous coverage through integrated air-ground facility deployments in both communication and computing domains. In this network, aerial facilities, such as unmanned aerial vehicles (UAVs), conduct artificial intelligence (AI) computations based on multi-modal data to support diverse applic… ▽ More

    Submitted 9 August, 2024; originally announced August 2024.

  8. arXiv:2408.03361  [pdf, other

    eess.IV cs.CV

    GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI

    Authors: Pengcheng Chen, Jin Ye, Guoan Wang, Yanjun Li, Zhongying Deng, Wei Li, Tianbin Li, Haodong Duan, Ziyan Huang, Yanzhou Su, Benyou Wang, Shaoting Zhang, Bin Fu, Jianfei Cai, Bohan Zhuang, Eric J Seibel, Junjun He, Yu Qiao

    Abstract: Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals, and can be applied in various fields. In the medical field, LVLMs have a high potential to offer substantial assistance for diagnosis and treatment. Before that, it is crucial to develop benchmarks to evaluate LVLMs' effectiveness in various medical applications. Curren… ▽ More

    Submitted 9 August, 2024; v1 submitted 6 August, 2024; originally announced August 2024.

  9. arXiv:2408.03174  [pdf, ps, other

    eess.SP cs.IT

    Joint Transmission and Compression Design for 6G Networked Sensing with Limited-Capacity Backhaul

    Authors: Weifeng Zhu, Shuowen Zhang, Liang Liu

    Abstract: This paper considers networked sensing in cellular network, where multiple base stations (BSs) first compress their received echo signals from multiple targets and then forward the quantized signals to the cloud via limited-capacity backhaul links, such that the cloud can leverage all useful echo signals to perform high-resolution localization. Under this setup, we manage to characterize the poste… ▽ More

    Submitted 6 August, 2024; originally announced August 2024.

    Comments: accepted by IEEE Globecom 2025

  10. arXiv:2408.03055  [pdf, other

    eess.SP

    FDA Jamming Against Airborne Phased-MIMO Radar-Part II: Jamming STAP Performance Analysis

    Authors: Yan Sun, Wen-qin Wang, Zhou He, Shunsheng Zhang

    Abstract: The first part of this series introduced the effectiveness of frequency diverse array (FDA) jamming through direct wave propagation in countering airborne phased multiple-input multiple-output (Phased-MIMO) radar. This part focuses on the effectiveness of FDA scattered wave (FDA-SW) jamming on the space-time adaptive processing (STAP) for airborne phased-MIMO radar. Distinguished from the clutter… ▽ More

    Submitted 6 August, 2024; originally announced August 2024.

  11. arXiv:2408.03050  [pdf, other

    eess.SP

    FDA Jamming Against Airborne Phased-MIMO Radar-Part I: Matched Filtering and Spatial Filtering

    Authors: Yan Sun, Wen-qin Wang, Zhou He, Shunsheng Zhang

    Abstract: Phased multiple-input multiple-output (Phased-MIMO) radar has received increasing attention for enjoying the advantages of waveform diversity and range-dependency from frequency diverse array MIMO (FDA-MIMO) radar without sacrificing coherent processing gain through partitioning transmit subarray. This two-part series proposes a framework of electronic countermeasures (ECM) inspired by frequency d… ▽ More

    Submitted 6 August, 2024; originally announced August 2024.

  12. arXiv:2408.03045  [pdf, other

    eess.SP

    Coherent FDA Radar: Transmitter and Receiver Design and Analysis

    Authors: Yan Sun, Ming-jie Jia, Wen-qin Wang, Maria Sabrina Greco, Fulvio Gini, Shunsheng Zhang

    Abstract: The combination of frequency diverse array (FDA) radar technology with the multiple input multiple output (MIMO) radar architecture and waveform diversity techniques potentially promises a high integration gain with respect to conventional phased array (PA) radars. In this paper, we propose an approach to the design of the transmitter and the receiver of a coherent FDA (C-FDA) radar, that enables… ▽ More

    Submitted 6 August, 2024; originally announced August 2024.

  13. arXiv:2408.00379  [pdf, ps, other

    eess.SP cs.IT

    Finding Defective Elements in Intelligent Reflecting Surface via Over-the-Air Measurements

    Authors: Ziyi Zhao, Zhaorui Wang, Shuowen Zhang, Liang Liu

    Abstract: Due to circuit failures, defective elements that cannot adaptively adjust the phase shifts of their impinging signals in a desired manner may exist on an intelligent reflecting surface (IRS). Traditional way to find these defective IRS elements requires a thorough diagnosis of all the circuits belonging to a huge number of IRS elements, which is practically challenging. In this paper, we will devi… ▽ More

    Submitted 1 August, 2024; originally announced August 2024.

    Comments: accepted by 2024 IEEE Globecom

  14. arXiv:2407.20962  [pdf, other

    cs.CV cs.MM cs.SD eess.AS

    MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions

    Authors: Xiaowei Chi, Yatian Wang, Aosong Cheng, Pengjun Fang, Zeyue Tian, Yingqing He, Zhaoyang Liu, Xingqun Qi, Jiahao Pan, Rongyu Zhang, Mengfei Li, Ruibin Yuan, Yanbing Jiang, Wei Xue, Wenhan Luo, Qifeng Chen, Shanghang Zhang, Qifeng Liu, Yike Guo

    Abstract: Massive multi-modality datasets play a significant role in facilitating the success of large video-language models. However, current video-language datasets primarily provide text descriptions for visual frames, considering audio to be weakly related information. They usually overlook exploring the potential of inherent audio-visual correlation, leading to monotonous annotation within each modalit… ▽ More

    Submitted 6 August, 2024; v1 submitted 30 July, 2024; originally announced July 2024.

    Comments: 15 Pages. Dataset report

  15. arXiv:2407.17691  [pdf, other

    cs.NI eess.SY

    System-Level Simulation Framework for NB-IoT: Key Features and Performance Evaluation

    Authors: Shutao Zhang, Wenkun Wen, Peiran Wu, Hongqing Huang, Liya Zhu, Yijia Guo, Tingting Yang, Minghua Xia

    Abstract: Narrowband Internet of Things (NB-IoT) is a technology specifically designated by the 3rd Generation Partnership Project (3GPP) to meet the explosive demand for massive machine-type communications (mMTC), and it is evolving to RedCap. Industrial companies have increasingly adopted NB-IoT as the solution for mMTC due to its lightweight design and comprehensive technical specifications released by 3… ▽ More

    Submitted 13 August, 2024; v1 submitted 24 July, 2024; originally announced July 2024.

  16. arXiv:2407.13377  [pdf, other

    cs.CL cs.AI eess.AS

    Linear-Complexity Self-Supervised Learning for Speech Processing

    Authors: Shucong Zhang, Titouan Parcollet, Rogier van Dalen, Sourav Bhattacharya

    Abstract: Self-supervised learning (SSL) models usually require weeks of pre-training with dozens of high-end GPUs. These models typically have a multi-headed self-attention (MHSA) context encoder. However, MHSA takes quadratic time and space in the input length, contributing to the high pre-training cost. Linear-complexity alternatives to MHSA have been proposed. For instance, in supervised training, the S… ▽ More

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: Interspeech 2024

  17. arXiv:2407.13092  [pdf, other

    eess.IV cs.CV

    CC-DCNet: Dynamic Convolutional Neural Network with Contrastive Constraints for Identifying Lung Cancer Subtypes on Multi-modality Images

    Authors: Yuan Jin, Gege Ma, Geng Chen, Tianling Lyu, Jan Egger, Junhui Lyu, Shaoting Zhang, Wentao Zhu

    Abstract: The accurate diagnosis of pathological subtypes of lung cancer is of paramount importance for follow-up treatments and prognosis managements. Assessment methods utilizing deep learning technologies have introduced novel approaches for clinical diagnosis. However, the majority of existing models rely solely on single-modality image input, leading to limited diagnostic accuracy. To this end, we prop… ▽ More

    Submitted 17 July, 2024; originally announced July 2024.

  18. arXiv:2407.11510  [pdf, other

    eess.AS

    VoxBlink2: A 100K+ Speaker Recognition Corpus and the Open-Set Speaker-Identification Benchmark

    Authors: Yuke Lin, Ming Cheng, Fulin Zhang, Yingying Gao, Shilei Zhang, Ming Li

    Abstract: In this paper, we provide a large audio-visual speaker recognition dataset, VoxBlink2, which includes approximately 10M utterances with videos from 110K+ speakers in the wild. This dataset represents a significant expansion over the VoxBlink dataset, encompassing a broader diversity of speakers and scenarios by the grace of an optimized data collection pipeline. Afterward, we explore the impact of… ▽ More

    Submitted 16 July, 2024; originally announced July 2024.

    Comments: Accepted By InterSpeech2024

  19. arXiv:2407.07306  [pdf

    physics.med-ph eess.SY

    Electrical Impedance Tomography Based Closed-loop Tumor Treating Fields in Dynamic Lung Tumors

    Authors: Minmin Wang, Xu Xie, Yuxi Guo, Liying Zhu, Yue Lan, Haitang Yang, Yun Pan, Guangdi Chen, Shaomin Zhang, Maomao Zhang

    Abstract: Tumor Treating Fields (TTFields) is a non-invasive anticancer modality that utilizes alternating electric fields to disrupt cancer cell division and growth. While generally well-tolerated with minimal side effects, traditional TTFields therapy for lung tumors faces challenges due to the influence of respiratory motion. We design a novel closed-loop TTFields strategy for lung tumors by incorporatin… ▽ More

    Submitted 9 July, 2024; originally announced July 2024.

    Comments: 7 pages, 5 figures

  20. arXiv:2407.06772  [pdf, other

    cs.IT eess.SP

    Revealing the evanescent components in Kronecker-product based codebooks: insights and implications

    Authors: Jun Yang, Yijian Chen, Yunqi Sun, Yuan Si, Hongkang Yu, Shujuan Zhang, Zhaohua Lu

    Abstract: The orthogonal bases of discrete Fourier transform (DFT) has been recognized as the standard spatial-domain bases for Type I, Type II and enhanced Type II codewords by the 3rd Generation Partnership Project (3GPP). For uniform planar arrays, these spatial-domain bases are derived as the Kronecker product of one-dimensional DFT bases. Theoretically, each spatial basis corresponds to a beam directed… ▽ More

    Submitted 9 July, 2024; originally announced July 2024.

    Comments: 11 pages, 9 figures

  21. arXiv:2407.05421  [pdf, other

    eess.AS cs.SD

    ASRRL-TTS: Agile Speaker Representation Reinforcement Learning for Text-to-Speech Speaker Adaptation

    Authors: Ruibo Fu, Xin Qi, Zhengqi Wen, Jianhua Tao, Tao Wang, Chunyu Qiang, Zhiyong Wang, Yi Lu, Xiaopeng Wang, Shuchen Shi, Yukun Liu, Xuefei Liu, Shuai Zhang

    Abstract: Speaker adaptation, which involves cloning voices from unseen speakers in the Text-to-Speech task, has garnered significant interest due to its numerous applications in multi-media fields. Despite recent advancements, existing methods often struggle with inadequate speaker representation accuracy and overfitting, particularly in limited reference speeches scenarios. To address these challenges, we… ▽ More

    Submitted 7 July, 2024; originally announced July 2024.

    Comments: The audio demo is available at https://7xin.github.io/ASRRL/

  22. arXiv:2407.05407  [pdf, other

    cs.SD cs.AI eess.AS

    CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

    Authors: Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, Zhifu Gao, Zhijie Yan

    Abstract: Recent years have witnessed a trend that large language model (LLM) based text-to-speech (TTS) emerges into the mainstream due to their high naturalness and zero-shot capacity. In this paradigm, speech signals are discretized into token sequences, which are modeled by an LLM with text as prompts and reconstructed by a token-based vocoder to waveforms. Obviously, speech tokens play a critical role… ▽ More

    Submitted 9 July, 2024; v1 submitted 7 July, 2024; originally announced July 2024.

    Comments: work in progress. arXiv admin note: substantial text overlap with arXiv:2407.04051

  23. arXiv:2407.04051  [pdf, other

    cs.SD cs.AI eess.AS

    FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

    Authors: Keyu An, Qian Chen, Chong Deng, Zhihao Du, Changfeng Gao, Zhifu Gao, Yue Gu, Ting He, Hangrui Hu, Kai Hu, Shengpeng Ji, Yabin Li, Zerui Li, Heng Lu, Haoneng Luo, Xiang Lv, Bin Ma, Ziyang Ma, Chongjia Ni, Changhe Song, Jiaqi Shi, Xian Shi, Hao Wang, Wen Wang, Yuxuan Wang , et al. (8 additional authors not shown)

    Abstract: This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, sp… ▽ More

    Submitted 10 July, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

    Comments: Work in progress. Authors are listed in alphabetical order by family name

  24. arXiv:2407.03992  [pdf, other

    eess.IV

    Performance of Medical Image Fusion in High-level Analysis Tasks: A Mutual Enhancement Framework for Unaligned PAT and MRI Image Fusion

    Authors: Yutian Zhong, Jinchuan He, Zhichao Liang, Shuangyang Zhang, Qianjin Feng, Wufan Chen, Li Qi

    Abstract: Photoacoustic tomography (PAT) offers optical contrast, whereas magnetic resonance imaging (MRI) excels in imaging soft tissue and organ anatomy. The fusion of PAT with MRI holds promising application prospects due to their complementary advantages. Existing image fusion have made considerable progress in pre-registered images, yet spatial deformations are difficult to avoid in medical imaging sce… ▽ More

    Submitted 4 July, 2024; originally announced July 2024.

  25. arXiv:2407.03542  [pdf

    eess.IV cs.CV cs.LG

    Probing Perfection: The Relentless Art of Meddling for Pulmonary Airway Segmentation from HRCT via a Human-AI Collaboration Based Active Learning Method

    Authors: Shiyi Wang, Yang Nan, Sheng Zhang, Federico Felder, Xiaodan Xing, Yingying Fang, Javier Del Ser, Simon L F Walsh, Guang Yang

    Abstract: In pulmonary tracheal segmentation, the scarcity of annotated data is a prevalent issue in medical segmentation. Additionally, Deep Learning (DL) methods face challenges: the opacity of 'black box' models and the need for performance enhancement. Our Human-Computer Interaction (HCI) based models (RS_UNet, LC_UNet, UUNet, and WD_UNet) address these challenges by combining diverse query strategies w… ▽ More

    Submitted 23 July, 2024; v1 submitted 3 July, 2024; originally announced July 2024.

  26. arXiv:2407.00463  [pdf, other

    cs.LG cs.AI cs.CL cs.HC eess.AS

    Open-Source Conversational AI with SpeechBrain 1.0

    Authors: Mirco Ravanelli, Titouan Parcollet, Adel Moumen, Sylvain de Langen, Cem Subakan, Peter Plantinga, Yingzhi Wang, Pooneh Mousavi, Luca Della Libera, Artem Ploujnikov, Francesco Paissan, Davide Borra, Salah Zaiem, Zeyu Zhao, Shucong Zhang, Georgios Karakasidis, Sung-Lin Yeh, Pierre Champion, Aku Rouhe, Rudolf Braun, Florian Mai, Juan Zuluaga-Gomez, Seyed Mahed Mousavi, Andreas Nautsch, Xuechen Liu , et al. (7 additional authors not shown)

    Abstract: SpeechBrain is an open-source Conversational AI toolkit based on PyTorch, focused particularly on speech processing tasks such as speech recognition, speech enhancement, speaker recognition, text-to-speech, and much more. It promotes transparency and replicability by releasing both the pre-trained models and the complete "recipes" of code and algorithms required for training them. This paper prese… ▽ More

    Submitted 18 July, 2024; v1 submitted 29 June, 2024; originally announced July 2024.

    Comments: Submitted to JMLR (Machine Learning Open Source Software)

  27. arXiv:2406.19311  [pdf, other

    cs.CR cs.SD eess.AS

    Zero-Query Adversarial Attack on Black-box Automatic Speech Recognition Systems

    Authors: Zheng Fang, Tao Wang, Lingchen Zhao, Shenyi Zhang, Bowen Li, Yunjie Ge, Qi Li, Chao Shen, Qian Wang

    Abstract: In recent years, extensive research has been conducted on the vulnerability of ASR systems, revealing that black-box adversarial example attacks pose significant threats to real-world ASR systems. However, most existing black-box attacks rely on queries to the target ASRs, which is impractical when queries are not permitted. In this paper, we propose ZQ-Attack, a transfer-based adversarial attack… ▽ More

    Submitted 27 June, 2024; originally announced June 2024.

    Comments: To appear in the Proceedings of The ACM Conference on Computer and Communications Security (CCS), 2024

  28. arXiv:2406.18345  [pdf, other

    cs.LG eess.SP

    EmT: A Novel Transformer for Generalized Cross-subject EEG Emotion Recognition

    Authors: Yi Ding, Chengxuan Tong, Shuailei Zhang, Muyun Jiang, Yong Li, Kevin Lim Jun Liang, Cuntai Guan

    Abstract: Integrating prior knowledge of neurophysiology into neural network architecture enhances the performance of emotion decoding. While numerous techniques emphasize learning spatial and short-term temporal patterns, there has been limited emphasis on capturing the vital long-term contextual information associated with emotional cognitive processes. In order to address this discrepancy, we introduce a… ▽ More

    Submitted 26 June, 2024; originally announced June 2024.

    Comments: 11 pages, 5 figures. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

  29. arXiv:2406.18067  [pdf, other

    cs.CL eess.AS

    Exploring Energy-Based Models for Out-of-Distribution Detection in Dialect Identification

    Authors: Yaqian Hao, Chenguang Hu, Yingying Gao, Shilei Zhang, Junlan Feng

    Abstract: The diverse nature of dialects presents challenges for models trained on specific linguistic patterns, rendering them susceptible to errors when confronted with unseen or out-of-distribution (OOD) data. This study introduces a novel margin-enhanced joint energy model (MEJEM) tailored specifically for OOD detection in dialects. By integrating a generative model and the energy margin loss, our appro… ▽ More

    Submitted 26 June, 2024; originally announced June 2024.

  30. arXiv:2406.18065  [pdf, other

    eess.AS cs.SD

    On Calibration of Speech Classification Models: Insights from Energy-Based Model Investigations

    Authors: Yaqian Hao, Chenguang Hu, Yingying Gao, Shilei Zhang, Junlan Feng

    Abstract: For speech classification tasks, deep learning models often achieve high accuracy but exhibit shortcomings in calibration, manifesting as classifiers exhibiting overconfidence. The significance of calibration lies in its critical role in guaranteeing the reliability of decision-making within deep learning systems. This study explores the effectiveness of Energy-Based Models in calibrating confiden… ▽ More

    Submitted 26 June, 2024; originally announced June 2024.

  31. arXiv:2406.17788  [pdf, other

    eess.SY cs.LG eess.SP

    CNN-based Compressor Mass Flow Estimator in Industrial Aircraft Vapor Cycle System

    Authors: Justin Reverdi, Sixin Zhang, Saïd Aoues, Fabrice Gamboa, Serge Gratton, Thomas Pellegrini

    Abstract: In Vapor Cycle Systems, the mass flow sensor playsa key role for different monitoring and control purposes. However,physical sensors can be inaccurate, heavy, cumbersome, expensive orhighly sensitive to vibrations, which is especially problematic whenembedded into an aircraft. The conception of a virtual sensor, basedon other standard sensors, is a good alternative. This paper has twomain objectiv… ▽ More

    Submitted 27 May, 2024; originally announced June 2024.

  32. arXiv:2406.16189  [pdf, other

    eess.IV cs.CV

    Fuzzy Attention-based Border Rendering Network for Lung Organ Segmentation

    Authors: Sheng Zhang, Yang Nan, Yingying Fang, Shiyi Wang, Xiaodan Xing, Zhifan Gao, Guang Yang

    Abstract: Automatic lung organ segmentation on CT images is crucial for lung disease diagnosis. However, the unlimited voxel values and class imbalance of lung organs can lead to false-negative/positive and leakage issues in advanced methods. Additionally, some slender lung organs are easily lost during the recycled down/up-sample procedure, e.g., bronchioles & arterioles, causing severe discontinuity issue… ▽ More

    Submitted 1 July, 2024; v1 submitted 23 June, 2024; originally announced June 2024.

    Comments: MICCAI 2024

  33. arXiv:2406.15047  [pdf, other

    cs.IT eess.SP

    Optimal Transmit Signal Design for Multi-Target MIMO Sensing Exploiting Prior Information

    Authors: Jiayi Yao, Shuowen Zhang

    Abstract: In this paper, we study the transmit signal optimization in a multiple-input multiple-output (MIMO) radar system for sensing the angle information of multiple targets via their reflected echo signals. We consider a challenging and practical scenario where the angles to be sensed are unknown and random, while their probability information is known a priori for exploitation. First, we establish an a… ▽ More

    Submitted 21 June, 2024; originally announced June 2024.

    Comments: submitted for possible piblication

  34. arXiv:2406.13674  [pdf, other

    eess.IV cs.CV

    Rethinking Abdominal Organ Segmentation (RAOS) in the clinical scenario: A robustness evaluation benchmark with challenging cases

    Authors: Xiangde Luo, Zihan Li, Shaoting Zhang, Wenjun Liao, Guotai Wang

    Abstract: Deep learning has enabled great strides in abdominal multi-organ segmentation, even surpassing junior oncologists on common cases or organs. However, robustness on corner cases and complex organs remains a challenging open problem for clinical adoption. To investigate model robustness, we collected and annotated the RAOS dataset comprising 413 CT scans ($\sim$80k 2D images, $\sim$8k 3D organ annot… ▽ More

    Submitted 19 June, 2024; originally announced June 2024.

    Comments: 10 pages, 1 figure, 6 tables, Early Accept to MICCAI 2024

  35. arXiv:2406.13268  [pdf, other

    eess.AS cs.SD

    CEC: A Noisy Label Detection Method for Speaker Recognition

    Authors: Yao Shen, Yingying Gao, Yaqian Hao, Chenguang Hu, Fulin Zhang, Junlan Feng, Shilei Zhang

    Abstract: Noisy labels are inevitable, even in well-annotated datasets. The detection of noisy labels is of significant importance to enhance the robustness of speaker recognition models. In this paper, we propose a novel noisy label detection approach based on two new statistical metrics: Continuous Inconsistent Counting (CIC) and Total Inconsistent Counting (TIC). These metrics are calculated through Cros… ▽ More

    Submitted 19 June, 2024; originally announced June 2024.

    Comments: interspeech 2024

  36. arXiv:2406.11169   

    eess.AS cs.SD

    Self-Distillation Prototypes Network: Learning Robust Speaker Representations without Supervision

    Authors: Yafeng Chen, Siqi Zheng, Hui Wang, Luyao Cheng, Qian Chen, Shiliang Zhang, Wen Wang

    Abstract: Training speaker-discriminative and robust speaker verification systems without explicit speaker labels remains a persisting challenge. In this paper, we propose a new self-supervised speaker verification approach, Self-Distillation Prototypes Network (SDPN), which effectively facilitates self-supervised speaker representation learning. SDPN assigns the representation of the augmented views of an… ▽ More

    Submitted 25 June, 2024; v1 submitted 16 June, 2024; originally announced June 2024.

    Comments: We update this paper to an earlier paper

  37. arXiv:2406.10591  [pdf, other

    eess.AS cs.AI cs.CV cs.MM cs.SD

    MINT: a Multi-modal Image and Narrative Text Dubbing Dataset for Foley Audio Content Planning and Generation

    Authors: Ruibo Fu, Shuchen Shi, Hongming Guo, Tao Wang, Chunyu Qiang, Zhengqi Wen, Jianhua Tao, Xin Qi, Yi Lu, Xiaopeng Wang, Zhiyong Wang, Yukun Liu, Xuefei Liu, Shuai Zhang, Guanjun Li

    Abstract: Foley audio, critical for enhancing the immersive experience in multimedia content, faces significant challenges in the AI-generated content (AIGC) landscape. Despite advancements in AIGC technologies for text and image generation, the foley audio dubbing remains rudimentary due to difficulties in cross-modal scene matching and content correlation. Current text-to-audio technology, which relies on… ▽ More

    Submitted 15 June, 2024; originally announced June 2024.

  38. arXiv:2406.09589  [pdf, other

    eess.AS

    Multi-Channel Multi-Speaker ASR Using Target Speaker's Solo Segment

    Authors: Yiwen Shao, Shi-Xiong Zhang, Yong Xu, Meng Yu, Dong Yu, Daniel Povey, Sanjeev Khudanpur

    Abstract: In the field of multi-channel, multi-speaker Automatic Speech Recognition (ASR), the task of discerning and accurately transcribing a target speaker's speech within background noise remains a formidable challenge. Traditional approaches often rely on microphone array configurations and the information of the target speaker's location or voiceprint. This study introduces the Solo Spatial Feature (S… ▽ More

    Submitted 17 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

    Comments: Accepted for presentation at Interspeech 2024

  39. arXiv:2406.09444  [pdf, other

    eess.AS cs.CL cs.SD

    GenDistiller: Distilling Pre-trained Language Models based on an Autoregressive Generative Model

    Authors: Yingying Gao, Shilei Zhang, Chao Deng, Junlan Feng

    Abstract: Pre-trained speech language models such as HuBERT and WavLM leverage unlabeled speech data for self-supervised learning and offer powerful representations for numerous downstream tasks. Despite the success of these models, their high requirements for memory and computing resource hinder their application on resource restricted devices. Therefore, this paper introduces GenDistiller, a novel knowled… ▽ More

    Submitted 21 June, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

    Comments: arXiv admin note: text overlap with arXiv:2310.13418

  40. Blind Super-Resolution via Meta-learning and Markov Chain Monte Carlo Simulation

    Authors: Jingyuan Xia, Zhixiong Yang, Shengxi Li, Shuanghui Zhang, Yaowen Fu, Deniz Gündüz, Xiang Li

    Abstract: Learning-based approaches have witnessed great successes in blind single image super-resolution (SISR) tasks, however, handcrafted kernel priors and learning based kernel priors are typically required. In this paper, we propose a Meta-learning and Markov Chain Monte Carlo (MCMC) based SISR approach to learn kernel priors from organized randomness. In concrete, a lightweight network is adopted as k… ▽ More

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: This paper has been accepted for publication in IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)

  41. arXiv:2406.07801  [pdf, other

    cs.CL cs.SD eess.AS

    PolySpeech: Exploring Unified Multitask Speech Models for Competitiveness with Single-task Models

    Authors: Runyan Yang, Huibao Yang, Xiqing Zhang, Tiantian Ye, Ying Liu, Yingying Gao, Shilei Zhang, Chao Deng, Junlan Feng

    Abstract: Recently, there have been attempts to integrate various speech processing tasks into a unified model. However, few previous works directly demonstrated that joint optimization of diverse tasks in multitask speech models has positive influence on the performance of individual tasks. In this paper we present a multitask speech model -- PolySpeech, which supports speech recognition, speech synthesis,… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: 5 pages, 2 figures

  42. arXiv:2406.07289  [pdf, other

    cs.CL cs.AI cs.SD eess.AS

    Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data?

    Authors: Qingkai Fang, Shaolei Zhang, Zhengrui Ma, Min Zhang, Yang Feng

    Abstract: Recently proposed two-pass direct speech-to-speech translation (S2ST) models decompose the task into speech-to-text translation (S2TT) and text-to-speech (TTS) within an end-to-end model, yielding promising results. However, the training of these models still relies on parallel speech data, which is extremely challenging to collect. In contrast, S2TT and TTS have accumulated a large amount of data… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: ACL 2024 main conference. Project Page: https://ictnlp.github.io/ComSpeech-Site/

    ACM Class: I.2.7

  43. arXiv:2406.06937  [pdf, other

    cs.CL cs.AI cs.SD eess.AS

    A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Any Translation

    Authors: Zhengrui Ma, Qingkai Fang, Shaolei Zhang, Shoutao Guo, Yang Feng, Min Zhang

    Abstract: Simultaneous translation models play a crucial role in facilitating communication. However, existing research primarily focuses on text-to-text or speech-to-text models, necessitating additional cascade components to achieve speech-to-speech translation. These pipeline methods suffer from error propagation and accumulate delays in each cascade component, resulting in reduced synchronization betwee… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: ACL 2024; Codes and demos are at https://github.com/ictnlp/NAST-S2x

  44. arXiv:2406.06619  [pdf, other

    eess.AS cs.AI cs.CL

    LoRA-Whisper: Parameter-Efficient and Extensible Multilingual ASR

    Authors: Zheshu Song, Jianheng Zhuo, Yifan Yang, Ziyang Ma, Shixiong Zhang, Xie Chen

    Abstract: Recent years have witnessed significant progress in multilingual automatic speech recognition (ASR), driven by the emergence of end-to-end (E2E) models and the scaling of multilingual datasets. Despite that, two main challenges persist in multilingual ASR: language interference and the incorporation of new languages without degrading the performance of the existing ones. This paper proposes LoRA-W… ▽ More

    Submitted 7 June, 2024; originally announced June 2024.

    Comments: 5 pages, 2 figures, conference

  45. arXiv:2406.05839  [pdf, other

    eess.AS cs.AI

    MaLa-ASR: Multimedia-Assisted LLM-Based ASR

    Authors: Guanrou Yang, Ziyang Ma, Fan Yu, Zhifu Gao, Shiliang Zhang, Xie Chen

    Abstract: As more and more information-rich data like video become available, utilizing multi-modal auxiliary information to enhance audio tasks has sparked widespread research interest. The recent surge in research on LLM-based audio models provides fresh perspectives for tackling audio tasks. Given that LLM can flexibly ingest multiple inputs, we propose MaLa-ASR, an LLM-based ASR model that can integrate… ▽ More

    Submitted 13 June, 2024; v1 submitted 9 June, 2024; originally announced June 2024.

  46. arXiv:2406.03049  [pdf, other

    cs.CL cs.AI cs.SD eess.AS

    StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning

    Authors: Shaolei Zhang, Qingkai Fang, Shoutao Guo, Zhengrui Ma, Min Zhang, Yang Feng

    Abstract: Simultaneous speech-to-speech translation (Simul-S2ST, a.k.a streaming speech translation) outputs target speech while receiving streaming speech inputs, which is critical for real-time communication. Beyond accomplishing translation between speech, Simul-S2ST requires a policy to control the model to generate corresponding target speech at the opportune moment within speech inputs, thereby posing… ▽ More

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: Accepted to ACL 2024 main conference, Project Page: https://ictnlp.github.io/StreamSpeech-site/

  47. arXiv:2406.02430  [pdf, other

    eess.AS cs.SD

    Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

    Authors: Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, Mingqing Gong, Peisong Huang, Qingqing Huang, Zhiying Huang, Yuanyuan Huo, Dongya Jia, Chumin Li, Feiya Li, Hui Li, Jiaxin Li, Xiaoyang Li, Xingxing Li, Lin Liu, Shouda Liu, Sichao Liu , et al. (21 additional authors not shown)

    Abstract: We introduce Seed-TTS, a family of large-scale autoregressive text-to-speech (TTS) models capable of generating speech that is virtually indistinguishable from human speech. Seed-TTS serves as a foundation model for speech generation and excels in speech in-context learning, achieving performance in speaker similarity and naturalness that matches ground truth human speech in both objective and sub… ▽ More

    Submitted 4 June, 2024; originally announced June 2024.

  48. arXiv:2406.02291  [pdf, other

    cs.NI eess.SP

    A deep-learning-based MAC for integrating channel access, rate adaptation and channel switch

    Authors: Jiantao Xin, Wei Xu, Bin Cao, Taotao Wang, Shengli Zhang

    Abstract: With increasing density and heterogeneity in unlicensed wireless networks, traditional MAC protocols, such as carrier-sense multiple access with collision avoidance (CSMA/CA) in Wi-Fi networks, are experiencing performance degradation. This is manifested in increased collisions and extended backoff times, leading to diminished spectrum efficiency and protocol coordination. Addressing these issues,… ▽ More

    Submitted 4 June, 2024; originally announced June 2024.

  49. arXiv:2406.02247  [pdf, other

    physics.ins-det eess.SY

    A Study of the Latest Updates of the Readout System for the Hybird-Pixel Detector at HEPS

    Authors: Hangxu Li, Jie Zhang, Wei Wei, Zhenjie Li, Xiaolu Ji, Yan Zhang, Xuanzheng Yang, Shuihan Zhang, Xueke Ma, Peng Liu, Zheng Wang, Yuanbai Chen

    Abstract: The High Energy Photon Source (HEPS) represents a fourth-generation light source. This facility has made unprecedented advancements in accelerator technology, necessitating the development of new detectors to satisfy physical requirements such as single-photon resolution, large dynamic range, and high frame rates. Since 2016, the Institute of High Energy Physics has introduced the first user-exper… ▽ More

    Submitted 4 June, 2024; originally announced June 2024.

  50. arXiv:2406.02167  [pdf, other

    eess.AS eess.SP

    ERes2NetV2: Boosting Short-Duration Speaker Verification Performance with Computational Efficiency

    Authors: Yafeng Chen, Siqi Zheng, Hui Wang, Luyao Cheng, Qian Chen, Shiliang Zhang, Junjie Li

    Abstract: Speaker verification systems experience significant performance degradation when tasked with short-duration trial recordings. To address this challenge, a multi-scale feature fusion approach has been proposed to effectively capture speaker characteristics from short utterances. Constrained by the model's size, a robust backbone Enhanced Res2Net (ERes2Net) combining global and local feature fusion… ▽ More

    Submitted 4 June, 2024; originally announced June 2024.