
On the open prompt challenge in conditional audio generation

E Chang, S Srinivasan, M Luthra, PJ Lin… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org

E Chang, S Srinivasan, M Luthra, PJ Lin, V Nagaraja, F Iandola, Z Liu, Z Ni, C Zhao, Y Shi…

ICASSP 2024-2024 IEEE International Conference on Acoustics …, 2024•ieeexplore.ieee.org

Text-to-audio generation (TTA) produces audio from a text description, learning from pairs of audio samples and hand-annotated text. However, commercializing audio generation is challenging as user-input prompts are often under-specified when compared to text descriptions used to train TTA models. In this work, we treat TTA models as a "blackbox" and address the user prompt challenge with two key insights: (1) User prompts are generally under-specified, leading to a large alignment gap between user prompts and training prompts. (2) There is a distribution of audio descriptions for which TTA models are better at generating higher quality audio, which we refer to as "audionese". To this end, we rewrite prompts with instruction-tuned models and propose utilizing text-audio alignment as feedback signals via margin ranking learning for audio improvements. On both objective and subjective human evaluations, we observed marked improvements in both text-audio alignment and music audio quality.
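The abstract names margin ranking learning over text-audio alignment scores but gives no formulation. As a rough illustration only, the sketch below shows the standard pairwise margin ranking (hinge) loss. The function names, the margin value, and the idea that each score is an alignment score for a candidate rewritten prompt are assumptions for illustration, not details taken from the paper.

```python
def margin_ranking_loss(score_preferred, score_other, margin=0.1):
    """Pairwise hinge loss.

    Returns zero once the preferred candidate's score exceeds the other
    candidate's score by at least `margin`; otherwise the shortfall.
    Scores are hypothetical text-audio alignment scores.
    """
    return max(0.0, margin - (score_preferred - score_other))


def mean_ranking_loss(pairs, margin=0.1):
    """Average the pairwise loss over (preferred, other) score pairs."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    return sum(margin_ranking_loss(p, o, margin) for p, o in pairs) / len(pairs)
```

Under these assumptions, a pair that is already ranked correctly by a wide enough gap contributes nothing, while a misranked pair is penalized in proportion to how far it falls short of the margin.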



