VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild

Puyuan Peng1   Po-Yao Huang2   Shang-Wen Li2
Abdelrahman Mohamed3   David Harwath1
1The University of Texas at Austin   2FAIR, Meta   3Rembrand
[email protected]
Abstract

We introduce VoiceCraft, a token infilling neural codec language model that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on audiobooks, internet videos, and podcasts (data, code, and model weights are available at https://github.com/jasonppy/VoiceCraft). VoiceCraft employs a Transformer decoder architecture and introduces a token rearrangement procedure that combines causal masking and delayed stacking to enable generation within an existing sequence. On speech editing tasks, VoiceCraft produces edited speech that is nearly indistinguishable from unedited recordings in terms of naturalness, as evaluated by humans; for zero-shot TTS, our model outperforms prior SotA models including VALL-E and the popular commercial model XTTS v2. Crucially, the models are evaluated on challenging and realistic datasets that consist of diverse accents, speaking styles, recording conditions, and background noise and music, and our model performs consistently well compared to other models and real recordings. In particular, for speech editing evaluation, we introduce a high-quality, challenging, and realistic dataset named RealEdit. We encourage readers to listen to the demos at https://jasonppy.github.io/VoiceCraft_web.



1 Introduction

Figure 1: Speech editing with VoiceCraft. Human listeners prefer VoiceCraft edited speech over the original real recording 48% of the time in side-by-side naturalness comparison (details in §5.3)

We introduce VoiceCraft, a Transformer-based neural codec language model (NCLM) that performs infilling generation of neural speech codec tokens autoregressively conditioned on bidirectional context. VoiceCraft achieves state-of-the-art (SotA) performance on both speech editing (shown in Fig. 1) and zero-shot TTS. Our method is based on a two-step token rearrangement procedure that consists of a causal masking step and a delayed stacking step. The causal masking technique is inspired by the success of the causal masked multimodal model in joint text-image modeling (Aghajanyan et al., 2022); our adaptation of it to speech codec sequences enables autoregressive generation with bidirectional context. We further integrate causal masking with delayed stacking (Kharitonov et al., 2021a; Copet et al., 2023) in our proposed token rearrangement procedure to ensure efficient multi-codebook modeling.

To evaluate speech editing, we manually crafted a first-of-its-kind, realistic, and challenging dataset named RealEdit. RealEdit consists of 310 real-world speech editing examples, with waveforms sourced from audiobooks (Zen et al., 2019), YouTube videos (Chen et al., 2021a), and Spotify podcasts (Clifton et al., 2020), and durations ranging from 5 seconds to 12 seconds. To create the target transcripts, the transcripts of the source speech are edited in such a way that the edited transcripts remain grammatically correct and semantically coherent. The dataset is designed to cover a wide range of editing scenarios, including insertion, deletion, substitution, and multi-span editing, with the length of the edited text ranging from 1 word to 16 words. Compared to commonly used speech synthesis evaluation datasets that only contain audiobooks such as VCTK (Yamagishi et al., 2019), LJSpeech (Ito and Johnson, 2017), and LibriTTS (Zen et al., 2019), RealEdit is more challenging in that the recordings have diverse content, accents, speaking styles, recording conditions, and background sounds. We believe that the realism and diversity of RealEdit make it a reliable indicator of the practicality of speech editing models in the real world.

In subjective human listening tests, VoiceCraft significantly outperforms the prior SotA speech editing model on RealEdit. Importantly, the edited speech produced by VoiceCraft is nearly indistinguishable from the original unedited recording in terms of naturalness. We found that VoiceCraft generalizes well to zero-shot TTS without any finetuning, achieving SotA performance on a dataset comprised of audiobooks and YouTube videos, outperforming strong baselines including a reproduced VALL-E (Wang et al., 2023a) and the popular commercial model XTTS v2 (COQUI, 2023). In summary, our contributions are:

  1. We introduce VoiceCraft, a neural codec language model for speech editing that generates synthesized speech that is nearly indistinguishable from in-the-wild recordings according to human listeners. We also release the code and model weights for VoiceCraft.

  2. We show that VoiceCraft generalizes well to zero-shot TTS without finetuning.

  3. We release a high-quality, challenging, and realistic speech editing evaluation dataset, RealEdit.

2 Related Work

Neural codec language models (NCLM) and zero-shot TTS. Tokenizing speech signals into sequences of learnable, discrete units and then training a language model on the resulting unit sequences was initially proposed in the context of textless NLP (Hsu et al., 2021; Lakhotia et al., 2021; Kharitonov et al., 2021b; Nguyen et al., 2022), where the goal is to perform NLP tasks directly on spoken utterances without the need to first transcribe the speech into text. Recently, NCLMs that operate on tokens from residual vector quantization (RVQ)-based models (Zeghidour et al., 2021; Defossez et al., 2022) have attracted increased attention due to their high-quality generation. For example, AudioLM (Borsos et al., 2022a) exhibits strong performance on long-term coherent speech continuation. Zero-shot TTS is a task where a model needs to synthesize speech in a target voice that was unseen during training, given only the target transcript and a short reference recording of the target voice. Framing zero-shot TTS as transcript-conditioned speech continuation, VALL-E (Wang et al., 2023a) and Spear-TTS (Kharitonov et al., 2023) are the first applications of NCLMs to this task, significantly outperforming non-NCLM approaches. Zhang et al. (2023) extend VALL-E to cross-lingual TTS. Guo et al. (2022); Yang et al. (2023); Liu et al. (2023); Ji et al. (2023); Lyth and King (2024) adapt NCLMs to style-controlled speech synthesis. Song et al. (2024); Du et al. (2024b) enhance phoneme alignment in NCLMs to reduce error. Wang et al. (2023b) propose a unified NCLM for both speech generation and recognition tasks. Borsos et al. (2023) propose an efficient parallel decoding method. Jiang et al. (2024) propose disentangled timbre and prosody modeling, where the latter is modeled with an NCLM. NCLMs have also been successfully applied to other audio domains. Kreuk et al. (2022) apply an NCLM to sound effects generation, and Agostinelli et al. (2023); Donahue et al. (2023); Garcia et al. (2023); Copet et al. (2023) use NCLMs for music generation.

Speech editing. This task requires a model to alter words or phrases within an utterance to match a target transcript, while the regions of the original speech not targeted for editing must remain unchanged (see Fig. 1 for an example). Early methods achieve text-guided speech insertion and substitution by combining a single-speaker TTS model and a voice conversion model to generate the desired speech segment, which is then concatenated with the unedited part (Jin et al., 2017). Since the generation is not conditioned on the unedited part of the speech, the result sounds unnatural due to prosody mismatch and boundary artifacts (Morrison et al., 2021). More recent speech editing models have attempted to condition their generation on the surrounding speech context. Tan et al. (2021) use two unidirectional LSTM models with bidirectional fusion. Wang et al. (2022); Bai et al. (2022); Borsos et al. (2022b) use the masked reconstruction objective with convolutional or Transformer models to further improve contextualization. FluentSpeech (Jiang et al., 2023b) is a diffusion-based speech editing model that achieves SotA performance on speech editing on LibriTTS and VCTK.

The research community has started to investigate the possibility of having a unified model for both zero-shot TTS and speech editing. Yin et al. (2022); Jiang et al. (2023a) propose modular models for the two tasks, while our model is end-to-end. Concurrent work SpeechX (Wang et al., 2023c) adapts VALL-E by prompt tuning for a range of tasks including speech editing and zero-shot TTS, but no human evaluation is conducted in their paper. Concurrent work UniCATS (Du et al., 2024a) is a diffusion-based modular model for the two tasks. However, their model is only evaluated on masked speech reconstruction of spans shorter than 2 seconds, while our model is evaluated on edits of up to 16 words. Voicebox (Le et al., 2023) is a recent flow-matching-based model capable of a wide range of tasks including speech editing and zero-shot TTS. However, its speech editing capability is not evaluated in their paper and is only shown on their demo page. We therefore compare our model's editing results with Voicebox's on our demo page, using the same examples from their demo page.

3 Method

VoiceCraft casts both sequence infilling (for speech editing) and continuation (for zero-shot TTS) as a simple left-to-right language modeling task by rearranging the neural codec's output tokens. The rearrangement involves two steps: (1) causal masking (§3.1) to enable autoregressive continuation/infilling with bidirectional context and (2) delayed stacking (§3.2) to ensure efficient multi-codebook modeling. VoiceCraft employs a decoder-only Transformer and is trained with an autoregressive sequence prediction objective (§3.3). We introduce the inference setup for speech editing and zero-shot TTS in §3.4.

Figure 2: An example of the token rearrangement procedure and modeling framework. The rearrangement procedure involves two steps: (1) Causal masking, where masked spans are replaced with mask tokens and moved to the end, and (2) Delayed stacking, where tokens are shifted in the time dimension based on their codebook index.

3.1 Rearrangement Step 1: Causal Masking

As shown on the left-hand side of Fig. 2, given a continuous speech waveform as input, we first use Encodec (Defossez et al., 2022) to quantize it into a $T$ by $K$ codec matrix $X$, where $T$ is the number of temporal frames and $K$ is the number of RVQ codebooks. $X$ can be written as $(X_1, \cdots, X_T)$, where $X_t$ is a vector of length $K$ representing the codes from different codebooks at time step $t$, and we assume that the code from codebook $k$ models the residual from codebook $k-1$. During training, our goal is to randomly mask some span of tokens $(X_{t_0}, \dots, X_{t_1})$, and then autoregressively predict these masked tokens conditioned on all of the unmasked tokens. This is a problem when $t_1 < T$, because we cannot condition on future outputs when performing autoregressive generation. We need to modify the masking on $X$ so that it is causal, by moving the span to be masked to the end of the sequence, so that when infilling these tokens the model can condition on both past and future unmasked tokens (Aghajanyan et al., 2022; Donahue et al., 2020; Bavarian et al., 2022).

The procedure outlined above can be trivially extended to multiple masked spans by simply moving all masked spans to the end of the sequence. The number of spans to be masked, $n$, is sampled from $\text{Poisson}(\lambda)$, and then for each span, we sample a span length $l \sim \text{Uniform}(1, L)$. Finally, we randomly select the locations of the spans within $X$ under the constraint that they do not overlap with each other. The selected $n$ spans are then replaced with mask tokens $\langle M_1 \rangle, \cdots, \langle M_n \rangle$. The original tokens within these masked spans are moved to the end of the sequence $X$, with each span preceded by its corresponding mask token.
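The span sampling described above can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the released implementation: the function names, the rejection-style placement loop, and the use of clamping (rather than rejection sampling) to bound the span count are all hypothetical choices. The default bounds of 1 and 3 mirror the truncation reported in §5.1.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) sample with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_mask_spans(T, lam, L, rng, n_min=1, n_max=3, max_tries=1000):
    """Return sorted, non-overlapping (start, end) frame spans, end exclusive.

    T: number of frames, lam: Poisson rate, L: maximum span length.
    n_min / n_max clamp the span count (an assumption, see lead-in).
    A span that cannot be placed after max_tries attempts is skipped.
    """
    n = min(max(sample_poisson(lam, rng), n_min), n_max)
    spans = []
    for _ in range(n):
        for _ in range(max_tries):
            length = rng.randint(1, min(L, T))   # Uniform(1, L), capped at T
            start = rng.randint(0, T - length)
            end = start + length
            # Accept only if the candidate does not overlap any chosen span.
            if all(end <= s or start >= e for s, e in spans):
                spans.append((start, end))
                break
    return sorted(spans)
```

For example, `sample_mask_spans(500, 1.0, 600, random.Random(0))` returns between one and three disjoint spans inside a 500-frame utterance.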

Consider this example: let $X = (X_1, \dots, X_6)$ and imagine we wish to mask a single span from $X_2$ to $X_4$. The original sequence $X$ is rearranged into $Y = (Y_1; \langle M_1 \rangle; Y_2; \langle M_1 \rangle; Y_3)$, where $Y_1 = (X_1)$, $Y_2 = (X_5, X_6)$, and $Y_3 = (X_2, X_3, X_4)$. We call $Y_1$ and $Y_2$ the unmasked spans, and $Y_3$ the masked span. An end of span or EOS token is added to the end of each masked span (in this example at the end of $Y_3$), and an end of utterance or EOU token is added to the end of the utterance (i.e. $Y_2$). For simplicity, we do not explicitly denote these special tokens and assume they are part of the spans.
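The rearrangement in this example can be reproduced with a short Python sketch. The function name and the string placeholders for the special tokens are hypothetical; each list element stands in for one length-$K$ code vector.

```python
def causal_mask(frames, spans):
    """Rearrange `frames` so that the masked spans move to the end.

    frames: list of per-timestep items (each stands for a length-K code vector).
    spans:  sorted, non-overlapping (start, end) index pairs, end exclusive.
    """
    unmasked_part, masked_part = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans, start=1):
        mask = f"<M{i}>"
        unmasked_part.extend(frames[cursor:start])
        unmasked_part.append(mask)        # placeholder where the span was
        masked_part.append(mask)          # the same mask token precedes the span
        masked_part.extend(frames[start:end])
        masked_part.append("<EOS>")       # end of span
        cursor = end
    unmasked_part.extend(frames[cursor:])
    unmasked_part.append("<EOU>")         # end of utterance
    return unmasked_part + masked_part


X = ["X1", "X2", "X3", "X4", "X5", "X6"]
Y = causal_mask(X, [(1, 4)])  # mask X2..X4
print(Y)
# ['X1', '<M1>', 'X5', 'X6', '<EOU>', '<M1>', 'X2', 'X3', 'X4', '<EOS>']
```

Because the masked span now sits after every unmasked token, a left-to-right model predicting `X2, X3, X4` has already seen both `X1` and `X5, X6`.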

3.2 Rearrangement Step 2: Delayed Stacking

After the causal masking token rearrangement, each timestep of the rearranged matrix $Y$ is a vector of $K$ tokens. Copet et al. (2023) observed that when performing autoregressive generation over stacked RVQ tokens, it is advantageous to apply a delay pattern so that the prediction of codebook $k$ at time $t$ can be conditioned on the prediction of codebook $k-1$ from the same timestep. We take a similar approach, which we describe here. Assume a span $Y_s$ is of shape $L_s \times K$. Applying the delay pattern rearranges it into $Z_s = (Z_{s,0}, Z_{s,1}, \cdots, Z_{s,L_s+K-1})$, where $Z_{s,t}$, $t \in [L_s+K-1]$, is defined as follows (here $[N]$ denotes the integer set $\{0, 1, \cdots, N\}$):

$$Z_{s,t} = (Y_{s,t,1}, Y_{s,t-1,2}, \cdots, Y_{s,t-K+1,K}) \tag{1}$$

where $Y_{s,t-k+1,k}$ denotes the token located at coordinate $(t-k+1, k)$ in matrix $Y_s$, i.e. the $k$th codebook entry at the $(t-k+1)$th timestep. To make sure that $\forall t \in [L_s+K-1]$, $Z_{s,t}$ contains $K$ valid tokens, we introduce a special learnable [empty] token and define $Y_{s,t-k+1,k} \triangleq \texttt{[empty]}$ for all $t$ such that $t < k$ or $t-k+1 > L_s$. Note that the mask tokens are not part of any span and are not changed during delayed stacking. We define the resulting matrix of delayed stacking as $Z = (Z_1, \langle M_1 \rangle, Z_2, \langle M_1 \rangle, \cdots, \langle M_{\frac{S-1}{2}} \rangle, Z_S)$ (assuming $Y$ consists of $S$ spans). See the diagram for $Z$ in Fig. 2 for an illustration.
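A minimal Python sketch of the delay pattern for a single span follows. It uses 0-indexed timesteps and codebooks, so the index bookkeeping differs slightly from Eq. 1, and it emits $L_s + K - 1$ rows; the function name and the string placeholder for the [empty] token are assumptions made for illustration.

```python
EMPTY = "[empty]"


def delayed_stack(span, K):
    """Apply the delay pattern to one span.

    span: list of L rows, each a list of K codes (one row per timestep).
    Returns L + K - 1 rows; codebook k (0-indexed) is shifted right by k steps,
    and positions with no source token are filled with EMPTY.
    """
    L = len(span)
    rows = []
    for t in range(L + K - 1):
        row = []
        for k in range(K):
            src = t - k  # source timestep for codebook k
            row.append(span[src][k] if 0 <= src < L else EMPTY)
        rows.append(row)
    return rows


span = [["a1", "a2"], ["b1", "b2"], ["c1", "c2"]]  # L = 3 timesteps, K = 2 codebooks
for row in delayed_stack(span, K=2):
    print(row)
# ['a1', '[empty]']
# ['b1', 'a2']
# ['c1', 'b2']
# ['[empty]', 'c2']
```

In the second output row, the code `a2` from the second codebook sits one step after `a1`, so by the time the model predicts `a2` it has already generated `a1`.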

3.3 Modeling

As shown on the right-hand side of Fig. 2, we use a Transformer decoder to model $Z$ autoregressively, conditioned on the transcript of the speech $W$. Therefore, the input to the decoder is $[W; Z]$, where ";" denotes concatenation. At timestep $t$ of span $s$ in codec matrix $Z$, the model predicts all $K$ tokens of $Z_{s,t}$ simultaneously, by using $K$ MLP heads to project the Transformer's final hidden state to $K$ sets of logits, one for each of the $K$ codebooks. Note that the prediction is conditioned on the transcript $W$ and all tokens in $Z$ before $Z_{s,t}$, denoted as $H_{s,t}$. Mathematically, the Transformer decoder models the factorized conditional distribution of $Z$:

$$\mathbb{P}_{\theta}(Z \mid W) = \prod_{s}\prod_{t} \mathbb{P}_{\theta}(Z_{s,t} \mid W, H_{s,t}) \tag{2}$$
$$\phantom{\mathbb{P}_{\theta}(Z \mid W)} = \prod_{s}\prod_{t}\prod_{k=1}^{K} \mathbb{P}_{\theta}(Z_{s,t,k} \mid W, H_{s,t}) \tag{3}$$

where $\theta$ represents the parameters of the model. Equation 2 is the autoregressive factorization across time, while Equation 3 is the factorization across codebooks under an independence assumption: given $W$ and $H_{s,t}$, the $K$ RVQ codes in $Z_{s,t}$ are assumed to be independent of each other. We argue in appendix D that this assumption is mild.

With the token-level probability formulation in Equation 3, we derive the training loss as the negative log-likelihood $\mathcal{L}(\theta) = -\log \mathbb{P}_{\theta}(Z \mid W) = \sum_{k=1}^{K} \mathcal{L}_k(\theta)$, where $\mathcal{L}_k(\theta) = -\sum_{s}\sum_{t} \log \mathbb{P}_{\theta}(Z_{s,t,k} \mid W, H_{s,t})$ is the negative log-likelihood of the tokens from codebook $k$. Empirically, we found that weighting the first residual codebooks more than the later codebooks leads to better performance, and therefore our final loss is $\mathcal{L}(\theta) = \sum_{k=1}^{K} \alpha_k \mathcal{L}_k(\theta)$, where $(\alpha_k)_{k=1}^{K}$ are tunable hyperparameters. Note that we follow Aghajanyan et al. (2022) and calculate the prediction loss on all tokens (not just the tokens in the masked spans), except for mask tokens and [empty] tokens.
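The weighted loss can be illustrated with a small pure-Python sketch. It assumes the per-codebook term is a plain sum of negative log-probabilities over the positions that count toward the loss; whether the actual training code sums or averages is not stated here, and the function and argument names are hypothetical.

```python
import math


def weighted_codebook_nll(target_probs, keep, alphas):
    """Weighted sum over codebooks of the negative log-likelihood.

    target_probs[k][i]: probability the model assigns to the correct token of
                        codebook k at position i.
    keep[k][i]:         False for positions excluded from the loss
                        (mask tokens and [empty] tokens).
    alphas[k]:          weight of codebook k.
    """
    total = 0.0
    for probs_k, keep_k, alpha_k in zip(target_probs, keep, alphas):
        # Per-codebook NLL, summed over the positions that are kept.
        nll_k = -sum(math.log(p) for p, use in zip(probs_k, keep_k) if use)
        total += alpha_k * nll_k
    return total


# Two codebooks, two positions; the last position of codebook 2 is [empty].
probs = [[0.5, 0.5], [0.25, 1.0]]
keep = [[True, True], [True, False]]
loss = weighted_codebook_nll(probs, keep, alphas=[5, 1])
```

With these toy numbers each codebook contributes $2\ln 2$ before weighting, so the weights $(5, 1)$ make errors in the first codebook count five times as much as errors in the second.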

Table 1: Examples of the speech editing dataset RealEdit. More examples are shown in table 8.

| Edit Types | Original | Edited |
|---|---|---|
| deletion | I wrote the title of the course many years ago, ah, when I created this course. | I wrote the title when I created this course. |
| insertion | And we’re at this point. | And we’re all extremely excited at this point. |
| substitution, substitution | See why it’s extremely valuable to it’s kind of like it’s kind of like having a wall hack to watch a demo. | See why it’s extremely important right? it’s kind of like having a rough time to watch a demo. |

3.4 Inference

Speech Editing. The setting for speech editing is the following: we have a speech recording $R$ and its transcript $W$, and we want the model to modify only the relevant spans of $R$ so that it matches the target transcript $W'$. We assume that $W'$ is an edited version of $W$, where some words have been inserted, substituted, or deleted. This task is almost exactly the same as the training task, with two differences: 1) during training, the input transcript is simply the transcript of the original recording $W$, while during inference it is a modified transcript $W'$; 2) during training, the spans to be masked (i.e. edited) are chosen randomly, while during inference we select them by comparing the original transcript and the target transcript to identify the words that should be masked out, and then use the word-level forced alignment of the original transcript to identify the codec token spans that correspond to these words. To ensure a smooth transition between the edited speech and the unedited speech, the neighboring words surrounding the span to be edited also need to be slightly modified in order to model co-articulation effects. Therefore, we specify a small margin hyperparameter $\epsilon$, and extend the mask span length by $\epsilon$ on both the left and right sides. (For substitution and deletion, the spans to be masked are just those words that differ from the target, plus the margin; for insertion, the spans are just the left and right margins, spanning from the middle of the two words where the insertion happens.)

During autoregressive generation, we feed the model the target transcript with all unmasked spans, with mask tokens inserted in the locations where the edits should take place. We then have the model autoregressively continue this sequence, whereby it fills in the masked spans. The generated codec tokens are then spliced back into their correct location in the utterance, and we map the complete codec token sequence back to a waveform using the Encodec decoder network.
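The span selection step described above can be sketched with Python's standard `difflib`. This is an illustration under assumptions rather than the authors' code: the function name, the word-level diff via `SequenceMatcher`, the default margin of 0.08 s, and the rounding to frames at 50 Hz are all hypothetical choices, and the forced alignment is taken as given.

```python
import difflib


def edit_spans_to_frames(orig_words, target_words, alignment, margin=0.08, fps=50):
    """Map transcript differences to codec-frame spans to mask.

    alignment[i] = (start_sec, end_sec) of orig_words[i] from forced alignment.
    Returns a list of (start_frame, end_frame) pairs, end exclusive.
    """
    total = alignment[-1][1]  # end time of the last word
    spans = []
    matcher = difflib.SequenceMatcher(a=orig_words, b=target_words, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        if tag == "insert":
            # Nothing to remove: centre the span on the gap between neighbours.
            left = alignment[i1 - 1][1] if i1 > 0 else 0.0
            right = alignment[i1][0] if i1 < len(alignment) else total
            start = end = (left + right) / 2
        else:  # "replace" or "delete": mask the words that differ
            start, end = alignment[i1][0], alignment[i2 - 1][1]
        # Extend by the margin on both sides, staying inside the recording.
        start = max(0.0, start - margin)
        end = min(total, end + margin)
        spans.append((round(start * fps), round(end * fps)))
    return spans


orig = "and we're at this point".split()
target = "and we're all extremely excited at this point".split()
align = [(0.0, 0.2), (0.2, 0.5), (0.62, 0.7), (0.7, 0.9), (0.9, 1.3)]
print(edit_spans_to_frames(orig, target, align))  # [(24, 32)]
```

In this insertion example no original word is removed, so the masked region is only the margin on either side of the midpoint between "we're" and "at".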

Zero-shot TTS. As we previously noted, zero-shot TTS for our model is straightforward because it simply corresponds to performing an insertion edit at the end of the original utterance. In this case, the model is provided a voice prompt with its transcription, as well as the target transcript of the speech to be generated. The three inputs are concatenated together and fed to the model, after which it generates the codec sequence of the target transcript autoregressively.

4 RealEdit: a realistic and challenging speech editing dataset

To support as realistic an evaluation as possible, we constructed a first-of-its-kind dataset of 310 manually crafted speech editing examples. Each example consists of a tuple: (original audio, original transcript, edited transcript). The dataset contains 100 utterances from LibriTTS (dev-clean and dev-other) (Zen et al., 2019), 100 utterances from YouTube (from the Gigaspeech test set) (Chen et al., 2021a), and 110 utterances from the Spotify Podcast dataset (Clifton et al., 2020). We manually checked the utterances for accuracy, then had native English speakers revise them to create edited transcripts. For each utterance, we determine the type of modification using predefined probability distributions of editing type, number of disjoint spans to be edited, and editing span length. Specifically, we study the following categories: 1) number of edited spans: 1 or 2; 2) type of edits: insertion, deletion, and substitution; 3) editing span length: short (1-2 words), medium (3-6 words), long (7-12 words). Crucially, an edited transcript must be grammatically correct and semantically coherent. Examples of the dataset are shown in tables 1 and 8, and statistics are shown in table 2.

Table 2: Dataset statistics for speech editing evaluation. Note that for 2-span editing, each example is edited using 2 of the 3 edit types.

| Length / type | Insert. | Delet. | Substi. | Total |
|---|---|---|---|---|
| 1-2 words (1 span) | 8 | 17 | 38 | 63 |
| 3-6 words (1 span) | 22 | 24 | 79 | 125 |
| 7-12 words (1 span) | 15 | 11 | 56 | 82 |
| 1 span total | 45 | 52 | 173 | 270 |
| 2 spans total | 13 | 13 | 54 | 40 |

5 Experiments

5.1 Setup

Data. The Gigaspeech training set (Chen et al., 2021a) is used as the training data; it contains 9k hours of audiobooks, podcasts, and YouTube videos at a 16kHz audio sampling rate. Audio files that are shorter than 2 seconds are dropped. For ablation studies, we use the masked reconstruction task, with a 1000-utterance random subset of the Gigaspeech validation set as the testing utterances (detailed in §C). For speech editing evaluation, we use the proposed RealEdit dataset. For zero-shot TTS evaluation, we constructed a dataset of 250 prompt-transcript pairs from LibriTTS (Zen et al., 2019) and the YouTube portion of the Gigaspeech test set, with half of the examples drawn from each dataset. The length of each voice prompt is kept as close as possible to 3 seconds, with the constraint that we only cut the audio between complete words. The transcript is a concatenation of the transcript of the voice prompt and the target transcript. The target transcripts are chosen from different utterances spoken by the same speaker as the prompt, and range from 8 to 40 words in length. We only select utterances with a WER lower than 15% as measured by Whisper medium.en (Radford et al., 2022).
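The prompt-cutting rule can be sketched as choosing the word boundary nearest to the 3-second target. This is only one plausible reading of "as close as possible"; the function name and the assumption that word end times are available from an alignment are hypothetical.

```python
def cut_prompt(word_ends, target=3.0):
    """Return the word-end time (seconds) closest to `target`.

    word_ends: end times of the words in the recording, in increasing order.
    Cutting only at word ends keeps every word in the prompt complete.
    """
    return min(word_ends, key=lambda t: abs(t - target))


print(cut_prompt([0.4, 1.1, 2.2, 2.9, 3.4, 4.0]))  # 2.9
```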

Model. Encodec (Defossez et al., 2022) is used as the speech tokenizer, which has 4 RVQ codebooks, each with a vocabulary size of 2048, and a codec framerate of 50Hz on 16kHz recordings (see §C for the detailed config). To choose the number of spans to mask in training, we use a Poisson(1) distribution truncated to a minimum of 1 and a maximum of 3. Span lengths are sampled from Uniform(1, 600), i.e. the masked speech can be as long as 12 seconds. At each time step, the embeddings of codes from different codebooks are summed (Wang et al., 2023a), then added to sinusoidal positional encodings (Vaswani et al., 2017), before being fed to the Transformer. Text transcripts are phonemized based on the IPA phoneset using the toolkit provided by Bernard and Titeux (2021). Our main VoiceCraft model has 16 Transformer layers with hidden/FFN dimensions of 2048/8192, and 12 attention heads. The output of the last layer is fed to four separate 2-layer MLP modules to get prediction logits. Our main model has 830M parameters, and the codebook weight hyperparameters $\alpha$ are set to $(5, 1, 0.5, 0.1)$. Ablations on model sizes and codebook weights are shown in §5.2.
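As a rough sanity check on the reported model size, the weight matrices of the Transformer stack can be counted from the layer count and dimensions given above. This sketch assumes a standard Transformer layer (four attention projections and a two-matrix feed-forward block) and ignores biases, layer norms, embeddings, and the MLP output heads, so it is an estimate rather than an exact accounting of the 830M figure.

```python
def transformer_weight_count(layers, d_model, d_ffn):
    """Count weight-matrix parameters of a standard Transformer stack.

    Per layer: 4 attention projections (Q, K, V, output) of d_model x d_model,
    plus 2 feed-forward matrices of d_model x d_ffn.
    Biases, layer norms, embeddings and output heads are ignored.
    """
    attention = 4 * d_model * d_model
    feed_forward = 2 * d_model * d_ffn
    return layers * (attention + feed_forward)


core = transformer_weight_count(layers=16, d_model=2048, d_ffn=8192)
print(f"{core / 1e6:.0f}M")          # 805M

# Longest masked span: 600 frames at a 50 Hz codec framerate.
max_span_seconds = 600 / 50
print(max_span_seconds)              # 12.0
```

The estimate of about 805M for the Transformer stack alone is in line with the stated 830M once the remaining components are added.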

Training and inference. The training of the Encodec model largely follows the setting in Copet et al. (2023), detailed in §C. To train VoiceCraft, we used the ScaledAdam optimizer and Eden scheduler proposed by Yao et al. (2024) with a base learning rate of 0.05, a batch size of 400k frames (i.e. 133.2 minutes), and 50k total training steps with gradient accumulation. The training of the 830M VoiceCraft model took about 2 weeks on 4 NVIDIA A40 GPUs. More details can be found in §C. We compare the performance of ScaledAdam and AdamW in §A.1. For inference, we use nucleus sampling (Holtzman et al., 2020) with $p = 0.8$ and a temperature of 1 for all experiments. Because autoregressive generation is stochastic, we found via manual inspection that while the model produces natural-sounding speech most of the time, it sometimes produces excessively long silences or drags out certain sounds. We found that this happens when the codec token generation gets stuck in a repeating loop. To resolve it, we use a simple heuristic: for each input utterance we generate several different output utterances and throw away the longest outputs. Specifically, for speech editing, we run inference 10 times with different margin parameters, stepping $\epsilon$ up from 0.05 to 0.14 in 0.01 increments. The 4 longest outputs are discarded, and then we randomly select one sample from the remaining 6 outputs. For zero-shot TTS, we reduce the probability of generating the same token in consecutive timesteps in proportion to how many times that token was consecutively generated in the immediately preceding timesteps. In addition, we generate 5 samples with different random seeds, and select the shortest for TTS evaluation. The sample selection process is completely automatic and unsupervised (i.e. no human intervention or ASR scoring).
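The length-based sample selection can be sketched as follows. The function names are hypothetical, and the outputs are represented simply as token sequences whose length stands in for duration; the generation step itself is omitted.

```python
import random


def margin_schedule(start=0.05, step=0.01, runs=10):
    """Margins 0.05, 0.06, ..., 0.14 for the repeated editing runs."""
    return [round(start + i * step, 2) for i in range(runs)]


def select_edit_output(outputs, n_discard=4, rng=random):
    """Drop the `n_discard` longest outputs, then pick one of the rest at random.

    outputs: list of generated token sequences (anything with a length).
    """
    kept = sorted(outputs, key=len)[: len(outputs) - n_discard]
    return rng.choice(kept)


def select_tts_output(outputs):
    """Pick the shortest of the generated samples."""
    return min(outputs, key=len)


# Ten fake editing outputs with lengths 1..10 (one per margin value).
fake_outputs = [[0] * n for n in (5, 9, 3, 7, 8, 2, 6, 4, 10, 1)]
chosen = select_edit_output(fake_outputs, rng=random.Random(0))
assert len(chosen) <= 6  # the four longest (7, 8, 9, 10) were discarded
```

Discarding by length needs no transcript or ASR model, which matches the statement that the selection is fully automatic and unsupervised.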

Baselines. For speech editing, we compare VoiceCraft with the diffusion-based model FluentSpeech (Jiang et al., 2023b), which is the current open-source SotA model for speech editing. Since the original FluentSpeech model is trained on LibriTTS, for a fair comparison we took the official GitHub repo and trained the model on Gigaspeech. Please find more details in §C. For zero-shot TTS, we compare VoiceCraft with VALL-E (Wang et al., 2023a), XTTS v2 (COQUI, 2023), YourTTS (Casanova et al., 2021), and FluentSpeech. Since the original VALL-E is not open-sourced, we use the code from the popular open-source implementation by Li (2023), and also trained the model on Gigaspeech. XTTS v2 is a popular commercial zero-shot TTS model (the GitHub repo hosting XTTS v2 had 26k stars as of Jan 2024) trained on a mixture of publicly available data and web-crawled data, although the exact data sources are unknown. YourTTS is trained on VCTK, LibriTTS, and also French and Portuguese corpora.

Table 3: Effect of scaling model sizes and codebook re-weighting. Lower is better for all metrics.

| Params | Weights | WER | MCD | F0 | Energy |
|---|---|---|---|---|---|
| 120M | (1,1,1,1) | 10.18 | 8.75 | 78.49 | 3.22 |
| 120M | (5,1,0.5,0.1) | 7.75 | 8.31 | 87.74 | 3.54 |
| 430M | (1,1,1,1) | 7.87 | 8.22 | 70.05 | 3.17 |
| 430M | (5,1,0.5,0.1) | 7.30 | 8.13 | 73.41 | 3.19 |
| 830M | (5,1,0.5,0.1) | 6.68 | 8.05 | 67.81 | 3.12 |

Table 4: Performance comparison on speech editing. Intell. = Intelligibility MOS; Nat. = Naturalness MOS.

| Model | WER | Intell. LibriTTS | Intell. YouTube | Intell. Spotify | Intell. Total | Nat. LibriTTS | Nat. YouTube | Nat. Spotify | Nat. Total |
|---|---|---|---|---|---|---|---|---|---|
| FluentSpeech | 4.5 | 3.89±0.09 | 4.08±0.08 | 3.95±0.08 | 3.97±0.05 | 3.42±0.10 | 4.07±0.10 | 3.93±0.10 | 3.81±0.06 |
| VoiceCraft | 6.1 | 4.05±0.08 | 4.14±0.07 | 4.12±0.07 | 4.11±0.05 | 3.68±0.10 | 4.25±0.09 | 4.16±0.08 | 4.03±0.05 |
| Original | 5.4 | 4.22±0.07 | 4.30±0.07 | 4.16±0.08 | 4.22±0.05 | 3.84±0.09 | 4.35±0.08 | 4.29±0.08 | 4.17±0.05 |

Table 5: Side-by-side naturalness comparison of VoiceCraft (VCr) vs. Original (Orig.) and FluentSpeech (FS).

| Comparison | VCr better | Tie | VCr worse |
|---|---|---|---|
| VoiceCraft vs. FS | 56.1% | 19.7% | 24.1% |
| VoiceCraft vs. Orig. | 40.3% | 16.2% | 43.6% |

Figure 3: Breakdown of side-by-side human preference on naturalness comparing VoiceCraft edited speech and the original speech. Grouped by edit type (left) and edit span length (right).
Table 6: On the zero-shot TTS task, comparing VoiceCraft with other models. Intell. = Intelligibility MOS; Nat. = Naturalness MOS; Spk. = Speaker Similarity MOS.

| Model | WER | SIM | Intell. Libri. | Intell. YouTube | Intell. Total | Nat. Libri. | Nat. YouTube | Nat. Total | Spk. Libri. | Spk. YouTube | Spk. Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| YourTTS | 6.6 | 0.41 | 3.28±0.11 | 3.01±0.12 | 3.14±0.08 | 2.99±0.12 | 2.59±0.12 | 2.79±0.08 | 3.10±0.12 | 2.49±0.12 | 2.79±0.09 |
| FluentSpeech | 3.5 | 0.47 | 3.70±0.11 | 3.65±0.12 | 3.67±0.08 | 3.34±0.11 | 3.43±0.12 | 3.38±0.08 | 4.10±0.09 | 3.92±0.11 | 4.01±0.07 |
| VALL-E | 7.1 | 0.50 | 4.05±0.09 | 3.94±0.10 | 4.00±0.07 | 3.85±0.10 | 3.86±0.10 | 3.86±0.07 | 4.12±0.10 | 4.02±0.10 | 4.07±0.07 |
| XTTS v2 | 3.6 | 0.47 | 4.29±0.09 | 3.97±0.10 | 4.13±0.07 | 4.02±0.09 | 3.90±0.10 | 3.96±0.07 | 3.64±0.12 | 3.25±0.12 | 3.44±0.08 |
| VoiceCraft | 4.5 | 0.55 | 4.38±0.08 | 4.08±0.10 | 4.23±0.06 | 4.16±0.08 | 4.18±0.09 | 4.17±0.06 | 4.35±0.08 | 4.33±0.09 | 4.34±0.06 |
| Ground Truth | 3.8 | 0.76 | 4.37±0.08 | 4.42±0.08 | 4.39±0.06 | 4.32±0.08 | 4.64±0.06 | 4.48±0.05 | 4.26±0.10 | 4.62±0.08 | 4.44±0.06 |

Metrics. For ablation studies, since the ground truth waveform is available, in addition to WER (using Whisper medium.en as the ASR model), we use mel-cepstral distortion (MCD), F0 distance (F0), and energy distance (Energy). These are all objective metrics, and their definitions are detailed in §C. For speech editing and zero-shot TTS evaluation, we use a combination of objective and subjective metrics. For the objective metrics, we used WER and speaker similarity (SIM), following prior works (Wang et al., 2023a; Kharitonov et al., 2023). SIM is calculated using WavLM-TDCNN (Chen et al., 2021b). WER and SIM are calculated on all 310 utterances in RealEdit, and on 250 utterances in the zero-shot TTS dataset. For our subjective evaluation, we used the Amazon Mechanical Turk platform to conduct human listening tests. For speech editing, the outputs of our model on all 310 utterances from RealEdit are evaluated by Turkers in terms of naturalness and intelligibility, using a 5-point Likert scale where 1 means poor and 5 means excellent. We also performed side-by-side A/B testing of VoiceCraft’s output against the original (non-edited) speech, as well as against the edited speech produced by FluentSpeech. In both cases, Turkers were asked to determine which utterance sounds more natural. The Turkers can choose either one of the two, or indicate that they are equally natural. Each evaluation received 5 ratings from 5 different Turkers. For zero-shot TTS, we randomly sampled 80 utterances (40 from LibriTTS and 40 from YouTube) from the original evaluation set, and asked Turkers to rate the naturalness, intelligibility, and speaker similarity of the generated speech to the reference prompt on a 5-point Likert scale. Each evaluation received 10 ratings. For all evaluations except the side-by-side comparison, the Mean Opinion Score (MOS) with a 95% confidence interval is reported. For the side-by-side comparison, we report the percentage of the time one model is preferred over the other. 64 and 59 Turkers participated in the speech editing and TTS evaluations, respectively. Please refer to §E for instructions and a description of the participants.
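To make the reported "MOS ± interval" entries concrete, the sketch below computes a mean opinion score with a 95% confidence half-width. The paper does not state how the interval was computed; a normal approximation (1.96 times the standard error of the mean) is assumed here, and the function name is hypothetical.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, half_width) for a list of 1-to-5 Likert ratings.

    Assumes a normal-approximation interval: z * sample_std / sqrt(n).
    """
    n = len(ratings)
    if n < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.fmean(ratings)
    sem = statistics.stdev(ratings) / math.sqrt(n)  # standard error of the mean
    return mean, z * sem


if __name__ == "__main__":
    mean, half_width = mos_with_ci([4, 5, 3, 4, 4])
    print(f"{mean:.2f} ± {half_width:.2f}")
```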

5.2 Ablations

In Table 3, we see that larger model sizes lead to better performance across all metrics. In addition, we see a bigger gap between the bigger models, indicating the potential of further scaling model (and possibly training data) sizes. Regarding the impact of codebook re-weighting, we see that weighting earlier codebooks more heavily leads to better performance on the intelligibility-related metrics WER and MCD, but worse performance on the prosody-related metrics F0 and Energy. (This can be regarded as a probing result that shows the properties of different codebooks in RVQ models. Since this is not the focus of our work, we do not conduct further experiments in this direction.) We choose the weights (5, 1, 0.5, 0.1) for our final 830M model because, anecdotally, we found that VoiceCraft is stronger in prosody than in intelligibility (similar properties of NCLMs are also reported by Jiang et al., 2023a; Song et al., 2024; Du et al., 2024b).

5.3 Speech Editing Results

Table 4 shows the results of the speech editing evaluation in terms of WER and human ratings of intelligibility and naturalness. VoiceCraft outperforms FluentSpeech on both intelligibility and naturalness MOS across different sources. Interestingly, FluentSpeech achieves a WER lower than the original recording (4.5 vs. 5.4), although its intelligibility MOS (3.97) is worse than that of both VoiceCraft (4.11) and the original recording (4.22). This suggests that the ASR model and human judgement diverge on FluentSpeech’s intelligibility. Anecdotally, we observe that FluentSpeech tends to produce dull and sometimes robotic speech (please refer to our demo page for examples), and we hypothesize that this type of speech tends to be more easily recognized by ASR, but is less intelligible to human ears. We notice the same phenomenon in our results on zero-shot TTS.

Human listeners rate LibriTTS’s naturalness lower than that of YouTube and Spotify on original speech (the TTS results are consistent with this). This suggests that, to better evaluate speech synthesis in general, the research community should consider evaluating on speech domains beyond audiobooks, rather than on audiobooks alone as is commonly done.

Table 5 presents the side-by-side utterance naturalness comparison of VoiceCraft vs. FluentSpeech and VoiceCraft vs. the original, unedited speech. We observe that VoiceCraft is preferred over FluentSpeech 56.1% of the time, and the two are tied an additional 19.7% of the time. This means that 75.9% of the time, human listeners think VoiceCraft produces speech that is equally or more natural than FluentSpeech. Impressively, human listeners judge the edited speech produced by VoiceCraft to be equally or more natural than the original unedited speech 56.4% of the time. Fig. 3 shows the breakdown of the side-by-side comparisons by edit type and edit span length. We see that, compared to the original speech, VoiceCraft performs consistently well across different edit types, but human listeners think its outputs are slightly less natural with longer edit span(s).

5.4 Zero-Shot TTS Results

Table 6 shows both objective and subjective evaluation on zero-shot TTS. We observe that VoiceCraft achieves the best results on both the automatic speaker similarity metric SIM and all human evaluation metrics. In particular, VoiceCraft is only slightly worse than ground truth in terms of intelligibility MOS (4.23 vs. 4.39) and speaker similarity MOS (4.34 vs. 4.44). The gap in naturalness between VoiceCraft and ground truth is larger (4.17 vs. 4.48), especially on YouTube utterances, which highlights the challenges of zero-shot TTS on noisy, in-the-wild data. The commercial model XTTS v2 comes second in terms of intelligibility and naturalness, and second to last on speaker similarity MOS. VALL-E achieves the second-best results on both the automatic metric SIM and the subjective metric speaker similarity MOS. Similarly to the speech editing results, ground truth YouTube utterances receive higher MOS scores than ground truth LibriTTS utterances in Table 6, which again suggests that we should consider using more diverse data for future speech synthesis model evaluation. Lastly, we again observe that FluentSpeech achieves a lower WER than the ground truth but receives much lower intelligibility MOS ratings from human listeners, indicating that WER could be misleading in evaluating the intelligibility of speech synthesis systems (we also tried Whisper Large-v3, which gets a WER of 4.1 for ground truth and 2.7 for FluentSpeech).

6 Conclusion

We introduce a neural codec language model VoiceCraft that achieves state-of-the-art performance on speech editing and zero-shot TTS on in-the-wild data. The key lies in an innovative token rearrangement procedure which enables efficient and effective autoregressive codec generation with bidirectional context. In addition, we introduce a first-of-its-kind high quality, challenging, and realistic speech editing dataset RealEdit, which we believe can reliably measure the practicality of speech editing models.

7 Limitations

Despite the advances made by VoiceCraft, there are still limitations. First and foremost are the long silences and scratching sounds that occasionally occur during generation. Although in this work we overcome them by sampling multiple utterances and selecting the shorter ones, more elegant and efficient methods are needed. Another important aspect is AI safety: how can we watermark and detect synthesized speech? While watermarking and deepfake detection have attracted increasing attention in the research community, and remarkable progress has been made (Zhang et al., 2020; Yamagishi et al., 2021; Chen et al., 2023; Roman et al., 2024), more advanced models such as VoiceCraft present new opportunities and challenges to safety research. To facilitate speech synthesis and AI safety research, we fully open source our codebase and model weights.

8 Ethical Implications

The speech synthesis model VoiceCraft introduced in this work has both positive and negative implications.

On the positive side, VoiceCraft holds the promise of significant benefits across several domains. For individuals with speech impairments or who have lost the use of their voice, VoiceCraft could be transformative, giving these individuals new ways to communicate with an ease and clarity that were previously not possible. Content creators, whether they work in education, video production, or podcasting, could leverage VoiceCraft to streamline their editing processes, making it easier to produce high-quality content without the need to re-record takes that contain a small mistake. Furthermore, VoiceCraft’s ability to handle diverse accents without compromising on quality opens up new possibilities for creating synthetic data. This could, in turn, enhance speech recognition systems, as explored with Voicebox (Le et al., 2023), by providing them with richer and more varied data to learn from, thereby improving their accuracy and accessibility to users worldwide.

However, the potential negative impacts of VoiceCraft cannot be overlooked. One of the primary concerns is the model’s potential to exacerbate existing biases, particularly those related to ethnicity. If not carefully monitored and corrected, these biases could lead to unequal performance across different groups, perpetuating and possibly even worsening existing disparities. Moreover, the ease with which voices can be cloned raises serious concerns about misuse, including impersonation and fraud. The ability to replicate someone’s voice with only a few seconds of reference audio could be exploited to commit crimes or spread misinformation, posing significant ethical and security challenges. As such, while the benefits of VoiceCraft are clear and substantial, it is imperative to approach its deployment with caution, ensuring that measures are in place to mitigate these risks and protect against potential misuse.

Despite the concerns regarding impersonation and fraud associated with VoiceCraft, there are compelling reasons to advocate for its release. Foremost among these is the opportunity it presents for the broader research community and technology developers to better understand and mitigate these negative impacts. By making these methods open source, we can catalyze the development of more robust countermeasures against the misuse of voice cloning technologies. This collaborative approach allows for the rapid identification of vulnerabilities and the exploration of innovative strategies to address them. Moreover, the authors of this work are fully committed to advancing the field responsibly. We are actively working on pioneering deepfake detection and watermarking algorithms specifically designed for synthetic speech. By doing so, we not only acknowledge the potential risks associated with our technology but also take concrete steps to ensure its ethical use. This dual approach of open collaboration and dedicated research into safeguarding mechanisms reflects our commitment to fostering a technological ecosystem where the benefits of voice cloning can be realized while minimizing its potential for harm.

9 Acknowledgements

We thank Ziyue Jiang for providing guidance in running inference with and scaling up the FluentSpeech model. We thank students at SALT Lab of UT Austin for helpful discussions. This work is supported in part by the National Science Foundation under Grant No. 2238605.

References

Appendix A Additional Experiments

A.1 Comparing ScaledAdam and AdamW

The hyperparameter settings of ScaledAdam can be found in Table 9. For AdamW (Loshchilov and Hutter, 2017), we tried 3 settings:

  • setting1: peak learning rate: 1e-5, batch size: 13.3 min, update steps: 500k

  • setting2: peak learning rate: 1e-4, batch size: 133.2 min (same as ScaledAdam), update steps: 80k

  • setting3: peak learning rate: 1e-4, batch size: 13.3 min, update steps: 500k

For all settings, we use a linear scheduler that linearly ramps up the learning rate to its peak over the first 8% of steps and linearly decays it afterwards. We use the common default values for the other hyperparameters, setting β1 = 0.9, β2 = 0.999, and weight decay = 0.01. All experiments are done on 4 A40 GPUs. Results are shown in Table 7. (We early-stopped AdamW setting 2 at step 57k to save compute, as it had already taken more time than the finished ScaledAdam job while performing worse.) We see that ScaledAdam achieves better performance on all metrics while using less compute. However, due to limited computational resources, we could not run an exhaustive hyperparameter search for AdamW, and therefore we do not over-generalize our finding here.
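The linear schedule used for the AdamW runs can be sketched as below. This is an assumption-laden illustration: the function name is hypothetical, and the learning rate is assumed to start at 0 and decay back to 0 at the final step, which the text does not specify.

```python
def linear_warmup_decay(step, total_steps, peak_lr, warmup_frac=0.08):
    """Learning rate at `step`: linear ramp to `peak_lr`, then linear decay.

    The ramp covers the first `warmup_frac` of training (8% in the text).
    Start and end values of 0 are assumptions of this sketch.
    """
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = total_steps - warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)


if __name__ == "__main__":
    # Setting 3 from the list above: peak 1e-4 over 500k update steps.
    for s in (0, 20_000, 40_000, 270_000, 500_000):
        print(s, linear_warmup_decay(s, 500_000, 1e-4))
```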

Table 7: ScaledAdam consistently outperforms AdamW across all metrics, while taking 10% less time to train.

| Optimizer | Setting | Training Time | WER | MCD | F0 | Energy |
|---|---|---|---|---|---|---|
| AdamW | lr=1e-5, bsz=13.3min, steps=500k | 262 hours | 16.45 | 8.91 | 196.15 | 5.94 |
| AdamW | lr=1e-4, bsz=133.2min, steps=57k | 273 hours | 10.77 | 8.45 | 117.38 | 4.91 |
| AdamW | lr=1e-4, bsz=13.3min, steps=500k | 262 hours | 7.58 | 8.32 | 82.73 | 3.70 |
| ScaledAdam | lr=3e-2, bsz=133.2min, steps=50k | 237 hours | 7.30 | 8.13 | 73.41 | 3.19 |

A.2 Breakdown of side-by-side human preference comparison.

The breakdown of the comparison between VoiceCraft and FluentSpeech is shown in Figure 4. We see that VoiceCraft outperforms FluentSpeech across the board, especially for substitution edits and for long edit spans.

Figure 4: Breakdown of side-by-side human preference on naturalness comparing VoiceCraft and FluentSpeech on speech editing. Grouped by edit type (left) and edit span length (right).

A.3 Spectrograms Comparison

Figure 5: Upper: FluentSpeech; lower: VoiceCraft.
Figure 6: Upper: FluentSpeech; lower: VoiceCraft.
Figure 7: Upper: FluentSpeech; lower: VoiceCraft. Note that since the speech is recorded in very challenging conditions, the word alignment is not very accurate. We see that for FluentSpeech’s result, since the entire mel-spectrogram is passed to HiFi-GAN for resynthesis, even the unedited speech contains high-frequency noise.

A spectrogram-level comparison between FluentSpeech and VoiceCraft is shown in Figures 5, 6, and 7, with the edited part marked by a dark green rectangle. The three examples have increasing difficulty in terms of accents and recording conditions; in particular, the example in Figure 7 appears to come from a low-bandwidth transmission. In all 3 examples, we see that VoiceCraft is able to generate more detailed frequency patterns. The corresponding audio can be found on the demo page.

Appendix B Examples of the Speech Editing Dataset RealEdit

Examples of RealEdit are shown in table 8.

Table 8: Examples of the speech editing dataset RealEdit.

| Edit Types | Original | Edited |
|---|---|---|
| substitution, substitution | See why it’s extremely valuable to it’s kind of like it’s kind of like having a wall hack to watch a demo. | See why it’s extremely important right? it’s kind of like having a rough time to watch a demo. |
| deletion | I wrote the title of the course many years ago, ah, when I created this course. | I wrote the title when I created this course. |
| insertion | Fast cars, that had the nice clothes, that had the money, they was criminals. | Fast cars, that had the nice clothes, that had expensive gold watches, that had the money, they was criminals. |
| substitution | When the CEO of blockbuster heard that, he promptly had a kitchen sink delivered to the netflix office, a fairly creative way of declaring war. | When the CEO of blockbuster heard that, he promptly had five hundred pounds of glitter divided into five thousand manilla envelopes delivered to the netflix office, a fairly creative way of declaring war. |
| substitution | So if you’ve been following my story, you will remember that I said earlier in this podcast that the Grammy nominations came out. | So if you’ve been following my story, you will remember that I said earlier that this week we had super exciting stuff to talk about because Grammy nominations came out. |
| insertion | No to the chemical pollution, air pollution, and the destruction of the environment caused by factories and the manufacturing industry. | No to the chemical pollution, air pollution, no to the killing of plants and wildlife and the destruction of the environment caused by factories and the manufacturing industry. |
| substitution, substitution | because we can include so many other characters if we just expand the definitions to any sword wielder, who’s a little spicy. | because we can include so many other participants if we are brave enough to expand the definitions to any blade wielder, who’s a little spicy. |
| insertion | So for more craziness now that French was conquered we have to join forces to Great Britain. | So for more craziness now that French was conquered by the Germans, we have to join forces to Great Britain. |
| substitution | economic development remains one of the most effective ways to increase the capacity to adapt to climate change. | economic development remains one of the most promising options that we have left on the table to increase the capacity to adapt to climate change. |
| insertion | And we’re at this point. | And we’re all extremely excited at this point. |
| insertion | Steve also co-founded pixar animation studios. Which has revolutionized the film industry in it’s short history with brilliant use of technology. | Steve also co-founded pixar animation studios. Which has revolutionized the film industry in it’s short history with films like toy story that showcase brilliant use of technology. |
| substitution, deletion | this is just so cozy up here, and having that skylight is just lovely isn’t it. | this is just so cozy and warm here, isn’t it. |
| substitution | It was a glance of inquiry, ending in a look of chagrin, with some muttered phrases that rendered it more emphatic. | It was a look of disgust followed by a curled lip, with some muttered phrases that rendered it more emphatic. |
| substitution | More of a base and infrastructure to tell those stories rather than doing it out of a out of a tent with solar power. | More of a base and infrastructure to fight these battles instead of out of a tent with solar power. |

Appendix C Implementational Details

The Encodec model. The Encodec model we use has a stride of 320 samples, which means the codec framerate is 50Hz for recordings with a sample rate of 16kHz. The base dimension is 64, doubling at each of the 5 convolutional layers in the encoder. Following Copet et al. (2023), we use the open-sourced audiocraft repo (see its Encodec training documentation) for Encodec model training. The model is trained on 1-second speech segments sampled from Gigaspeech for a total of 160 epochs (320k steps) with a batch size of 240, using the Adam optimizer (Kingma and Ba, 2014) with a base learning rate of 3e-4.
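The arithmetic linking stride, sample rate, and frame counts can be checked with a few lines. This is only a worked example of the numbers quoted in the text; the constant and function names are hypothetical.

```python
SAMPLE_RATE_HZ = 16_000   # sample rate of the recordings
STRIDE_SAMPLES = 320      # Encodec stride, in waveform samples

# One codec frame is produced per stride, so frames per second = rate / stride.
frame_rate_hz = SAMPLE_RATE_HZ / STRIDE_SAMPLES


def frames_to_seconds(num_frames):
    """Duration in seconds covered by `num_frames` codec frames."""
    return num_frames / frame_rate_hz


# 320k training steps over 160 epochs imply this many steps per epoch.
steps_per_epoch = 320_000 // 160

if __name__ == "__main__":
    print(frame_rate_hz)            # codec frames per second
    print(frames_to_seconds(600))   # longest masked span used in training
    print(steps_per_epoch)
```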

Eden Scheduler (Yao et al., 2024). The scheduler adjusts the learning rate $\alpha_t$ at step $t$ using the following formula:

$$
\begin{aligned}
\alpha_t = {} & \alpha_{\text{base}} \cdot \left(\frac{t^{2}+\alpha_{\text{step}}^{2}}{\alpha_{\text{step}}^{2}}\right)^{-0.25} \\
& \cdot \left(\frac{e^{2}+\alpha_{\text{epoch}}^{2}}{\alpha_{\text{epoch}}^{2}}\right)^{-0.25} \\
& \cdot \text{linear}(\alpha_{\text{start}}, t_{\text{warmup}}, t).
\end{aligned}
$$

Here $\alpha_{\text{base}}$ is the base learning rate, $t$ is the step index, $e$ is the epoch index, and $\alpha_{\text{step}}$ and $\alpha_{\text{epoch}}$ control the amount of data the model has seen before the learning rate is significantly reduced. $\text{linear}(\alpha_{\text{start}}, t_{\text{warmup}}, t)$ linearly increases from $\alpha_{\text{start}}$ to 1 over $t_{\text{warmup}}$ steps, and then stays at 1. In our experiment, we set

$$
\alpha_{\text{base}} = 0.05,\quad \alpha_{\text{step}} = 3000,\quad \alpha_{\text{epoch}} = 4,\quad \alpha_{\text{start}} = 0.5,\quad t_{\text{warmup}} = 500.
$$

Since our dataset is quite large, we use pseudo-epochs instead of actual epochs, and 1 pseudo-epoch is set to be 3000 training steps. Note that the choice of these hyperparameters is inspired by Yao et al. (2024) and Li (2023); if computational resources permitted, a grid search might find better hyperparameter settings.
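The Eden formula and the settings above translate directly into a small function. The sketch below is written from the formula as stated, not taken from the released code; the function names are hypothetical, and the pseudo-epoch index is assumed to be zero-based (step // 3000).

```python
def eden_lr(step, epoch, base_lr=0.05, lr_steps=3000, lr_epochs=4,
            warmup_start=0.5, warmup_steps=500):
    """Learning rate at a given step and (pseudo-)epoch under the Eden schedule."""
    # Step-based decay: ((t^2 + a_step^2) / a_step^2) ^ -0.25
    step_factor = ((step ** 2 + lr_steps ** 2) / lr_steps ** 2) ** -0.25
    # Epoch-based decay: ((e^2 + a_epoch^2) / a_epoch^2) ^ -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    # Linear warmup from warmup_start to 1 over warmup_steps, then constant 1.
    if step >= warmup_steps:
        warmup = 1.0
    else:
        warmup = warmup_start + (1.0 - warmup_start) * step / warmup_steps
    return base_lr * step_factor * epoch_factor * warmup


def pseudo_epoch(step, steps_per_pseudo_epoch=3000):
    """Zero-based pseudo-epoch index (the indexing convention is an assumption)."""
    return step // steps_per_pseudo_epoch


if __name__ == "__main__":
    for t in (0, 500, 3000, 25_000, 50_000):
        print(t, eden_lr(t, pseudo_epoch(t)))
```

With these settings the rate starts at half of the base value, peaks around the end of warmup, and then decays smoothly as both the step and pseudo-epoch terms grow.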

Configuration in ablation studies. The configurations of the different models are shown in Table 9. Note that we use a base learning rate of 3e-2 for the 430M model instead of 5e-2 because the latter gave a NaN error.

| Params | Codebook dim | Trm hidden dim | FFN dim | Trm layers | Base LR | Update steps |
|---|---|---|---|---|---|---|
| 830M | 2048 | 2048 | 8192 | 16 | 5e-2 | 50k |
| 430M | 2048 | 2048 | 8192 | 8 | 3e-2 | 50k |
| 120M | 1024 | 1024 | 4196 | 8 | 5e-2 | 50k |

Table 9: Hyperparameter settings for the different model sizes. Trm stands for Transformer.

Task and Data for ablation studies. The evaluation task is masked reconstruction: for each utterance, we randomly select a span of 1 to 15 words to mask, and ask VoiceCraft to reconstruct the masked speech based on the transcript and the unmasked speech. We use a 1000-utterance random subset of the Gigaspeech validation set, which contains YouTube videos and podcast data. We ensure that each utterance in the subset has a WER lower than 15% when decoded by Whisper medium.en (Radford et al., 2022).
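The WER filter mentioned above relies on the standard word error rate, i.e. the word-level edit distance divided by the number of reference words. The sketch below shows that computation and the 15% threshold; the text normalization used in the paper is not specified, so lowercasing plus whitespace splitting is an assumption, and the function names are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / max(1, len(ref))


def keep_utterance(reference, asr_output, max_wer=0.15):
    """Keep an utterance only if its ASR transcript has WER below the threshold."""
    return word_error_rate(reference, asr_output) < max_wer


if __name__ == "__main__":
    print(word_error_rate("a b c d", "a x c"))
    print(keep_utterance("the cat sat", "the cat sat"))
```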

Metrics for ablation studies. Since ground truth is available for the masked reconstruction evaluation, in addition to WER (measured from Whisper medium.en’s output), we also measure the mel-cepstral distortion (MCD) (Kubichek, 1993), F0 distance (F0), and energy distance (Energy). WER and MCD are better correlated with the intelligibility of the speech, while F0 and Energy are better correlated with the prosody similarity between the generated speech and the ground truth. MCD measures the difference in Mel Frequency Cepstrum Coefficients (MFCC) between the generated speech and the ground truth, defined as

$$
\text{MCD} = \frac{10}{\ln 10}\sqrt{\frac{1}{2}\sum_{i=1}^{L}\left(m^{g}_{i}-m^{r}_{i}\right)^{2}}
$$

where $L$ is the order of the MFCC, which we set to 13. $m^{g}_{i}$ is the $i$-th MFCC of the ground truth recording and $m^{r}_{i}$ is the $i$-th MFCC of the generated speech. We use the pymcd package (https://github.com/chenqi008/pymcd) for calculating MCD. For F0 estimation, we use the pYIN algorithm (Mauch and Dixon, 2014) implemented in librosa (McFee et al., 2015) with a minimum frequency of 80 Hz and a maximum frequency of 600 Hz. For energy calculation, we use the root mean square of the spectrogram magnitude, where the spectrogram is extracted using a short-time Fourier transform with a window length of 640 and a hop size of 160. Note that since generated speech might have a different length than the ground truth, dynamic time warping is first applied to time-align the extracted MFCC/F0/energy sequences before calculating their Euclidean distances. For each model in the ablation study, we use 3 different random seeds and report the averaged results.
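The MCD formula and the alignment step can be illustrated with a dependency-free sketch. It is not the pymcd implementation: the function names are hypothetical, the dynamic time warping is a plain quadratic-time version over Euclidean frame distances, and averaging the per-frame MCD over the aligned pairs is an assumption of this sketch.

```python
import math


def frame_mcd(m_g, m_r):
    """MCD between two MFCC vectors, following the formula in the text."""
    squared = sum((g - r) ** 2 for g, r in zip(m_g, m_r))
    return (10.0 / math.log(10.0)) * math.sqrt(0.5 * squared)


def dtw_path(seq_a, seq_b):
    """Return aligned (i, j) index pairs minimizing the summed Euclidean distance."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        options = [(cost[i - 1][j - 1], i - 1, j - 1),
                   (cost[i - 1][j], i - 1, j),
                   (cost[i][j - 1], i, j - 1)]
        _, i, j = min(options, key=lambda o: o[0])
    path.reverse()
    return path


def mean_mcd(mfcc_g, mfcc_r):
    """Average per-frame MCD over the DTW-aligned frame pairs (assumed aggregation)."""
    path = dtw_path(mfcc_g, mfcc_r)
    return sum(frame_mcd(mfcc_g[i], mfcc_r[j]) for i, j in path) / len(path)


if __name__ == "__main__":
    ground_truth = [[0.0, 0.0], [1.0, 1.0]]
    generated = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]]  # same content, one extra frame
    print(dtw_path(ground_truth, generated))
    print(mean_mcd(ground_truth, generated))
```

The same alignment path could be reused for the F0 and energy sequences, since the text applies dynamic time warping to all three features before measuring distances.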

Scaling FluentSpeech. The original FluentSpeech (Jiang et al., 2023b) is trained on LibriTTS, and we made our best effort to scale it for a fair comparison. Taking guidance from the authors of FluentSpeech, we scale the batch size from 16 utterances to 256 utterances, the diffusion base hidden dimension from 320 to 1024, the residual layers from 20 to 30, and the residual channels from 256 to 512. The final model contains 330M parameters, which is roughly the same as the Voicebox model (Le et al., 2023). The model was trained on the Gigaspeech training set on 1 A40 GPU for 626k steps, which took 10 days. The HiFi-GAN vocoder is also retrained on the Gigaspeech training set for 400k steps using the hyperparameters used in Voicebox (Le et al., 2023), which also uses HiFi-GAN as the vocoder to decode to 16kHz speech.

Baselines for zero-shot TTS. For zero-shot TTS, we compare VoiceCraft with VALL-E (Wang et al., 2023a), XTTS v2 (COQUI, 2023), YourTTS (Casanova et al., 2021), and FluentSpeech. Since the original VALL-E is not open-sourced, we use the code from the popular open-source implementation by Li (2023), and also trained the model on Gigaspeech. Both the AR and NAR models are trained for 50k steps using the ScaledAdam optimizer and Eden scheduler, the same as for VoiceCraft. The commercial model XTTS v2 is composed of three modules: a VQ-VAE (van den Oord et al., 2017) for speech tokenization, a GPT-2 (Radford et al., 2019) model for speech token modeling, and a customized HiFi-GAN (Kong et al., 2020) model for token-to-waveform generation. XTTS v2 is trained on a mixture of publicly available data and web-crawled data, but the exact data sources are unknown. YourTTS is a zero-shot TTS model built upon the adversarial VAE model VITS (Kim et al., 2021), with novel zero-shot multi-speaker and multilingual training. The model is trained on VCTK, LibriTTS, and also French and Portuguese corpora. The FluentSpeech model we used for TTS is the same as in speech editing, as the model can be configured to do zero-shot TTS similarly to Voicebox (Le et al., 2023).

Licenses of the speech corpora. LibriTTS: CC BY 4.0; Gigaspeech: Apache-2.0; Spotify Podcast dataset: CC BY 4.0.

Appendix D The Conditional Independence Assumption

To better explain the rationale behind the conditional independence assumption in equation 3, we go back to the sequence $Y$ produced by causal masking. The assumption we make for equation 3 to hold is equivalent to assuming that, given $W$ and $H_{s,t}$, $Y_{s,t,k}$ is independent of $I_{s,t,k}^{(1)}$ and $I_{s,t,k}^{(2)}$, defined as

$$
\begin{aligned}
I_{s,t,k}^{(1)} &\triangleq (Y_{s,t+k-1,1}, Y_{s,t+k-2,2}, \cdots, Y_{s,t+1,k-1}) \\
I_{s,t,k}^{(2)} &\triangleq (Y_{s,t-1,k+1}, Y_{s,t-2,k+2}, \cdots, Y_{s,t-K+k,K})
\end{aligned}
$$

We argue that this assumption is mild, because 1) $I_{s,t,k}^{(1)}$ consists of tokens from timesteps after $t$ and therefore should have less impact on the distribution of $Y_{s,t,k}$ given the past tokens $H_{s,t}$ ($H_{s,t}$ might also contain future tokens in physical time if $Z_{s,t}$ is in the masked spans); 2) although $I_{s,t,k}^{(2)}$ consists of tokens from timesteps before $t$, they come from codebooks that are later than codebook $k$ in the residual quantization chain, meaning that they model the residual left by codebook $k$ (at the corresponding timesteps).
Since $\{Y_{s,t-1,k}, Y_{s,t-2,k+1}, \cdots, Y_{s,t-K+k,K-1}\} \subset H_{s,t}$¹¹, the “fitted parts” are given, and therefore the “unfitted parts” (the residuals) should have only a minor impact on the distribution of $Y_{s,t,k}$. Empirically, MusicGen shows that a codec language model trained with the Delay Pattern enjoys the efficiency of the naive parallel pattern, while achieving modeling performance similar to that of a completely flattened sequence.

¹¹ A weaker condition holds for the first $K$ tokens in unmasked spans (which account for at most 0.08s of speech for our models), but we omit the discussion here for simplicity.
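To make the two index sets concrete, the sketch below enumerates the (timestep, codebook) indices in $I_{s,t,k}^{(1)}$ and $I_{s,t,k}^{(2)}$ for a fixed span $s$. It also checks them against one column of a delayed stack, under the assumption (not restated in this appendix) that delayed stacking shifts codebook $j$ by $j-1$ timesteps. The function names are hypothetical and chosen only for illustration.

```python
def independence_sets(t, k, K):
    """Return (I1, I2) as lists of (timestep, codebook) pairs for token Y[t, k].

    I1: codebooks 1..k-1, taken at timesteps after t.
    I2: codebooks k+1..K, taken at timesteps before t.
    Codebooks are 1-indexed, as in the equations above.
    """
    i1 = [(t + k - j, j) for j in range(1, k)]
    i2 = [(t + k - j, j) for j in range(k + 1, K + 1)]
    return i1, i2


def delayed_column(step, K):
    """Tokens stacked at one delayed step, ASSUMING codebook j is delayed by j-1."""
    return [(step - j + 1, j) for j in range(1, K + 1)]


if __name__ == "__main__":
    t, k, K = 10, 2, 4
    i1, i2 = independence_sets(t, k, K)
    print(i1)  # [(11, 1)]
    print(i2)  # [(9, 3), (8, 4)]
    # Under the assumed delay, Y[t, k] sits in column t + k - 1 together
    # with exactly the tokens of I1 and I2.
    print(delayed_column(t + k - 1, K) == i1 + [(t, k)] + i2)  # True
```

Under that assumption, $I_{s,t,k}^{(1)}$ and $I_{s,t,k}^{(2)}$ are exactly the other tokens that share a delayed step with $Y_{s,t,k}$, which is why the assumption concerns independence from them specifically.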

Appendix E Instructions for human listening test

Screenshots of the instructions for the human listening tests we ran on Amazon Mechanical Turk are shown in figure 8 (speech editing - intelligibility), figure 9 (speech editing - naturalness), figure 10 (speech editing - side-by-side comparison), figure 11 (zero-shot TTS - intelligibility), figure 12 (zero-shot TTS - speaker similarity), and figure 13 (zero-shot TTS - naturalness). For speech editing evaluation, 64 Turkers participated and we paid 474.3 USD in total; for zero-shot TTS evaluation, 59 Turkers participated and we paid 457.6 USD. We only allowed Turkers who are residents of the U.S. to do the tasks, with the goal of increasing the probability that the Turkers are native English speakers. We acknowledge that this is an imperfect approach and might lead to bias in judgement, but since Amazon Mechanical Turk doesn’t allow selection on native language, this is the best proxy we could think of for constraining the native language.

Figure 8: Instruction for speech editing-intelligibility preference. Each task contains 5 recordings. Since the first paragraph is also presented in all other tasks in the instruction page, we only show it in this screenshot.
Figure 9: Instruction for speech editing-naturalness preference. Each task contains 5 recordings.
Figure 10: Instruction for speech editing, side-by-side naturalness preference. Each task contains 3 pairs of recordings.
Figure 11: Instruction for zero-shot TTS, intelligibility preference. Each task contains 5 recordings.
Figure 12: Instruction for zero-shot TTS, speaker similarity preference. Each task contains 3 pairs.
Figure 13: Instruction for zero-shot TTS, naturalness preference. Each task contains 5 recordings.