vTTS: Visual-Text To Speech
The University of Tokyo, Japan; Nara Institute of Science and Technology, Japan
This paper proposes visual-text to speech (vTTS), a method for synthesizing speech from visual text (i.e., text as an image). Conventional TTS converts phonemes or characters into discrete symbols and synthesizes a speech waveform from them, thus losing the visual features that the characters essentially [...]
[Figure 2: Overall architecture of the proposed method.]
[Figure 3: Example of visual text. Structure and combination of sub-characters determine the overall reading (top). Emphatic (middle) and typeface-decorated (bottom) text evoke emphasis and emotion, respectively.]

[...] proposed method. Section 4.1 notes the results of preliminary experiments on this change.
Some studies use information other than text, such as reference voices [12], face images [13], and articulation [14]. These methods have an additional encoder for such information. However, a similar architecture is not suitable for vTTS. Since visual text is not discrete, as described in Section 1, text and visual information should not be treated separately. Therefore, we propose a model that jointly treats text and visual information in Section 3.

Although a character is an input unit of speech synthesis that does not rely on language knowledge, some studies use other units, e.g., phrase [15], word [16], subword [17], and byte [18], in descending order of length. In this paper, we use a pixel, which is a smaller input unit than ever before. We can observe a similar trend in natural language processing, as described below.

2.2. Visual text for natural language processing (NLP)

In NLP, the input unit of language models and machine translation models has been examined. Starting with phrases [19] and words [20] and moving through subwords [21], characters [22], and sub-characters [23], some research proposes the use of pixels (i.e., visual text).

There is one study that extracted character embeddings from visual text [24]. This study utilized logogram compositionality, allowing for better processing of rare words. Similarly, there are studies that have used character compositionality in a word segmentation task [25] or a sentiment classification task [26]. In addition, the use of sliced visual text, which is analogous to subword text symbols, leads to greater robustness against various types of noise [27]. These studies mainly used a CNN to extract the visual information from visual text.

These studies are based on the compositionality of logograms and acquire meaning-related features. In contrast, our study aims to acquire the sound-related features contained in characters, especially in phonograms.

3. Method

3.1. Architecture comparison between vTTS and TTS

As for the architecture of the proposed vTTS, we replace the character embedding layer (i.e., the text feature extractor) of the original FastSpeech2 with a visual feature extractor (Figure 2(a)); a sketch of this change is given below. This replacement allows visual features to be extracted from visual text. The subsequent encoder, variance adapter, and decoder are the same as in the original FastSpeech2. To investigate the ideal performance of vTTS, in this paper we use artificially generated visual text rather than text in in-the-wild images (e.g., advertisements and comics).
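To make the architectural change concrete, the following minimal PyTorch sketch shows a FastSpeech2-like model in which the character embedding table is swapped for a visual feature extractor while the rest of the pipeline is left untouched. The class and module names (FastSpeech2Like, VisualFeatureExtractor, and the encoder/variance adaptor/decoder arguments) are illustrative placeholders, not names from the authors' code or the cited implementation.

# Minimal sketch (not the authors' code): swapping the character embedding
# frontend of a FastSpeech2-like model for a visual feature extractor.
import torch.nn as nn

class FastSpeech2Like(nn.Module):
    def __init__(self, encoder, variance_adaptor, decoder,
                 vocab_size=80, hidden_dim=256, visual_extractor=None):
        super().__init__()
        if visual_extractor is None:
            # Conventional TTS frontend: discrete character IDs -> embeddings.
            self.frontend = nn.Embedding(vocab_size, hidden_dim)
        else:
            # vTTS frontend: sliced character images -> visual features.
            self.frontend = visual_extractor
        # Encoder, variance adapter, and decoder are unchanged.
        self.encoder = encoder
        self.variance_adaptor = variance_adaptor
        self.decoder = decoder

    def forward(self, frontend_input):
        # frontend_input: character IDs (B, n) for TTS, or
        # sliced images (B, n, 1, h, w*c) for vTTS.
        features = self.frontend(frontend_input)   # (B, n, hidden_dim)
        enc_out = self.encoder(features)
        adapted = self.variance_adaptor(enc_out)
        return self.decoder(adapted)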
3.2. Generating visual text

The first step is to transform text into grayscale sliced images. The overall process is illustrated in Figure 2(c). Since the original FastSpeech2 is a non-autoregressive TTS model utilizing a duration-based upsampler, we must take the temporal alignment between visual text and the speech feature sequence into account. Therefore, we use visual text with monospace fonts in this work. Each character is of a specified width w, height h, and font size fs. Characters of length n are therefore transformed into an image whose width is nw and whose height is h. We extract sliced images from it using sliding windows, similar to a previous study [27]. The window is of a specified width wc and height h, extracted at intervals s (= w). Here, c represents the number of characters in one sliced image, and adjusting c changes how many preceding and following characters are considered when obtaining visual features from each piece of visual text. For example, if c = 3, convolution is performed including one character before and after. By considering the preceding and following characters, readings and prosody can be acquired that depend on the neighboring characters. Blank images are padded to the left and right edges so that a sliced image can be created for every character. Finally, we generate n sliced images of height h and width wc from the text.
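The slicing step can be illustrated with a short NumPy sketch. It assumes a grayscale image of shape (h, n*w) rendered with a monospace font, pads c//2 blank character cells on each side, and extracts one window of width w*c per character at stride s = w. The function and variable names are illustrative only.

import numpy as np

def slice_visual_text(image, n, w, c):
    """Cut a rendered text image (h, n*w) into n windows of width w*c (sketch).

    image: 2D grayscale array of the whole rendered text.
    n:     number of characters in the text.
    w:     width of one character cell (monospace).
    c:     number of characters covered by one window (odd, e.g., 1, 3, 5).
    """
    pad = (c // 2) * w
    # Pad blank (white) cells so every character gets a full-width window.
    padded = np.pad(image, ((0, 0), (pad, pad)), constant_values=255)
    slices = []
    for i in range(n):                         # stride s = w: one window per character
        start = i * w
        slices.append(padded[:, start:start + w * c])
    return np.stack(slices)                    # shape: (n, h, w*c)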
3.3. Visual features

We extract visual features from the sliced images generated in Section 3.2. The basic TTS uses the output of a text feature extractor; vTTS instead uses the output of a visual feature extractor. The architecture of the extractor is shown in Figure 2(b) and is inspired by a machine translation study using visual text [27]. The extractor mainly comprises 2D convolution, 2D batch normalization [28], and the ReLU activation function [29]. Finally, we apply a linear layer to the reshaped output to obtain a visual feature for each character. The visual features are fed to the FastSpeech2-inspired encoder; a minimal code sketch of such an extractor is given after the list below.

The visual feature extractor utilizes the following information. Figure 3 shows examples of visual text carrying each type of information.

• Compositionality: In phonetic languages (e.g., Korean), the character or the combination of sub-characters determines the overall reading. The visual feature extractor acquires the correspondence between a character (or sub-character) and its reading. Therefore, even if OOV or rare characters emerge, we expect our vTTS to accurately predict their readings using the visual features of the characters or sub-characters.

• Emphasis attribute: The visual feature extractor distinguishes emphasized typefaces (underline, bold, italic) from normal typefaces. This enables vTTS to synthesize word-emphasized speech from emphasized typefaces alone.
• Emotional attribute: The visual feature extractor acquires the correspondence between typeface and emotion, and it reflects the emotion in speech in accordance with the typeface.
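As referenced above, the following is a minimal PyTorch sketch of a visual feature extractor of this kind (2D convolution, 2D batch normalization, ReLU, then a linear layer on the reshaped output). The hyperparameters follow those stated in Section 4.1 (single-channel convolution, kernel size 3, padding 1, stride 1, 256-dimensional output), but the class name and the number of convolutional blocks are assumptions made for illustration, not the authors' released code.

import torch.nn as nn

class VisualFeatureExtractor(nn.Module):
    """Sliced character images -> one feature vector per character (sketch)."""

    def __init__(self, slice_height=30, slice_width=30, out_dim=256):
        super().__init__()
        # Single-channel 2D conv blocks with kernel 3, stride 1, padding 1,
        # each followed by 2D batch normalization and ReLU.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(1),
            nn.ReLU(),
            nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(1),
            nn.ReLU(),
        )
        # Linear layer applied to the reshaped (flattened) conv output.
        self.proj = nn.Linear(slice_height * slice_width, out_dim)

    def forward(self, slices):
        # slices: (batch, n_chars, 1, h, w*c)
        b, n, ch, h, w = slices.shape
        x = self.conv(slices.reshape(b * n, ch, h, w))    # (b*n, 1, h, w*c)
        x = x.reshape(b * n, -1)                          # flatten per slice
        return self.proj(x).reshape(b, n, -1)             # (b, n, out_dim)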
4. Experiments and Results

4.1. Experimental Setup

Language. We evaluated the performance of the proposed method on three phonetic languages: Japanese (Hiragana), Korean (Hangul), and English (Roman alphabet). For Japanese, we converted Katakana and Kanji into 80 Hiragana characters in advance; in Hiragana, one character corresponds to a specific sound. For Korean, we used 1226 Hangul characters; in Hangul, the combination of jamo determines the sound. For English, we applied lowercasing to facilitate training.

Dataset. We used the JSUT [30] (Japanese), KSS [31] (Korean), and LJSpeech [32] (English) corpora for the naturalness comparison of vTTS and TTS described in Section 4.2.1. The training (validation) data sizes for these languages were 8.3 (0.61), 9.0 (0.38), and 19 (0.89) hours, respectively. The test data size was 100 samples for all languages. We used the JECS corpus (see our project page), which contains noun-emphasized Japanese speech, for the emphasis experiments described in Section 4.2.2. Its training, validation, and test data sizes were 0.412 hours, 0.03 hours, and 50 samples, respectively. We used the manga2voice corpus [33], which contains emotional speech, for the emotion experiment described in Section 4.2.3. Its training, validation, and test data sizes were 0.062 hours, 0.003 hours, and 50 samples, respectively. Since the training data sizes of the JECS and manga2voice corpora were small, we fine-tuned a vTTS model pretrained on the JSUT corpus. We used the KSS corpus in Section 4.2.4; its data sizes were almost the same as, but slightly different from, those in Section 4.2.1 (see Section 4.2.4 for details). All audio was downsampled to 22.05 kHz. The alignment between visual text and acoustic features was obtained through forced alignment of the visual text with a character sequence, and of the character sequence with the acoustic feature sequence.

Visual text. We used pygame (footnote 3) to generate visual text from text. For Japanese and English, we used the IPA typeface; for Korean, the Gowun Batang typeface. Visual text of w = 30 and h = 30 was used for all languages. fs was set to 15 for Japanese and Korean and to 20 for English. These values were chosen so that the text is rendered at an appropriate size within the image.
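The paper does not include its rendering code; the following is a minimal sketch of how monospace visual text could be rendered with pygame under the settings above (w = h = 30, fs = 15 or 20). The font file path is a placeholder, and the per-character blitting into fixed-width cells is an assumption used here to enforce the monospace layout.

import numpy as np
import pygame

def render_visual_text(text, font_path, w=30, h=30, fs=15):
    """Render `text` as a grayscale image of shape (h, len(text)*w) (sketch)."""
    pygame.init()
    font = pygame.font.Font(font_path, fs)     # e.g., an IPA or Gowun Batang .ttf
    canvas = pygame.Surface((w * len(text), h))
    canvas.fill((255, 255, 255))               # white background
    for i, ch in enumerate(text):
        glyph = font.render(ch, True, (0, 0, 0))           # black glyph
        # Center each glyph inside its fixed-width character cell.
        x = i * w + (w - glyph.get_width()) // 2
        y = (h - glyph.get_height()) // 2
        canvas.blit(glyph, (x, y))
    rgb = pygame.surfarray.array3d(canvas)     # (width, height, 3)
    gray = rgb.mean(axis=2).T                  # transpose to (h, width)
    return gray.astype(np.uint8)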
Model configuration. The model size and hyperparameters of the original FastSpeech2 and of the FastSpeech2-inspired architecture in vTTS followed the open-source implementation [34] (footnote 4). In the visual feature extractor, the 2D convolution used only one channel, with a padding of 1, a kernel size of 3, and a stride of 1 [27]. The dimension of the output of the visual feature extractor was set to 256, the same as the encoder hidden dimension of FastSpeech2. We generated speech waveforms from mel spectrograms using the pretrained HiFi-GAN [35] (footnote 5).
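Putting the pieces together, a usage sketch under the stated configuration might look as follows. render_visual_text, slice_visual_text, and VisualFeatureExtractor are the illustrative helpers sketched earlier in this document, not functions from the cited implementation, and the font path is a placeholder.

import torch

# Hypothetical feature extraction for one Japanese utterance with c = 3.
text = "こんにちは"
w, h, fs, c = 30, 30, 15, 3

image = render_visual_text(text, "ipa_gothic.ttf", w=w, h=h, fs=fs)   # (h, n*w)
slices = slice_visual_text(image, n=len(text), w=w, c=c)              # (n, h, w*c)

extractor = VisualFeatureExtractor(slice_height=h, slice_width=w * c, out_dim=256)
batch = torch.from_numpy(slices).float().unsqueeze(0).unsqueeze(2)    # (1, n, 1, h, w*c)
features = extractor(batch)                                           # (1, n, 256)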
Preliminary experiment. As described in Section 2.1, we used characters instead of phonemes as the input unit for FastSpeech2. We conducted a subjective evaluation of this substitution under the same conditions as in Section 4.2.1. The results showed no significant difference in naturalness for Japanese and English. On the other hand, the use of phonemes was better than that of characters and visual text for Korean. This comparison is not the main scope of this paper, but a deeper investigation will be our future work.

4.2. Results

4.2.1. Naturalness comparison between vTTS and TTS

To compare our vTTS with conventional TTS, we conducted a mean opinion score (MOS) evaluation on the naturalness of the synthetic speech. We used three settings for vTTS, c = 1, 3, 5, to investigate the impact of the window size. Native speakers of each language participated in the evaluation. 150 and 200 speakers recruited using our crowdsourcing system listened to 20 speech utterances in Japanese and English, respectively. In contrast, 10 speakers recruited by snowball sampling (footnote 6) listened to 160 speech utterances in Korean. The text content was kept consistent among the different methods so that all listeners examined only naturalness without other interfering factors.

Table 1: Results of MOS and 95% confidence interval for naturalness. Bold means a significantly better score than TTS. For all languages, vTTS was comparable to or better than TTS in naturalness.

Lang.   TTS            vTTS (c=1)     vTTS (c=3)     vTTS (c=5)
ja      3.45 ± 0.09    3.41 ± 0.09    3.46 ± 0.09    3.49 ± 0.10
ko      3.04 ± 0.16    3.55 ± 0.15    3.18 ± 0.15    3.01 ± 0.15
en      3.72 ± 0.10    3.69 ± 0.10    3.70 ± 0.11    3.71 ± 0.10

The results are shown in Table 1. For all languages, the naturalness of vTTS was comparable to or better than that of TTS. Comparing the highest score of vTTS (c = 5 for Japanese and English, c = 1 for Korean) with TTS, although there was no significant difference for Japanese and English, there was a significant difference for Korean (p < 0.05). This means that the extraction of visual features was effective for Hangul, whose readings are determined by the combination of sub-characters. It can also be seen that the appropriate window size differed depending on the language. For English, there was no significant difference between c = 1, 3, 5. However, for Japanese, the naturalness with a sliding window of c = 5 was slightly better than that with c = 1 (p < 0.10), while for Korean, c = 1 was significantly better than c = 5 (p < 0.001). One possible reason is that the number of phonemes expressed by one character differs depending on the language. The following experiments used the best c value for each language.

4.2.2. Emphasis attribute

We evaluated whether our vTTS can transfer the emphasis attribute in visual text to speech. We trained the vTTS model using pairs of word-emphasized speech and visual text. We used three types of word emphasis (underline, bold, and italic), as shown in Figure 3(b), and trained the vTTS model on each type, i.e., three vTTS models were trained. The trained models try to synthesize speech with emphasis placed on specified words. We prepared synthetic speech not only for emphasis groups (“Emphasis (*)”) but also for control groups (“Control (*)”). Their difference lies in the input visual text: word-emphasized visual text in the emphasis groups and non-emphasized text in the control groups.

For the listening test, listeners listened to speech randomly selected from seven kinds of speech utterances (three emphasis groups, three control groups, and ground-truth natural speech). The listeners answered which word was emphasized, from [...]

Footnotes:
3 https://www.pygame.org/news
4 https://github.com/Wataru-Nakata/FastSpeech2-JSUT
5 https://github.com/jik876/hifi-gan
6 We could not collect enough native listeners in our crowdsourcing system.
Table 2: Accuracy of perceived emphasized words. Our method can properly reflect the emphasis attribute in visual text in speech.

Speech                  Accuracy
Ground truth            0.960
Emphasis (underline)    0.933
Emphasis (bold)         0.898
Emphasis (italic)       0.877
Control (italic)        0.505
Control (bold)          0.400
Control (underline)     0.381

[Table 3: Confusion matrix between reference (true *) and perceived (pred *) emotion. Our method can properly reflect emotion attributes in visual text in synthetic speech.]

Table 4: Results of MOS and 95% confidence interval on naturalness. ∆ denotes the MOS decrease from “in-vocab.” Our method can synthesize more natural speech than TTS.

        in-vocab       rare (∆)               OOV (∆)
TTS     3.29 ± 0.16    2.32 ± 0.16 (−0.97)    2.31 ± 0.20 (−0.98)
vTTS    3.58 ± 0.13    3.12 ± 0.16 (−0.46)    2.95 ± 0.21 (−0.63)

Table 5: Results on CER. ∆ denotes the CER increase from “in-vocab.” Our method can synthesize more intelligible speech than TTS.

        in-vocab    rare (∆)          OOV (∆)
TTS     0.120       0.194 (+0.074)    0.255 (+0.135)
vTTS    0.080       0.114 (+0.034)    0.163 (+0.083)