
vTTS: visual-text to speech

Yoshifumi Nakano1, Takaaki Saeki1, Shinnosuke Takamichi1, Katsuhito Sudoh2, Hiroshi Saruwatari1
1 The University of Tokyo, Japan. 2 Nara Institute of Science and Technology, Japan.
[email protected], shinnosuke [email protected]
arXiv:2203.14725v1 [cs.SD] 28 Mar 2022

Abstract

This paper proposes visual-text to speech (vTTS), a method for synthesizing speech from visual text (i.e., text as an image). Conventional TTS converts phonemes or characters into discrete symbols and synthesizes a speech waveform from them, thus losing the visual features that the characters essentially have. Therefore, our method synthesizes speech not from discrete symbols but from visual text. The proposed vTTS extracts visual features with a convolutional neural network and then generates acoustic features with a non-autoregressive model inspired by FastSpeech2. Experimental results show that 1) vTTS is capable of generating speech with naturalness comparable to or better than a conventional TTS, 2) it can transfer emphasis and emotion attributes in visual text to speech without additional labels and architectures, and 3) it can synthesize more natural and intelligible speech from unseen and rare characters than conventional TTS.
Index Terms: speech synthesis, visual text, visual-text to speech, visual feature

Figure 1: Comparison of speech synthesis methods. (a) TTS: text input; (b) vTTS: visual text input. Whereas the basic TTS uses discrete text features, our vTTS uses visual features.

1. Introduction

Text to speech (TTS) is a method for synthesizing speech from arbitrary text. Along with the development of deep neural networks (DNNs), neural TTS has succeeded in synthesizing speech with the same naturalness as human speech [1, 2, 3, 4]. Generally, TTS first converts each character (or each phoneme based on linguistic knowledge) into a discrete symbol and then synthesizes a speech waveform from it (Figure 1(a)).

However, when we read a sentence out loud, we do not view each character as a discrete symbol but rather use the visual information of the character. For example, when we read a phonogram (a character representing a speech sound), the character or the combination of sub-characters (components of one character) determines the reading. In addition, we can recognize attributes in visual text and reflect them in speech. For example, when reading a textbook, we can recognize an underlined (or bold, italic) word as important and read the word emphatically [5]. Another example is typeface, which sometimes evokes a certain emotion [6, 7]. Advertisements and comics utilize this to convey desired emotions to readers [8]. From these facts, it is more appropriate to use the visual information of text to synthesize speech.

In this paper, we propose visual-text to speech (vTTS), which generates a speech waveform from visual text (Figure 1(b)). Our vTTS consists of a visual feature extractor and a FastSpeech2-inspired model. The visual feature extractor extracts visual features from visual text using a convolutional neural network (CNN). The FastSpeech2-inspired model predicts acoustic features from the extracted visual features.

Our vTTS has three advantages over conventional TTS. First, vTTS can reflect emphasis and emotion attributes in speech by recognizing differences in the typeface of the visual text itself, whereas the basic TTS requires additional labels and architectures [9, 10]. Second, our vTTS does not require a vocabulary (lookup table) of characters, unlike the basic TTS. A pre-determined fixed-size vocabulary cannot embed out-of-vocabulary (OOV) characters (typically, an "unknown" symbol is used instead). Our vTTS avoids this and obtains visual features even from OOV characters. Last, for some languages, our model can efficiently predict the reading from character compositionality (i.e., the reading or meaning is determined by the structure of the character). Even when rare and OOV characters emerge, vTTS can synthesize more natural and intelligible speech than the basic TTS by utilizing character compositionality.

We compared our vTTS with the basic TTS in three phonetic languages, Japanese (Hiragana), Korean (Hangul), and English (Roman Alphabet), and revealed the following properties.
• Our vTTS is comparable to or better than the basic TTS in terms of naturalness of synthetic speech.
• It can transfer emphasis and emotion attributes in visual text to speech without additional labels and architectures.
• It can synthesize more natural and intelligible speech from OOV and rare phonograms than the basic TTS.
Our implementation1 and corpus2 are open-sourced on the project pages for reproducibility and future work.

1 https://github.com/Yoshifumi-Nakano/visual-text-to-speech
2 https://sites.google.com/site/shinnosuketakamichi/research-topics/jecs corpus

2. Related Work

2.1. Text to speech (TTS)

As the basic TTS model, we use FastSpeech2 [11], a well-known non-autoregressive model. FastSpeech2 consists of three components: an encoder, a variance adapter, and a decoder. The input of the original FastSpeech2 is a phoneme sequence, but we use a character sequence instead for a fair comparison with the proposed method. Section 4.1 notes the results of preliminary experiments on this change.
Some studies use information other than text, such as reference voices [12], face images [13], and articulation [14]. These methods have an additional encoder for such information. However, a similar architecture is not suitable for vTTS. Since visual text is not discrete, as described in Section 1, text and visual information should not be treated separately. Therefore, we propose a model that jointly treats text and visual information in Section 3.

Although a character is an input unit of speech synthesis that does not rely on language knowledge, some studies use other units, e.g., phrase [15], word [16], subword [17], and byte [18], in descending order of length. In this paper, we use a pixel, which is a smaller input unit than ever before. We can observe a similar trend in natural language processing, as described below.

2.2. Visual text for natural language processing (NLP)

In NLP, the input unit of language models and machine translation models has been examined. Starting with phrases [19] and words [20] and moving through subwords [21], characters [22], and sub-characters [23], some research proposes the use of pixels (i.e., visual text).

There is one study that extracted character embeddings from visual text [24]. This study utilized logogram compositionality, allowing for better processing of rare words. Similarly, there are studies that have used character compositionality in a word segmentation task [25] or a sentiment classification task [26]. In addition, the use of sliced visual text, which is analogous to subword text symbols, leads to more robustness against various types of noise [27]. These studies mainly used a CNN to extract the visual information from visual text.

These studies are based on the compositionality of logograms and acquire meaning-related features. In contrast, our study aims to acquire the sound-related features contained in characters, especially phonograms.

Figure 2: Overall architecture of the proposed method. (a) Overall architecture: sliced visual text images are passed through the visual feature extractor and then the encoder, variance adapter, and decoder (with positional encoding) to predict the mel spectrogram. (b) Visual feature extractor: Conv2d, batch normalization, ReLU, reshape, and a linear layer. (c) Sliced visual text: the text "speech" rendered as an image and cut into slices of width wc and height h at intervals s.

Figure 3: Example of visual text. Structure and combination of sub-characters determine the overall reading (top), e.g., 강 (kang) = ㄱ (k) + ㅏ (a) + ㅇ (ng). Emphatic text (middle: underline, bold, italic) and typeface-decorated text (bottom: Aiharahudemozikaisyo for sadness, Koruri for joy) evoke emphasis and emotion, respectively.

3. Method

3.1. Architecture comparison between vTTS and TTS

In the proposed vTTS, we replace the character embedding layer (i.e., the text feature extractor) of the original FastSpeech2 with a visual feature extractor (Figure 2(a)). This replacement allows visual features to be extracted from visual text. The subsequent encoder, variance adapter, and decoder are the same as in the original FastSpeech2. To investigate the ideal performance of vTTS, in this paper we use artificially generated visual text rather than text in in-the-wild images (e.g., advertisements and comics).

3.2. Generating visual text

The first step is to transform text into grayscale sliced images. The overall process is illustrated in Figure 2(c). Since the original FastSpeech2 is a non-autoregressive TTS utilizing a duration-based upsampler, we must account for the temporal alignment between visual text and a speech feature sequence. Therefore, we use visual text with monospace fonts in this work. Each character has a specified width w, height h, and font size fs, so characters of length n are transformed into an image whose width is nw and height is h. We extract sliced images from this image using sliding windows, similar to a previous study [27]. The window has a specified width wc and height h and is extracted at intervals s (= w). Here, c represents the number of characters in one sliced image, and adjusting c changes how many preceding and following characters are considered when obtaining visual features from each piece of visual text. For example, if c = 3, convolution is performed including one character before and one after. By considering the preceding and following characters, readings and prosody that depend on the neighboring characters can be acquired. Blank images are padded to the left and right edges so that one sliced image is created for each character. Finally, we generate n sliced images of height h and width wc from the text.
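As a concrete illustration of the slicing described above, here is a minimal NumPy sketch (our own code, not the authors' released implementation) that cuts an already rendered grayscale text image of shape (h, n·w) into n windows of width w·c with stride s = w, padding blank columns at both edges; the function name, the odd-c restriction, and the use of 0 as the blank background value are assumptions.

```python
import numpy as np

def slice_visual_text(image: np.ndarray, w: int, c: int) -> np.ndarray:
    """Cut a grayscale text image of shape (h, n*w) into n sliding windows.

    Each window is w*c pixels wide and centered on one character; blank
    (background) columns are padded at both edges, and the window moves
    at intervals s = w, as described in Section 3.2.
    """
    assert c % 2 == 1, "use an odd context size so each window is centered on a character"
    h, total_w = image.shape
    n = total_w // w                      # number of characters
    pad = (c - 1) // 2 * w                # blank pixels padded on each side
    padded = np.pad(image, ((0, 0), (pad, pad)), constant_values=0)
    # one slice of shape (h, w*c) per character, stride s = w
    slices = [padded[:, i * w : i * w + w * c] for i in range(n)]
    return np.stack(slices)               # shape: (n, h, w*c)

# Example: a dummy 30-pixel-high image of 6 characters, window of c = 3 characters
dummy = np.zeros((30, 6 * 30), dtype=np.float32)
print(slice_visual_text(dummy, w=30, c=3).shape)  # (6, 30, 90)
```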
3.3. Visual features

We extract visual features from the sliced images generated in Section 3.2. The basic TTS uses the output of a text feature extractor; in comparison, vTTS uses the output of a visual feature extractor instead. The architecture of the extractor is shown in Figure 2(b) and is inspired by a machine translation study using visual text [27]. The extractor mainly comprises 2D convolution, 2D batch normalization [28], and the ReLU activation function [29]. Finally, we apply a linear layer to the reshaped output to obtain a visual feature for each character. The visual features are fed to the FastSpeech2-inspired encoder.
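A minimal PyTorch sketch of such an extractor is given below. It follows the description above together with the hyperparameters reported in Section 4.1 (a single-channel 3×3 convolution with stride 1 and padding 1, and a 256-dimensional output); the class and variable names, the batching, and the exact reshape are our assumptions and may differ from the authors' implementation.

```python
import torch
import torch.nn as nn

class VisualFeatureExtractor(nn.Module):
    """Conv2d -> BatchNorm2d -> ReLU -> reshape -> Linear, per sliced image."""

    def __init__(self, h: int = 30, w: int = 30, c: int = 3, out_dim: int = 256):
        super().__init__()
        # Section 4.1: one channel, kernel size 3, stride 1, padding 1
        self.conv = nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=1)
        self.bn = nn.BatchNorm2d(1)
        self.relu = nn.ReLU()
        # flatten each (h, w*c) slice and project to the encoder hidden size (256)
        self.linear = nn.Linear(h * w * c, out_dim)

    def forward(self, slices: torch.Tensor) -> torch.Tensor:
        # slices: (batch, n, h, w*c) -- n sliced images per utterance
        b, n, h, wc = slices.shape
        x = slices.reshape(b * n, 1, h, wc)       # add a channel axis
        x = self.relu(self.bn(self.conv(x)))
        x = x.reshape(b, n, h * wc)               # one flattened vector per character
        return self.linear(x)                     # (batch, n, 256), fed to the encoder

# Example with a batch of 2 utterances of 6 characters each
feats = VisualFeatureExtractor()(torch.zeros(2, 6, 30, 90))
print(feats.shape)  # torch.Size([2, 6, 256])
```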
The visual feature extractor utilizes the following information. Figure 3 shows examples of visual text with the following information.
• Compositionality: In phonetic languages (e.g., Korean), the character or combination of sub-characters determines the overall reading. The visual feature extractor acquires the correspondence between a character (or sub-character) and the reading. Therefore, even if OOV and rare characters emerge, we expect our vTTS to be able to accurately predict the readings using the visual features of the characters or sub-characters.
• Emphasis attribute: The visual feature extractor distinguishes emphasized typefaces (underline, bold, italic) from normal typefaces. This enables vTTS to synthesize word-emphasized speech only from emphasized typefaces.
• Emotional attribute: The visual feature extractor acquires the correspondence between typeface and emotion, and it reflects the emotion in speech in accordance with the typeface.
4. Experiments and Results

4.1. Experimental Setup

Language. We evaluated the performance of the proposed method on three phonetic languages: Japanese (Hiragana), Korean (Hangul), and English (Roman Alphabet). For Japanese, we converted Katakana and Kanji into 80 Hiragana characters in advance. In Hiragana, one character corresponds to a specific sound. For Korean, we used 1226 Hangul characters. In Hangul, the combination of jamo determines the sound. For English, we applied lowercasing to facilitate training.

Dataset. We used the JSUT [30] (Japanese), KSS [31] (Korean), and LJSpeech [32] (English) corpora for the naturalness comparison of vTTS and TTS described in Section 4.2.1. The training (validation) data sizes for these languages were 8.3 (0.61), 9.0 (0.38), and 19 (0.89) hours, respectively. The test data sizes were 100 samples for all languages. We used the JECS corpus (see our project page), which contains noun-emphasized Japanese speech, for the emphasis experiments described in Section 4.2.2. The training, validation, and test data sizes were 0.412 hours, 0.03 hours, and 50 samples, respectively. We used the manga2voice corpus [33], which contains emotional speech, for the emotion experiment described in Section 4.2.3. The training, validation, and test data sizes were 0.062 hours, 0.003 hours, and 50 samples, respectively. Since the training data sizes of the JECS and manga2voice corpora were small, we fine-tuned a vTTS model pretrained on the JSUT corpus. We used the KSS corpus in Section 4.2.4; the data sizes were almost the same as, but slightly different from, those in Section 4.2.1 (see Section 4.2.4 for details). All audio was downsampled to 22.05 kHz. The alignment between visual text and acoustic features was obtained through forced alignment of visual text and a character sequence, and of a character sequence and an acoustic feature sequence.

Visual text. We used pygame3 to generate visual text from text. For Japanese and English, we used the IPA typeface; for Korean, the Gowun Batang typeface. Visual text of w = 30 and h = 30 was used for all languages. fs was set to 15 for Japanese and Korean and to 20 for English, which renders the text at an appropriate size within each character image.
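For reference, the snippet below sketches one way to render such monospace visual text with pygame into a grayscale array compatible with the slicing in Section 3.2; the font path, glyph centering, and white-on-black color scheme are our assumptions rather than details taken from the paper.

```python
import numpy as np
import pygame

def render_visual_text(text: str, font_path: str, w: int = 30, h: int = 30, fs: int = 15) -> np.ndarray:
    """Render each character into a w x h cell of a grayscale image of width len(text)*w."""
    pygame.font.init()
    font = pygame.font.Font(font_path, fs)          # monospace font at font size fs
    canvas = pygame.Surface((w * len(text), h))     # one w x h cell per character
    canvas.fill((0, 0, 0))                          # black background
    for i, ch in enumerate(text):
        glyph = font.render(ch, True, (255, 255, 255))  # white glyph
        gw, gh = glyph.get_size()
        # center the glyph inside its cell (our choice of placement)
        canvas.blit(glyph, (i * w + (w - gw) // 2, (h - gh) // 2))
    rgb = pygame.surfarray.array3d(canvas)          # shape (width, height, 3)
    return rgb.mean(axis=2).T                       # grayscale, shape (h, len(text)*w)

# e.g., image = render_visual_text("speech", "path/to/monospace.ttf", fs=20)
```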
Model configuration. The model size and hyperparameters of the original FastSpeech2 and of the FastSpeech2-inspired architecture in vTTS followed the open-source implementation [34]4. In the visual feature extractor, the 2D convolution used only one channel, with a padding of 1, a kernel size of 3, and a stride of 1 [27]. The output dimension of the visual feature extractor was set to 256, the same as the encoder hidden dimension of FastSpeech2. We generated speech waveforms from mel spectrograms using the pretrained HiFi-GAN [35]5.

Preliminary experiment. As described in Section 2.1, we used characters instead of phonemes as the input unit for FastSpeech2. We conducted a subjective evaluation of this substitution under the same conditions as in Section 4.2.1. The results showed no significant difference in naturalness for Japanese and English. On the other hand, the use of phonemes was better than that of characters and visual text for Korean. This comparison is not the main scope of this paper, but a deeper investigation will be our future work.

3 https://www.pygame.org/news
4 https://github.com/Wataru-Nakata/FastSpeech2-JSUT
5 https://github.com/jik876/hifi-gan

Table 1: Results of MOS and 95% confidence interval for naturalness. An asterisk (*) marks a score significantly better than TTS. For all languages, vTTS was comparable to or better than TTS in naturalness.

Lang.  TTS            vTTS (c = 1)     vTTS (c = 3)    vTTS (c = 5)
ja     3.45 ± 0.09    3.41 ± 0.09      3.46 ± 0.09     3.49 ± 0.10
ko     3.04 ± 0.16    3.55 ± 0.15 *    3.18 ± 0.15     3.01 ± 0.15
en     3.72 ± 0.10    3.69 ± 0.10      3.70 ± 0.11     3.71 ± 0.10

4.2. Results

4.2.1. Naturalness comparison between vTTS and TTS

To compare our vTTS with conventional TTS, we conducted a mean opinion score (MOS) evaluation of the naturalness of the synthetic speech. We used three settings for vTTS, c = 1, 3, 5, to investigate the impact of the window size. Native speakers of each language participated in the evaluation. 150 and 200 speakers recruited through our crowdsourcing system listened to 20 speech utterances in Japanese and English, respectively. In contrast, 10 speakers recruited by snowball sampling6 listened to 160 speech utterances in Korean. The text content was kept consistent among the different methods so that all listeners examined only naturalness without other interfering factors.

6 We could not collect a sufficient number of native listeners in our crowdsourcing system.
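Table 1 above reports MOS values with 95% confidence intervals. The paper does not state how the intervals were computed; a common choice, sketched below under a normal-approximation assumption, is mean ± 1.96·s/√N over the individual ratings.

```python
import math

def mos_with_ci(scores, z: float = 1.96):
    """Mean opinion score with a 95% confidence interval (normal approximation)."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((x - mean) ** 2 for x in scores) / (n - 1)   # sample variance
    half_width = z * math.sqrt(var / n)
    return mean, half_width

ratings = [4, 3, 5, 3, 4, 4, 2, 5, 3, 4]   # toy 5-point ratings, not real data
mean, ci = mos_with_ci(ratings)
print(f"{mean:.2f} ± {ci:.2f}")
```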
The results are shown in Table 1. For all languages, the naturalness of vTTS was comparable to or better than that of TTS. Comparing the highest score of vTTS (c = 5 for Japanese and English, c = 1 for Korean) with TTS, there was no significant difference for Japanese and English, but there was a significant difference for Korean (p < 0.05). This means that the extraction of visual features was effective for Hangul, whose readings are determined by the combination of sub-characters. It can also be seen that the appropriate window size differed depending on the language. For English, there was no significant difference between c = 1, 3, 5. However, for Japanese, the naturalness with a sliding window of c = 5 was slightly better than that with c = 1 (p < 0.10), while for Korean, c = 1 was significantly better than c = 5 (p < 0.001). One possible reason is that the number of phonemes expressed by one character differs depending on the language. The following experiments used the best c value for each language.

4.2.2. Emphasis attribute

We evaluated whether our vTTS can transfer the emphasis attribute in visual text to speech. We trained the vTTS model using pairs of word-emphasized speech and visual text. We used three types of word emphasis (underline, bold, and italic), as shown in Figure 3(b). We trained the vTTS model on each type, i.e., three vTTS models were trained. The trained models try to synthesize speech with emphasis placed on the specified words. We prepared synthetic speech not only for emphasis groups ("Emphasis (*)") but also for control groups ("Control (*)"). Their difference was in the input visual text: word-emphasized visual text in the emphasis groups and non-emphasized text in the control groups.

For the listening test, listeners listened to speech randomly selected from seven kinds of speech utterances (three emphasis groups, three control groups, and ground-truth natural speech). The listeners answered which word was emphasized from among the word (specifically, noun) candidates. 150 native speakers listened to 20 speech utterances. The average number of candidates per sentence was 3.6; the chance rate was 0.299.

Table 2: Accuracy of perceived emphasized words. Our method can properly reflect the emphasis attribute in visual text in speech.

Speech                  Accuracy
Ground truth            0.960
Emphasis (underline)    0.933
Emphasis (bold)         0.898
Emphasis (italic)       0.877
Control (italic)        0.505
Control (bold)          0.400
Control (underline)     0.381

The results are shown in Table 2. For all emphasis methods, the accuracy of the emphasis groups was significantly better than that of the control groups (p < 0.01). This shows that our vTTS properly recognized the emphasis attribute in visual text. In addition, the accuracy of the emphasis groups was close to that of the ground truth: there was no significant difference between two emphasis methods (underline, bold) and the ground truth. These results show that these two emphasis methods can accurately transfer the emphasis attribute in visual text to speech without additional emphasis labels and architectures. One possible reason for the lower accuracy of the italic input compared with the other two emphasis methods is that the shape of italic text is less similar to that of the normal typeface because of its geometric transformation.
4.2.3. Emotion attribute

We evaluated whether our method can transfer emotions evoked by certain typefaces to speech. We trained the vTTS model using pairs of emotional speech and typeface-decorated visual text. Before the training, we conducted preliminary experiments to find typefaces that evoke emotions. As a result, we decided to use the Koruri typeface7 and the Aiharahudemozikaisyo typeface8, which evoke joy and sadness, respectively. These are typefaces that more than 70 percent of evaluators associated with the corresponding emotion.

7 https://github.com/Koruri/Koruri/releases/tag/20210720
8 https://faraway.work/font.html

We trained one vTTS model using the above two typefaces. In the listening test, listeners listened to the synthetic speech and answered with the perceived emotion. To prevent listeners from perceiving emotion from the linguistic content, the content was common between emotions and was selected from the "neutral" subset of the manga2voice corpus. 120 listeners participated, and each listener listened to 20 speech utterances.

Table 3: Confusion matrix between reference (true *) and perceived (pred *) emotion. Our method can properly reflect emotion attributes in visual text in synthetic speech.

              pred happy    pred sad
true happy    0.795         0.205
true sad      0.114         0.886

The results are shown in Table 3. When listening to speech synthesized from the Koruri typeface, 79% of the listeners answered that it sounded joyful, and when listening to speech synthesized from the Aiharahudemozikaisyo typeface, 89% of the listeners answered that it sounded sad. These scores are comparable to those of TTS with emotion labels [9]. These results show that vTTS can accurately transfer the emotion attribute in visual text to speech without additional emotion labels and architectures.

4.2.4. Speech synthesis from rare and OOV characters

Finally, we investigated the ability of vTTS to synthesize speech from OOV and rare characters. A good example for this purpose is Korean because character compositionality determines the reading. We designed three test sets: in-vocab, rare, and OOV. The "in-vocab" set consisted of only characters appearing more than three times in the training data. Sentences in the "rare" set included characters appearing fewer than three times in the training data, and those in the "OOV" set included OOV characters. Each test set consisted of 50 sentences. For comparison, we trained the TTS model under the same conditions as the vTTS. The TTS model had an "unknown" symbol, and each OOV character was encoded as that symbol. We conducted a MOS test on naturalness and a dictation test. Listeners listened to speech utterances randomly selected from all test sets and then rated the speech naturalness in the MOS test and dictated the text in the dictation test. The character error rate (CER) was used to score the dictation test. 10 listeners participated in each test. Each listener listened to 160 and 64 speech utterances in the MOS test and dictation test, respectively.
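Since the dictation test is scored with CER, the sketch below shows the standard definition (Levenshtein distance between the dictated and reference character sequences divided by the reference length); the paper does not describe its exact CER implementation, so this is only an illustration.

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = (substitutions + deletions + insertions) / len(reference)."""
    r, h = list(reference), list(hypothesis)
    # Levenshtein distance via dynamic programming
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[len(r)][len(h)] / len(r)

print(character_error_rate("visual text", "visual test"))  # about 0.09
```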
Table 4: Results of MOS and 95% confidence interval on naturalness. ∆ denotes the MOS decrease from "in-vocab." Our method can synthesize more natural speech than TTS.

       in-vocab       rare (∆)               OOV (∆)
TTS    3.29 ± 0.16    2.32 ± 0.16 (−0.97)    2.31 ± 0.20 (−0.98)
vTTS   3.58 ± 0.13    3.12 ± 0.16 (−0.46)    2.95 ± 0.21 (−0.63)

Table 5: Results on CER. ∆ denotes the CER increase from "in-vocab." Our method can synthesize more intelligible speech than TTS.

       in-vocab    rare (∆)          OOV (∆)
TTS    0.120       0.194 (+0.074)    0.255 (+0.135)
vTTS   0.080       0.114 (+0.034)    0.163 (+0.083)

The MOS results on naturalness are shown in Table 4. There was a significant difference between vTTS and TTS in the "in-vocab," "rare," and "OOV" sets (p < 0.001). For both TTS and vTTS, the scores decreased from "in-vocab" to "rare" and "OOV," but the decreases for vTTS were smaller than those for TTS. The dictation results for intelligibility are shown in Table 5. The overall tendency was the same as in the MOS test. There was a significant difference between TTS and vTTS in the "in-vocab," "rare," and "OOV" sets (p < 0.001). The CER increase for vTTS was also smaller than that for TTS. From these results, we can say that our vTTS captures character compositionality for speech synthesis and synthesizes intelligible speech even from OOV and rare characters.

5. Conclusions

In this paper, we proposed vTTS, which generates speech waveforms from visual text. Experimental results showed that 1) the basic performance of vTTS is comparable to or better than that of TTS, 2) it can reflect several attributes in visual text in speech, and 3) it can generate more natural speech from OOV and rare characters. In future work, we plan to apply our method to a variety of media such as manga and posters. We hope that this research will serve as a first benchmark in the development of speech synthesis from visual text.

Acknowledgements: This work was supported by JST Moonshot R&D Grant Number JPMJPS2011 (for the multi-language experiments) and JSPS KAKENHI Grants 21H05054, 19H01116, and 21H04900 (for the basic techniques).
6. References

[1] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, Sep. 2016.
[2] Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. V. Le, Y. Agiomyrgiannakis, R. A. J. Clark, and R. A. Saurous, "Tacotron: Towards end-to-end speech synthesis," in Proc. INTERSPEECH, Aug. 2017, pp. 4006–4010.
[3] J. Shen, R. Pang, R. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, R. Saurous, Y. Agiomyrgiannakis, and Y. Wu, "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in Proc. ICASSP, Apr. 2018, pp. 4779–4783.
[4] Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, "FastSpeech: Fast, robust and controllable text to speech," in Proc. NeurIPS, May 2019, pp. 3165–3174.
[5] H. Strobelt, D. Oelke, B. C. Kwon, T. Schreck, and H. Pfister, "Guidelines for effective usage of text highlighting techniques," IEEE Transactions on Visualization and Computer Graphics, pp. 489–498, May 2016.
[6] S. Choi, T. Yamasaki, and K. Aizawa, "Typeface emotion analysis for communication on mobile messengers," in Proc. the 1st International Workshop on Multimedia Alternate Realities, Oct. 2016, pp. 37–40.
[7] J. Caldwell, "Japanese typeface personalities: Are typeface personalities consistent across culture?" in Proc. IEEE International Professional Communication 2013 Conference, Jul. 2013, pp. 1–8.
[8] K. Nonaka, J. Saito, and S. Nakamura, "Music video clip impression emphasis method by font fusion synchronized with music," in Proc. Entertainment Computing and Serious Games, Nov. 2019, pp. 146–157.
[9] R. Liu, B. Sisman, and H. Li, "Reinforcement learning for emotional text-to-speech synthesis with improved emotion discriminability," in Proc. INTERSPEECH, Aug. 2021, pp. 4648–4652.
[10] C. Cui, Y. Ren, J. Liu, F. Chen, R. Huang, M. Lei, and Z. Zhao, "EMOVIE: A Mandarin emotion speech dataset with a simple emotional text-to-speech model," in Proc. INTERSPEECH, Aug. 2021, pp. 2766–2770.
[11] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, "FastSpeech 2: Fast and high-quality end-to-end text to speech," in Proc. ICLR, Sep. 2021.
[12] Y. Wang, D. Stanton, Y. Zhang, R. J. Skerry-Ryan, E. Battenberg, J. Shor, Y. Xiao, F. Ren, Y. Jia, and R. A. Saurous, "Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis," in Proc. ICML, Mar. 2018, pp. 5167–5176.
[13] S. Goto, K. Onishi, Y. Saito, K. Tachibana, and K. Mori, "Face2Speech: Towards multi-speaker text-to-speech synthesis using an embedding vector predicted from a face image," in Proc. INTERSPEECH, Oct. 2020.
[14] T. G. Csapó, L. V. Tóth, G. Gosztolya, and A. Markó, "Speech synthesis from text and ultrasound tongue image-based articulatory input," arXiv preprint arXiv:2107.02003, Jul. 2021.
[15] Y. Hono, K. Tsuboi, K. Sawada, K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda, "Hierarchical multi-grained generative model for expressive speech synthesis," in Proc. INTERSPEECH, Oct. 2020, pp. 3441–3445.
[16] G. Sun, Y. Zhang, R. J. Weiss, Y. Cao, H. Zen, and Y. Wu, "Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis," in Proc. ICASSP, May 2020, pp. 6264–6268.
[17] M. Aso, S. Takamichi, N. Takamune, and H. Saruwatari, "Acoustic model-based subword tokenization and prosodic-context extraction without language knowledge for text-to-speech synthesis," Speech Communication, vol. 125, pp. 53–60, Dec. 2020.
[18] B. Li, Y. Zhang, T. Sainath, Y. Wu, and W. Chan, "Bytes are all you need: End-to-end multilingual speech recognition and synthesis with bytes," in Proc. ICASSP, May 2019, pp. 5621–5625.
[19] R. Zens, F. J. Och, and H. Ney, "Phrase-based statistical machine translation," in Proc. Annual Conference on Artificial Intelligence, Sep. 2002, pp. 18–32.
[20] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," arXiv preprint arXiv:1607.04606, Jul. 2016.
[21] R. Sennrich, B. Haddow, and A. Birch, "Neural machine translation of rare words with subword units," in Proc. Association for Computational Linguistics, Aug. 2016, pp. 1715–1725.
[22] H. El Boukkouri, O. Ferret, T. Lavergne, H. Noji, P. Zweigenbaum, and J. Tsujii, "CharacterBERT: Reconciling ELMo and BERT for word-level open-vocabulary representations from characters," in Proc. COLING, Dec. 2020.
[23] L. Zhang and M. Komachi, "Neural machine translation of logographic language using sub-character level information," in Proc. the Third Conference on Machine Translation, Oct. 2018, pp. 17–25.
[24] F. Liu, H. Lu, C. Lo, and G. Neubig, "Learning character-level compositionality with visual features," in Proc. ACL, Jul. 2017, pp. 2059–2068.
[25] F. Dai and Z. Cai, "Glyph-aware embedding of Chinese characters," in Proc. EMNLP, Sep. 2017, pp. 64–69.
[26] B. Sun, L. Yang, P. Dong, W. Zhang, J. Dong, and C. Young, "Super characters: A conversion from sentiment classification to image classification," in Proc. EMNLP, Oct. 2018, pp. 309–315.
[27] E. Salesky, D. Etter, and M. Post, "Robust open-vocabulary translation from visual text representations," in Proc. EMNLP, Nov. 2021, pp. 7235–7252.
[28] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. ICML, Jul. 2015, pp. 448–456.
[29] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proc. ICML, Jun. 2010, pp. 807–814.
[30] R. Sonobe, S. Takamichi, and H. Saruwatari, "JSUT corpus: free large-scale Japanese speech corpus for end-to-end speech synthesis," arXiv preprint arXiv:1711.00354, Oct. 2017.
[31] K. Park, "KSS dataset: Korean single speaker speech dataset," https://kaggle.com/bryanpark/korean-single-speaker-speech-dataset, 2018.
[32] K. Ito and L. Johnson, "The LJSpeech dataset," https://keithito.com/LJ-Speech-Dataset/, 2017.
[33] S. Takamichi, Y. Saito, T. Nakamura, T. Kooriyama, and H. Saruwatari, "manga2voice: speech analysis towards audio synthesis from comic image," in Proc. 2020 Spring Meeting of the Acoustical Society of Japan, Mar. 2020.
[34] C.-M. Chien, J.-H. Lin, C.-y. Huang, P.-c. Hsu, and H.-y. Lee, "Investigating on incorporating pretrained and learnable speaker representations for multi-speaker multi-style text-to-speech," in Proc. ICASSP, Jun. 2021, pp. 8588–8592.
[35] J. Kong, J. Kim, and J. Bae, "HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis," in Proc. NeurIPS, Oct. 2020, pp. 17022–17033.
