WaveNet

WaveNet is a deep neural network for generating raw audio. It was created by researchers at London-
based AI firm DeepMind. The technique, outlined in a paper in September 2016,[1] is able to generate
relatively realistic-sounding human-like voices by directly modelling waveforms using a neural network
method trained with recordings of real speech. Tests with US English and Mandarin reportedly showed that
the system outperforms Google's best existing text-to-speech (TTS) systems, although as of 2016 its text-to-speech synthesis was still less convincing than actual human speech.[2] WaveNet's ability to generate raw
waveforms means that it can model any kind of audio, including music.[3]

History
Generating speech from text is an increasingly common task thanks to the popularity of software such as
Apple's Siri, Microsoft's Cortana, Amazon Alexa and the Google Assistant.[4]

Most such systems use a variation of a technique that involves concatenating sound fragments together to form recognisable sounds and words.[5] The most common of these is called concatenative TTS.[6] It consists of a large library of speech fragments, recorded from a single speaker, that are then concatenated to produce complete words and sounds. The result sounds unnatural, with an odd cadence and tone.[7] The reliance on a recorded library also makes it difficult to modify or change the voice.[8]

Another technique, known as parametric TTS,[9] uses mathematical models to recreate sounds that are then
assembled into words and sentences. The information required to generate the sounds is stored in the
parameters of the model. The characteristics of the output speech are controlled via the inputs to the model,
while the speech is typically created using a voice synthesiser known as a vocoder. This can also result in
unnatural-sounding audio.

Design and ongoing research

Background

WaveNet is a type of feedforward neural network known as a deep convolutional neural network (CNN). In WaveNet, the CNN takes a raw signal as an input and synthesises an output one sample at a time. It does so by sampling from a softmax (i.e. categorical) distribution of a signal value that is encoded using a μ-law companding transformation and quantized to 256 possible values.[11]

[Figure: A stack of dilated causal convolutional layers[10]]
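
The μ-law step is concrete enough to show in code. Below is a minimal NumPy sketch of the companding transformation and the 256-level quantisation described above; the function names and the round-trip demo are illustrative assumptions, not code from any WaveNet implementation.

```python
# Mu-law companding and 256-level quantisation, as used to prepare
# WaveNet's input/output coding. mu = 255 yields 256 possible values.
import numpy as np

MU = 255

def mu_law_encode(x):
    """Compand a waveform in [-1, 1] and quantise to integers 0..255."""
    companded = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return ((companded + 1) / 2 * MU + 0.5).astype(np.int64)

def mu_law_decode(codes):
    """Invert quantisation and companding (lossy because of rounding)."""
    companded = 2 * codes.astype(np.float64) / MU - 1
    return np.sign(companded) * np.expm1(np.abs(companded) * np.log1p(MU)) / MU

x = np.linspace(-1.0, 1.0, 7)
codes = mu_law_encode(x)          # integers in 0..255, nonlinearly spaced
print(codes)
print(np.max(np.abs(mu_law_decode(codes) - x)))  # small round-trip error
```

The nonlinear spacing allots more quantisation levels to quiet signal values, where human hearing is most sensitive, which is why the paper prefers it to uniform quantisation.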

Initial concept and results

According to the original September 2016 DeepMind research paper WaveNet: A Generative Model for
Raw Audio,[12] the network was fed real waveforms of speech in English and Mandarin. As these pass
through the network, it learns a set of rules to describe how the audio waveform evolves over time. The
trained network can then be used to create new speech-like waveforms at 16,000 samples per second.
These waveforms include realistic breaths and lip smacks – but do not conform to any language.[13]
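
As a rough illustration of this sample-at-a-time process, here is a heavily simplified PyTorch sketch: a small stack of dilated causal 1-D convolutions outputs a softmax distribution over the 256 μ-law classes, and generation appends one sampled value at a time. The class name, layer sizes, and ReLU activations are assumptions made for this sketch; the published architecture additionally uses gated activations, residual and skip connections, and conditioning inputs.

```python
# A toy autoregressive generator in the spirit of WaveNet (untrained, so
# its output is noise; this shows only the mechanics of generation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyWaveNet(nn.Module):
    def __init__(self, channels=32, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.embed = nn.Embedding(256, channels)      # code -> feature vector
        self.convs = nn.ModuleList(
            [nn.Conv1d(channels, channels, kernel_size=2, dilation=d)
             for d in dilations])
        self.dilations = dilations
        self.out = nn.Conv1d(channels, 256, kernel_size=1)  # per-step logits

    def forward(self, codes):                  # codes: (batch, time) ints
        h = self.embed(codes).transpose(1, 2)  # -> (batch, channels, time)
        for conv, d in zip(self.convs, self.dilations):
            h = F.relu(conv(F.pad(h, (d, 0))))  # left-pad keeps it causal
        return self.out(h)                      # (batch, 256, time)

model = TinyWaveNet()
codes = torch.full((1, 1), 128, dtype=torch.long)    # start near "silence"
with torch.no_grad():
    for _ in range(160):                     # 160 samples = 10 ms at 16 kHz
        logits = model(codes)[:, :, -1]      # distribution for next sample
        nxt = torch.multinomial(F.softmax(logits, dim=-1), 1)
        codes = torch.cat([codes, nxt], dim=1)
print(codes.shape)                           # torch.Size([1, 161])
```

Doubling the dilation at each layer is what lets the receptive field grow exponentially with depth, so the network can condition each sample on thousands of past samples at modest cost.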

WaveNet is able to accurately model different voices, with the accent and tone of the input correlating with
the output. For example, if it is trained with German, it produces German speech.[14] The capability also
means that if the WaveNet is fed other inputs – such as music – its output will be musical. At the time of its
release, DeepMind showed that WaveNet could produce waveforms that sound like classical music.[15]

Content (voice) swapping

According to the June 2018 paper Disentangled Sequential Autoencoder,[16] DeepMind has successfully
used WaveNet for audio and voice "content swapping": the network can swap the voice on an audio
recording for another, pre-existing voice while maintaining the text and other features from the original
recording. "We also experiment on audio sequence data. Our disentangled representation allows us to
convert speaker identities into each other while conditioning on the content of the speech." (p. 5) "For audio, this allows us to convert a male speaker into a female speaker and vice versa [...]." (p. 1) According to the paper, a minimum of tens of hours (roughly 50 hours) of pre-existing speech recordings of both the source and the target voice must be fed into WaveNet for the program to learn their individual features before it can perform the conversion from one voice to another at a satisfying quality. The authors stress that "[a]n advantage of the model is that it separates dynamical from static features [...]" (p. 8): WaveNet distinguishes the spoken text and the modes of delivery (modulation, speed, pitch, mood, etc.), which it maintains during the conversion, from the basic features of the source and target voices, which it is required to swap.
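
To make the recombination idea concrete, the following toy NumPy sketch shows only the data flow: an encoder splits an utterance into a static (speaker) vector and a per-frame dynamic (content) sequence, and a decoder recombines them, so that swapping voices amounts to pairing the source's dynamic code with the target's static code. Every name, shape, and weight matrix here is an untrained stand-in invented for this illustration, not DeepMind's model.

```python
# Data flow of "content swapping" with a disentangled representation.
import numpy as np

rng = np.random.default_rng(0)
W_static = rng.standard_normal((160, 8))    # placeholder "speaker" encoder
W_dynamic = rng.standard_normal((160, 16))  # placeholder "content" encoder
W_decode = rng.standard_normal((8 + 16, 160))

def encode(frames):                       # frames: (n_frames, 160) of audio
    static = np.tanh(frames.mean(axis=0) @ W_static)   # one vector: "who"
    dynamic = np.tanh(frames @ W_dynamic)              # per frame: "what"
    return static, dynamic

def decode(static, dynamic):              # recombine latents into frames
    rep = np.broadcast_to(static, (len(dynamic), len(static)))
    return np.concatenate([rep, dynamic], axis=1) @ W_decode

src = rng.standard_normal((100, 160))     # utterance by the source speaker
tgt = rng.standard_normal((80, 160))      # utterance by the target speaker

_, src_dynamic = encode(src)              # keep the source's content
tgt_static, _ = encode(tgt)               # take the target's voice
swapped = decode(tgt_static, src_dynamic) # source's words, target's identity
print(swapped.shape)                      # (100, 160)
```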

The January 2019 follow-up paper Unsupervised speech representation learning using WaveNet
autoencoders[17] details a method for improving the automatic recognition of, and discrimination between, dynamic and static features, in order to make "content swapping", notably the swapping of voices on existing audio recordings, more reliable. Another follow-up paper, Sample
Efficient Adaptive Text-to-Speech,[18] dated September 2018 (latest revision January 2019), states that
DeepMind has successfully reduced the minimum amount of real-life recordings required to sample an
existing voice via WaveNet to "merely a few minutes of audio data" while maintaining high-quality results.

WaveNet's ability to clone voices has raised ethical concerns about mimicking the voices of living and dead persons. According to a 2016 BBC article, companies working on similar voice-cloning technologies (such as Adobe Voco) intend to insert watermarking inaudible to humans to prevent counterfeiting. They also maintain that voice cloning that satisfies, for instance, the needs of the entertainment industry is of far lower complexity and uses different methods than would be required to fool forensic evidencing methods and electronic ID devices, so that natural voices and voices cloned for entertainment purposes could still easily be told apart by technological analysis.[19]

Applications
At the time of its release, DeepMind said that WaveNet required too much computational processing power
to be used in real world applications.[20] As of October 2017, Google announced a 1,000-fold performance
improvement along with better voice quality. WaveNet was then used to generate Google Assistant voices
for US English and Japanese across all Google platforms.[21] In November 2017, DeepMind researchers
released a research paper detailing a proposed method, called "Probability Density Distillation", for "generating high-fidelity speech samples at more than 20 times faster than real-time".[22] At the annual I/O
developer conference in May 2018, it was announced that new Google Assistant voices were available and
made possible by WaveNet; WaveNet greatly reduced the number of audio recordings that were required to
create a voice model by modelling the raw audio of the voice actor samples.[23]

See also
15.ai
Deep learning speech synthesis

References
1. van den Oord, Aäron; Dieleman, Sander; Zen, Heiga; Simonyan, Karen; Vinyals, Oriol; Graves, Alex; Kalchbrenner, Nal; Senior, Andrew; Kavukcuoglu, Koray (2016-09-12). "WaveNet: A Generative Model for Raw Audio". arXiv:1609.03499 (https://arxiv.org/abs/1609.03499). Bibcode:2016arXiv160903499V.
2. Kahn, Jeremy (2016-09-09). "Google's DeepMind Achieves Speech-Generation Breakthrough" (https://www.bloomberg.com/news/articles/2016-09-09/google-s-ai-brainiacs-achieve-speech-generation-breakthrough). Bloomberg.com. Retrieved 2017-07-06.
3. Meyer, David (2016-09-09). "Google's DeepMind Claims Massive Progress in Synthesized Speech" (http://fortune.com/2016/09/09/google-deepmind-wavenet-ai/). Fortune. Retrieved 2017-07-06.
4. Kahn, Jeremy (2016-09-09). "Google's DeepMind Achieves Speech-Generation Breakthrough" (https://www.bloomberg.com/news/articles/2016-09-09/google-s-ai-brainiacs-achieve-speech-generation-breakthrough). Bloomberg.com. Retrieved 2017-07-06.
5. Condliffe, Jamie (2016-09-09). "When this computer talks, you may actually want to listen" (https://www.technologyreview.com/s/602343/face-of-a-robot-voice-of-an-angel/). MIT Technology Review. Retrieved 2017-07-06.
6. Hunt, A. J.; Black, A. W. (May 1996). "Unit selection in a concatenative speech synthesis system using a large speech database". 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings (https://www.ee.columbia.edu/~dpwe/e6820/papers/HuntB96-speechsynth.pdf) (PDF). Vol. 1. pp. 373–376. CiteSeerX 10.1.1.218.1335. doi:10.1109/ICASSP.1996.541110. ISBN 978-0-7803-3192-1. S2CID 14621185.
7. Coldewey, Devin (2016-09-09). "Google's WaveNet uses neural nets to generate eerily convincing speech and music" (https://techcrunch.com/2016/09/09/googles-wavenet-uses-neural-nets-to-generate-eerily-convincing-speech-and-music/). TechCrunch. Retrieved 2017-07-06.
8. van den Oord, Aäron; Dieleman, Sander; Zen, Heiga (2016-09-08). "WaveNet: A Generative Model for Raw Audio" (https://deepmind.com/blog/wavenet-generative-model-raw-audio/). DeepMind. Retrieved 2017-07-06.
9. Zen, Heiga; Tokuda, Keiichi; Black, Alan W. (2009). "Statistical parametric speech synthesis". Speech Communication. 51 (11): 1039–1064. CiteSeerX 10.1.1.154.9874. doi:10.1016/j.specom.2009.04.004. S2CID 3232238.
10. van den Oord, Aäron (2017-11-12). "High-fidelity speech synthesis with WaveNet" (https://www.deepmind.com/blog/high-fidelity-speech-synthesis-with-wavenet). DeepMind. Retrieved 2022-06-05.
11. van den Oord, Aäron; Dieleman, Sander; Zen, Heiga; Simonyan, Karen; Vinyals, Oriol; Graves, Alex; Kalchbrenner, Nal; Senior, Andrew; Kavukcuoglu, Koray (2016-09-12). "WaveNet: A Generative Model for Raw Audio". arXiv:1609.03499 (https://arxiv.org/abs/1609.03499). Bibcode:2016arXiv160903499V.
12. van den Oord et al. (2016). WaveNet: A Generative Model for Raw Audio (https://arxiv.org/abs/1609.03499), Cornell University, 19 September 2016.
13. Gershgorn, Dave (2016-09-09). "Are you sure you're talking to a human? Robots are starting to sound eerily lifelike" (https://qz.com/778056/google-deepminds-wavenet-algorithm-can-accurately-mimic-human-voices/). Quartz. Retrieved 2017-07-06.
14. Coldewey, Devin (2016-09-09). "Google's WaveNet uses neural nets to generate eerily convincing speech and music" (https://techcrunch.com/2016/09/09/googles-wavenet-uses-neural-nets-to-generate-eerily-convincing-speech-and-music/). TechCrunch. Retrieved 2017-07-06.
15. van den Oord, Aäron; Dieleman, Sander; Zen, Heiga (2016-09-08). "WaveNet: A Generative Model for Raw Audio" (https://deepmind.com/blog/wavenet-generative-model-raw-audio/). DeepMind. Retrieved 2017-07-06.
16. Li & Mandt (2018). Disentangled Sequential Autoencoder (https://arxiv.org/abs/1803.02991), 12 June 2018, Cornell University.
17. Chorowski et al. (2019). Unsupervised speech representation learning using WaveNet autoencoders (https://arxiv.org/abs/1901.08810), 25 January 2019, Cornell University.
18. Chen et al. (2018). Sample Efficient Adaptive Text-to-Speech (https://arxiv.org/abs/1809.10460v1), 27 September 2018, Cornell University. Also see this paper's latest January 2019 revision (https://arxiv.org/abs/1809.10460v3).
19. "Adobe Voco 'Photoshop-for-voice' causes concern" (https://www.bbc.com/news/technology-37899902), 7 November 2016, BBC.
20. "Adobe Voco 'Photoshop-for-voice' causes concern" (https://www.bbc.co.uk/news/technology-37899902). BBC News. 2016-11-07. Retrieved 2017-07-06.
21. WaveNet launches in the Google Assistant (https://deepmind.com/blog/wavenet-launches-google-assistant/).
22. van den Oord et al. (2017). Parallel WaveNet: Fast High-Fidelity Speech Synthesis (https://arxiv.org/abs/1711.10433), Cornell University, 28 November 2017.
23. Martin, Taylor (2018-05-09). "Try the all-new Google Assistant voices right now" (https://www.cnet.com/how-to/how-to-get-all-google-assistants-new-voices-right-now/). CNET. Retrieved 2018-05-10.

External links
WaveNet: A Generative Model for Raw Audio (https://deepmind.com/blog/wavenet-a-generative-model-for-raw-audio/)
