Deep learning speech synthesis
Deep learning speech synthesis uses deep neural networks (DNNs) to produce artificial speech from text (text-to-speech) or from a spectrum (vocoder). The networks are trained on a large amount of recorded speech and, in the case of a text-to-speech system, the associated labels and/or input text.
Some DNN-based speech synthesizers are approaching the naturalness of the human voice.
Formulation
Given an input text or some sequence of linguistic units $Y$, the target speech $X$ can be derived by

$$X = \arg\max_{X} P(X \mid Y, \theta)$$

where $\theta$ is the model parameter.
Typically, the input text is first passed to an acoustic feature generator, and the acoustic features are then passed to the neural vocoder. For the acoustic feature generator, the loss function is typically an L1 or L2 loss. These loss functions impose a constraint that the output acoustic feature distributions must be Gaussian or Laplacian. In practice, since the human voice band ranges from approximately 300 to 4000 Hz, the loss function is designed to put more penalty on this range:

$$\text{loss} = \alpha \cdot \text{loss}_{\text{human}} + (1 - \alpha) \cdot \text{loss}_{\text{other}}$$

where $\text{loss}_{\text{human}}$ is the loss from the human voice band and $\alpha$ is a scalar, typically around 0.5. The acoustic feature is typically a spectrogram or a mel-scale spectrogram. These features capture the time-frequency relation of the speech signal and are sufficient for generating intelligible output. The mel-frequency cepstrum features used in speech recognition are not suitable for speech synthesis because they discard too much information.
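As an illustration of such a band-weighted loss, the sketch below computes an L1 spectrogram loss with extra weight on the 300–4000 Hz band. The function name, the linear-frequency magnitude-spectrogram input, and the default sample rate and FFT size are illustrative assumptions rather than a specific published recipe.

```python
import numpy as np

def band_weighted_l1_loss(pred_spec, target_spec, sample_rate=16000,
                          n_fft=1024, band=(300.0, 4000.0), alpha=0.5):
    """L1 spectrogram loss with extra weight on the human voice band.

    pred_spec, target_spec: magnitude spectrograms, shape (frames, n_fft // 2 + 1).
    band: frequency range in Hz that receives the extra penalty.
    alpha: mixing weight between in-band and out-of-band loss terms.
    """
    # Centre frequency of each spectrogram bin.
    freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    in_band = (freqs >= band[0]) & (freqs <= band[1])

    abs_err = np.abs(pred_spec - target_spec)   # element-wise L1 error
    loss_human = abs_err[:, in_band].mean()     # error inside 300-4000 Hz
    loss_other = abs_err[:, ~in_band].mean()    # error outside the band

    # loss = alpha * loss_human + (1 - alpha) * loss_other
    return alpha * loss_human + (1.0 - alpha) * loss_other
```

The exact normalisation (per bin, per frame, or per utterance) varies between systems; the only essential idea is that errors in the voice band contribute more to the total loss.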
Brief history
In September 2016, DeepMind proposed WaveNet, a deep generative model of raw audio waveforms, demonstrating that deep learning-based models can model raw waveforms and generate speech from acoustic features such as spectrograms or mel-scale spectrograms, or even from preprocessed linguistic features. In early 2017, the research institute Mila proposed char2wav, a model that produces a raw waveform in an end-to-end fashion. Also in 2017, Google and Facebook proposed Tacotron and VoiceLoop, respectively, to generate acoustic features directly from the input text. Later in the same year, Google proposed Tacotron 2, which combined the WaveNet vocoder with a revised Tacotron architecture to perform end-to-end speech synthesis. Tacotron 2 can generate high-quality speech approaching the human voice. Since then, end-to-end methods have become one of the most active research topics, as researchers worldwide began to recognise the power of end-to-end speech synthesizers.
Advantages and disadvantages
The advantages of end-to-end methods are as follows:
- Only need a single model to perform text analysis, acoustic modeling and audio synthesis, i.e. synthesizing speech directly from characters
- Less feature engineering
- Easily allows for rich conditioning on various attributes, e.g. speaker or language
- Adaptation to new data is easier
- More robust than multi-stage models because no component's error can compound
- Powerful model capacity to capture the hidden internal structures of data
- Capable of generating intelligible and natural speech
- No need to maintain a large database, i.e. small footprint
Despite the many advantages mentioned, end-to-end methods still have many challenges to be solved:
- Auto-regressive models suffer from a slow inference problem
- Output speech is not robust when training data are insufficient
- Lack of controllability compared with traditional concatenative and statistical parametric approaches
- Tend to learn flat prosody by averaging over training data
- Tend to output smoothed acoustic features because L1 or L2 loss is used
Challenges
- Slow inference problem
To solve the slow inference problem, Microsoft Research and Baidu Research both proposed non-autoregressive models to speed up the inference process. The FastSpeech model proposed by Microsoft uses a Transformer architecture with a duration model to achieve this goal; a minimal sketch of its duration-based length regulator appears after this list. In addition, the duration model, which borrows from traditional methods, makes speech production more robust.
- Robustness problem
Researchers have found that the robustness problem is strongly related to text alignment failures, and this finding has driven many researchers to revise the attention mechanism to exploit the strong local relations and monotonic properties of speech.
- Controllability problem
To address the controllability problem, many approaches based on variational auto-encoders have been proposed.[1][2]
- Flat prosody problem
GST-Tacotron can slightly alleviate the flat prosody problem; however, it still depends heavily on training data.
- Smoothed acoustic output problem
To generate more realistic acoustic features, a GAN learning strategy can be applied.
However, in practice, neural vocoders generalize well even when the input features are smoother than real data.
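As referenced under the slow inference problem above, the core of FastSpeech's non-autoregressive design is a length regulator that expands phoneme-level encoder outputs to frame level according to the predicted durations, so that all frames can then be decoded in parallel. Below is a minimal sketch of that expansion step; the shapes and names are illustrative, not FastSpeech's actual code.

```python
import numpy as np

def length_regulate(phoneme_states, durations):
    """Expand phoneme-level states to frame level using predicted durations.

    phoneme_states: array of shape (num_phonemes, hidden_dim).
    durations: integer array of shape (num_phonemes,), frames per phoneme.
    Returns an array of shape (sum(durations), hidden_dim).
    """
    # Repeat each phoneme's hidden state durations[i] times along the time axis.
    return np.repeat(phoneme_states, durations, axis=0)

# Example: three phonemes with hidden size 4, predicted to last 2, 3 and 1 frames.
states = np.random.randn(3, 4)
frames = length_regulate(states, np.array([2, 3, 1]))
print(frames.shape)  # (6, 4) -- all six frames can now be decoded in parallel
```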
Semi-supervised learning
Self-supervised learning has gained much attention for making better use of unlabelled data. Research[3][4] has shown that, with the aid of a self-supervised loss, the need for paired data decreases.
Zero-shot speaker adaptation
Zero-shot speaker adaptation is promising because a single model can generate speech with various speaker styles and characteristics. In June 2018, Google proposed using a pre-trained speaker verification model as a speaker encoder to extract speaker embeddings.[5] The speaker encoder then becomes part of the neural text-to-speech model, so that it can determine the style and characteristics of the output speech. This procedure showed the community that it is possible to use a single model to generate speech in multiple styles.
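A minimal sketch of this kind of speaker conditioning is shown below: the embedding produced by the speaker encoder is concatenated with each text-encoder output before decoding, which is one common conditioning choice. The function name and dimensions are illustrative assumptions.

```python
import numpy as np

def condition_on_speaker(text_encoder_out, speaker_embedding):
    """Concatenate a fixed speaker embedding to every text-encoder timestep.

    text_encoder_out: (text_len, enc_dim) outputs of the text encoder.
    speaker_embedding: (spk_dim,) vector from a pre-trained speaker encoder.
    Returns (text_len, enc_dim + spk_dim) states that condition the decoder.
    """
    tiled = np.tile(speaker_embedding, (text_encoder_out.shape[0], 1))
    return np.concatenate([text_encoder_out, tiled], axis=-1)

# Illustration: 50 encoder steps, 256-dim encoder states, 256-dim speaker embedding.
enc = np.random.randn(50, 256)
spk = np.random.randn(256)
print(condition_on_speaker(enc, spk).shape)  # (50, 512)
```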
Neural vocoder
In deep learning-based speech synthesis, neural vocoders play an important role in generating high-quality speech from acoustic features. The WaveNet model proposed in 2016 achieves excellent speech quality. WaveNet factorises the joint probability of a waveform $x = \{x_1, \ldots, x_T\}$ as a product of conditional probabilities:

$$p_{\theta}(x) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})$$

where $\theta$ denotes the model parameters, realised by many dilated convolution layers. Thus, each audio sample is conditioned on the samples at all previous timesteps.
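This factorisation means that generation must proceed one sample at a time, each draw conditioned on the samples already produced. A toy sketch of that loop is shown below, where `predict_next` is a hypothetical stand-in for a trained WaveNet network.

```python
import numpy as np

def autoregressive_generate(predict_next, num_samples, receptive_field=1024):
    """Generate audio one sample at a time, WaveNet-style.

    predict_next: callable mapping a window of past samples to a probability
        distribution over the next quantized sample value.
    Sample t cannot be drawn until samples 1..t-1 exist, so the loop is
    inherently sequential.
    """
    audio = []
    for _ in range(num_samples):
        context = np.array(audio[-receptive_field:], dtype=np.float32)
        probs = predict_next(context)                   # p(x_t | x_1, ..., x_{t-1})
        sample = np.random.choice(len(probs), p=probs)  # draw x_t
        audio.append(sample)
    return np.array(audio)

def toy_predict_next(context):
    # Stand-in for a trained WaveNet: uniform over 256 mu-law quantization bins.
    return np.full(256, 1.0 / 256)

print(autoregressive_generate(toy_predict_next, num_samples=100).shape)  # (100,)
```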
However, the auto-regressive nature of WaveNet makes the inference process dramatically slow. To solve this problem, Parallel WaveNet[6] was proposed. Parallel WaveNet is an inverse-autoregressive-flow-based model trained by knowledge distillation from a pre-trained teacher WaveNet model. Since such inverse-autoregressive-flow-based models are non-autoregressive at inference time, the inference speed is faster than real-time. Meanwhile, Nvidia proposed the flow-based WaveGlow[7] model, which can also generate speech faster than real-time. However, Parallel WaveNet has the limitation of requiring a pre-trained teacher WaveNet model, and WaveGlow takes many weeks to converge on limited computing devices. These issues have been addressed by Parallel WaveGAN[8], which learns to produce speech through a multi-resolution spectral loss and a GAN learning strategy.
References
- ^ Hsu, Wei-Ning (2018). "Hierarchical Generative Modeling for Controllable Speech Synthesis". arXiv:1810.07217 [cs.CL].
- ^ Habib, Raza (2019). "Semi-Supervised Generative Modeling for Controllable Speech Synthesis". arXiv:1910.01709 [cs.CL].
- ^ Chung, Yu-An (2018). "Semi-Supervised Training for Improving Data Efficiency in End-to-End Speech Synthesis". arXiv:1808.10128 [cs.CL].
- ^ Ren, Yi (2019). "Almost Unsupervised Text to Speech and Automatic Speech Recognition". arXiv:1905.06791 [cs.CL].
- ^ Jia, Ye (2018). "Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis". arXiv:1806.04558 [cs.CL].
- ^ van den Oord, Aaron (2018). "Parallel WaveNet: Fast High-Fidelity Speech Synthesis". arXiv:1711.10433 [cs.CL].
- ^ Prenger, Ryan (2018). "WaveGlow: A Flow-based Generative Network for Speech Synthesis". arXiv:1811.00002 [cs.SD].
- ^ Yamamoto, Ryuichi (2019). "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram". arXiv:1910.11480 [eess.AS].