Despite recent progress in generative adversarial network (GAN)-based vocoders, where the model generates a raw waveform conditioned on a mel spectrogram, it is still challenging to synthesize high-fidelity audio for numerous speakers across varied recording environments.
In this work, we present BigVGAN, a universal vocoder that generalizes well under various unseen conditions in a zero-shot setting. We introduce periodic nonlinearities [Snake] and anti-aliased representation into the generator, which brings the desired inductive bias for waveform synthesis and substantially improves audio quality.
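As a minimal sketch, the Snake periodic nonlinearity referenced above has the form x + (1/α)·sin²(αx), which adds a learnable periodic component while remaining monotonic on average; the NumPy version below is illustrative, and the parameter name `alpha` is our choice:

```python
import numpy as np

def snake(x, alpha=1.0):
    # Snake activation: x + (1/alpha) * sin^2(alpha * x).
    # alpha controls the frequency of the periodic component;
    # as alpha -> 0 the function approaches the identity.
    return x + (1.0 / alpha) * np.sin(alpha * x) ** 2
```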
Based on our improved generator and the state-of-the-art discriminators, we scale our GAN vocoder from a 14M-parameter configuration up to 112M parameters, which is unprecedented in the literature. In particular, we identify and address the training instabilities specific to such scale, while maintaining high-fidelity output without over-regularization. Our BigVGAN achieves state-of-the-art zero-shot performance for various out-of-distribution scenarios, including new speakers, novel languages, singing voices, music, and instrumental audio in unseen (even noisy) recording environments.
We will release our code and model at:
…Furthermore, it synthesizes 24 kHz high-fidelity speech 44.72× faster than real-time on a GPU.
…3.4 BigVGAN with Large Scale Training

In this subsection, we set out to explore the limits of universal neural vocoding by scaling up the generator's model capacity to its maximum while maintaining the stability of GAN training and practical usability as a high-speed neural vocoder. We start scaling up the model with our improved generator using the comparable V1 configuration of HiFi-GAN with 14M parameters, which is denoted as BigVGAN-base. We grow BigVGAN-base by increasing the number of upsampling blocks and convolution channels for each block. BigVGAN-base upsamples the signal by 256× using 4 upsampling blocks with the ratios [8, 8, 2, 2]. Each upsampling block is accompanied by multiple residual layers with dilated convolutions, i.e., the AMP module. We further divide the 256× upsampling into 6 blocks with the ratios [4, 4, 2, 2, 2, 2] for more fine-grained feature refinement. In addition, we increase the number of channels of the AMP module (analogous to MRF in HiFi-GAN) from 512 to 1,536. We denote the model with 1,536 channels and 112M parameters as BigVGAN.
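The two upsampling configurations above can be summarized in a short sketch; the dict names are ours, not from the released code, but both stacks must multiply out to the 256× total upsampling factor (the mel hop size):

```python
import math

# Hypothetical config summaries of the two generators described in the text.
bigvgan_base = {"upsample_rates": [8, 8, 2, 2], "channels": 512}         # ~14M params
bigvgan      = {"upsample_rates": [4, 4, 2, 2, 2, 2], "channels": 1536}  # ~112M params

def total_upsample(cfg):
    # Product of per-block ratios = samples generated per mel frame.
    return math.prod(cfg["upsample_rates"])
```

Splitting the same 256× factor across 6 smaller blocks refines features more gradually per stage, at the cost of a deeper generator.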
We found that the default learning rate of 2 × 10−4 used in HiFi-GAN causes an early training collapse for BigVGAN, where the losses from the discriminator submodules converge to zero within several thousand iterations. Halving the learning rate to 1 × 10−4 reduced such failures. We also found that a large batch size helps reduce mode collapse, as it covers more modes per batch. We simply double the batch size from the usual 16 to 32 as a good trade-off between training efficiency and stability, since neural vocoders can require millions of steps to converge. Our required batch size is much smaller than those reported for large-scale GAN training in the image domain (i.e., 32 vs. 2,048), probably thanks to the strong conditioning signal in neural vocoding.
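The stability recipe described above can be sketched as follows; the collapse check is our illustrative heuristic for the failure mode the text describes, not the authors' code:

```python
# Hyperparameter changes described in the text (values from the paper).
hifigan_lr = 2e-4
bigvgan_lr = hifigan_lr / 2   # halved to 1e-4 to avoid early training collapse
batch_size = 2 * 16           # doubled from the usual 16 to 32

def discriminators_collapsed(submodule_losses, eps=1e-4):
    # Heuristic: training has collapsed when every discriminator
    # submodule's loss has converged to (near) zero early in training.
    return all(loss < eps for loss in submodule_losses)
```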
In addition to the above efforts, we explored other directions, including various ways to improve the model architecture, spectral normalization to stabilize GAN training, and data augmentation to improve model generalization. Unfortunately, all of these trials resulted in worse perceptual quality in our study. The details can be found in Appendix C. We hope these practical lessons will be useful to future research endeavors.