Paper: arXiv
Authors: Dayong Li, Xian Li, Xiaofei Li.
Abstract: Zero-shot voice conversion (VC) converts speech from a source speaker to a target speaker while preserving the original linguistic information, given only one reference speech clip of the unseen target speaker. This work proposes a new VC model whose key idea is to thoroughly disentangle speaker and content by adopting an advanced speech encoder plus vector quantization (VQ) as the content encoder, and an advanced speaker encoder for accurate speaker embedding. In addition, we propose a perceptual loss, a speaker contrastive loss and an adversarial loss to compensate for the content imperfection caused by VQ and to further improve speech quality and intelligibility. Overall, the proposed model uses only unsupervised features and losses, and achieves excellent VC performance in terms of both speech quality/intelligibility and speaker similarity, for both seen and unseen speakers.
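To make the VQ step mentioned in the abstract concrete, here is a minimal sketch of nearest-neighbour vector quantization in plain Python. It is not the authors' implementation: the function names `squared_distance` and `quantize`, the toy 2-D frames, and the two-entry codebook are all assumptions chosen for illustration. In a real model the codebook would be learned and the frames would come from the speech encoder.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    frames: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> Tuple[List[int], List[List[float]]]:
    """Replace each frame with its nearest codebook entry.

    Returns the chosen codebook indices and the quantized frames.
    """
    indices: List[int] = []
    quantized: List[List[float]] = []
    for frame in frames:
        # Pick the codebook entry with the smallest distance to this frame.
        best = min(
            range(len(codebook)),
            key=lambda k: squared_distance(frame, codebook[k]),
        )
        indices.append(best)
        quantized.append(list(codebook[best]))
    return indices, quantized


# Toy data (hypothetical): three 2-D "content" frames, two codebook entries.
codebook = [[0.0, 0.0], [1.0, 1.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [0.4, 0.4]]
indices, quantized = quantize(frames, codebook)
```

Because every frame is snapped to one of a small number of codebook entries, fine detail in the input is discarded. This is the kind of information bottleneck the abstract relies on to separate content from speaker, and also the source of the "content imperfection" that the additional losses are meant to compensate for.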
Randomly sampled from the VCTK dataset.
| source | target | vqmivc | s3prl | s2vc | dvqvc |
|---|---|---|---|---|---|
| 1: | |||||
| 2: | |||||
| 3: | |||||
| 4: | |||||
| 5: | |||||
| 6: | |||||
| 7: | |||||
| 8: | |||||
| 9: | |||||
| 10: | |||||
| 11: | |||||
| 12: | |||||
| 13: | |||||
| 14: | |||||
| 15: | |||||
| 16: | |||||
| 17: | |||||
| 18: | |||||
| 19: | |||||
| 20: | |||||
Randomly sampled from the VCC2020 dataset.
| source | target | vqmivc | s3prl | s2vc | dvqvc |
|---|---|---|---|---|---|
| 1: | |||||
| 2: | |||||
| 3: | |||||
| 4: | |||||
| 5: | |||||
| 6: | |||||
| 7: | |||||
| 8: | |||||
| 9: | |||||
| 10: | |||||
| 11: | |||||
| 12: | |||||
| 13: | |||||
| 14: | |||||
| 15: | |||||
| 16: | |||||
| 17: | |||||
| 18: | |||||
| 19: | |||||
| 20: | |||||