Paper: arXiv
Authors: Dayong Li, Xian Li, Xiaofei Li.
Abstract: Zero-shot voice conversion (VC) converts speech from one speaker to a target speaker while preserving the original linguistic content, given only a single reference clip of the unseen target speaker. This work proposes a new VC model whose key idea is thorough speaker–content disentanglement: an advanced speech encoder followed by vector quantization (VQ) serves as the content encoder, and an advanced speaker encoder provides accurate speaker embeddings. In addition, we propose a perceptual loss, a speaker contrastive loss, and an adversarial loss to compensate for the content imperfection introduced by VQ and to further improve speech quality and intelligibility. Overall, the proposed model uses only unsupervised features and losses, and achieves excellent VC performance in terms of both speech quality/intelligibility and speaker similarity, for both seen and unseen speakers.
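The VQ content bottleneck mentioned above can be illustrated with a minimal sketch: each frame of the content encoder's output is snapped to its nearest codebook entry, discarding fine speaker detail while keeping linguistic content. All dimensions and the `vector_quantize` helper below are hypothetical for illustration, not the paper's actual configuration.

```python
import numpy as np

def vector_quantize(z, codebook):
    """Replace each frame of content features z (T, D) with its nearest
    entry in codebook (K, D). Returns quantized frames and code indices."""
    # Squared Euclidean distance between every frame and every code: (T, K).
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)            # nearest code per frame, shape (T,)
    return codebook[idx], idx             # quantized features (T, D), indices

# Hypothetical sizes: 100 frames, 16-dim features, 64 codebook entries.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(64, 16))
z = rng.normal(size=(100, 16))
z_q, idx = vector_quantize(z, codebook)
```

In training, the quantized `z_q` would feed the decoder together with the speaker embedding; the perceptual, contrastive, and adversarial losses then recover quality lost at this bottleneck.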
Randomly sampled from the VCTK dataset.
(Audio samples 1–20, comparing: source | target | vqmivc | s3prl | s2vc | dvqvc.)
Randomly sampled from the VCC2020 dataset.
(Audio samples 1–20, comparing: source | target | vqmivc | s3prl | s2vc | dvqvc.)