Audio samples from "DVQVC: AN UNSUPERVISED ZERO-SHOT VOICE CONVERSION FRAMEWORK"

Paper: arXiv

Authors: Dayong Li, Xian Li, Xiaofei Li.

Abstract: Zero-shot voice conversion (VC) converts speech from a source speaker to a target speaker while preserving the original linguistic information, given only one reference speech clip of the unseen target speaker. This work proposes a new VC model whose key idea is to thoroughly disentangle speaker and content by adopting an advanced speech encoder plus vector quantization (VQ) as the content encoder, and an advanced speaker encoder for accurate speaker embedding. In addition, we propose a perceptual loss, a speaker contrastive loss and an adversarial loss to compensate for the content imperfection caused by VQ and to further improve speech quality and intelligibility. Overall, the proposed model uses only unsupervised features and losses, and achieves excellent VC performance in terms of both speech quality/intelligibility and speaker similarity, for both seen and unseen speakers.
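The vector quantization step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy 2-dimensional codebook and the input frames are all assumptions made for illustration, and a real content encoder would use a learned codebook over high-dimensional speech features. The sketch only shows the general idea of replacing each feature frame with its nearest codebook entry, which discards fine detail (including some speaker information) from the content representation.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    frames: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> Tuple[List[int], List[List[float]]]:
    """Map each frame to the index and value of its nearest codebook entry."""
    indices: List[int] = []
    quantized: List[List[float]] = []
    for frame in frames:
        # Pick the codebook entry with the smallest squared distance.
        best = min(
            range(len(codebook)),
            key=lambda k: squared_distance(frame, codebook[k]),
        )
        indices.append(best)
        quantized.append(list(codebook[best]))
    return indices, quantized


# Hypothetical toy data: three codebook entries and three feature frames.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]]

indices, quantized = quantize(frames, codebook)
print(indices)
print(quantized)
```

Because the quantized frames can only take values from the codebook, some content detail is lost, which is the imperfection that the additional losses described in the abstract are meant to compensate for.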

Comparison among systems: seen scenarios.

Randomly sampled from the VCTK dataset.

source target vqmivc s3prl s2vc dvqvc
1:
2:
3:
4:
5:
6:
7:
8:
9:
10:
11:
12:
13:
14:
15:
16:
17:
18:
19:
20:

Comparison among systems: unseen scenarios.

Randomly sampled from the VCC2020 dataset.

source target vqmivc s3prl s2vc dvqvc
1:
2:
3:
4:
5:
6:
7:
8:
9:
10:
11:
12:
13:
14:
15:
16:
17:
18:
19:
20: