Please use this identifier to cite or link to this item: http://repository.iitr.ac.in/handle/123456789/21736
Title: A Generative Adversarial Network Based Ensemble Technique for Automatic Evaluation of Machine Synthesized Speech
Authors: Jaiswal J.
Chaubey A.
Bhimavarapu S.K.R.
Kashyap S.
Kumar P.
Raman B.
Roy P.P.
Palaiahnakote S.
Sanniti di Baja G.
Wang L.
Yan W.Q.
Published in: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
5th Asian Conference on Pattern Recognition, ACPR 2019
Abstract: In this paper, we propose a method to automatically compute a speech evaluation metric, the Virtual Mean Opinion Score (vMOS), for speech generated by Text-to-Speech (TTS) models in order to analyse its human-ness. In contrast to the manual speech evaluation techniques currently in use, the proposed method uses an end-to-end neural network to calculate a vMOS that is qualitatively similar to the manually obtained Mean Opinion Score (MOS). A Generative Adversarial Network (GAN) and a binary classifier are trained on real natural speech with known MOS, and the vMOS is calculated by averaging the scores obtained from the two networks. The input to the GAN's discriminator is conditioned on speech generated by off-the-shelf TTS models so as to get closer to natural speech. The proposed model can be trained with a minimal amount of data, since its objective is to generate only the evaluation score and not speech. The method has been tested on speech synthesized by state-of-the-art TTS models: it reports vMOS values of 0.6675, 0.4945 and 0.4890 for Wavenet2, Tacotron and Deepvoice3 respectively, while the vMOS for natural speech is 0.6682, on a scale from 0 to 1. These vMOS scores correspond to, and are qualitatively explained by, the manually calculated MOS scores. © 2020, Springer Nature Switzerland AG.
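Illustrative note: the averaging step described in the abstract can be sketched as below. This is a minimal sketch, not the authors' implementation. It assumes each network yields one score per utterance in the range 0 to 1 and that the two scores are weighted equally; the function names `virtual_mos` and `corpus_vmos` and the sample values are hypothetical.

```python
from statistics import mean
from typing import Iterable, Tuple


def virtual_mos(discriminator_score: float, classifier_score: float) -> float:
    """Average two scores for one utterance.

    Assumption (not stated in the record): both scores lie in [0, 1]
    and contribute equally to the result.
    """
    for name, score in (
        ("discriminator_score", discriminator_score),
        ("classifier_score", classifier_score),
    ):
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {score}")
    return (discriminator_score + classifier_score) / 2.0


def corpus_vmos(score_pairs: Iterable[Tuple[float, float]]) -> float:
    """Mean of the per-utterance averages over a set of utterances.

    Assumption: a system-level score is the plain mean over utterances.
    """
    return mean(virtual_mos(d, c) for d, c in score_pairs)


if __name__ == "__main__":
    # Hypothetical scores chosen for illustration only; not results from the paper.
    print(virtual_mos(0.5, 0.75))
    print(corpus_vmos([(0.5, 0.75), (0.25, 0.25)]))
```

The range check is there because an average of two scores only stays on the 0 to 1 scale mentioned in the abstract if both inputs are already on that scale.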
Citation: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2020), 12047 LNCS: 580-593
URI: https://doi.org/10.1007/978-3-030-41299-9_45
http://repository.iitr.ac.in/handle/123456789/21736
Issue Date: 2020
Publisher: Springer
Keywords: Automatic speech evaluation
Binary classifier
Conditional GAN
Text-to-speech
Virtual Mean Opinion Score
ISBN: 9.78E+12
ISSN: 0302-9743
Author Scopus IDs: 57215650673
57215655760
57215667887
57215662859
57200340329
23135470700
56880478500
Author Affiliations: Jaiswal, J., Department of Computer Science and Engineering, Indian Institute of Technology Roorkee, Roorkee, 247667, India
Chaubey, A., Department of Computer Science and Engineering, Indian Institute of Technology Roorkee, Roorkee, 247667, India
Bhimavarapu, S.K.R., Department of Computer Science and Engineering, Indian Institute of Technology Roorkee, Roorkee, 247667, India
Kashyap, S., Department of Computer Science and Engineering, Indian Institute of Technology Roorkee, Roorkee, 247667, India
Kumar, P., Department of Computer Science and Engineering, Indian Institute of Technology Roorkee, Roorkee, 247667, India
Raman, B., Department of Computer Science and Engineering, Indian Institute of Technology Roorkee, Roorkee, 247667, India
Roy, P.P., Department of Computer Science and Engineering, Indian Institute of Technology Roorkee, Roorkee, 247667, India
Corresponding Author: Jaiswal, J.; Department of Computer Science and Engineering, India; email: jjaynil@cs.iitr.ac.in
Appears in Collections: Conference Publications [CS]

Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.