Please use this identifier to cite or link to this item: https://www.um.edu.mt/library/oar/handle/123456789/108015
Title: Autoencoder-based voice conversion for ASR data augmentation
Authors: Spiteri, Julian
Keywords: Automatic speech recognition
Maltese language
Neural networks (Computer science)
Issue Date: 2022
Citation: Spiteri, J. (2022). Autoencoder-based voice conversion for ASR data augmentation (Master's dissertation).
Abstract: Automatic Speech Recognition (ASR) has progressed rapidly in languages such as English, where abundant data is available. Understudied languages such as Maltese, in common with other under-resourced languages, have less data available for system training. State-of-the-art systems based on deep neural architectures rely on the availability of large datasets running into hundreds of hours. Recently, the MASRI project at the University of Malta compiled a corpus of around 8 hours of speech. This is the largest corpus of Maltese speech data for ASR, but it is still insufficient for training such deep neural architectures with the same expectation of success as in high-resource languages. One possible way of augmenting the currently available training data for ASR in Maltese is to fine-tune the recogniser to work on a single target voice. Synthetic data produced by the FITA synthesiser can provide practically limitless hours of training data. However, every real speech utterance to be transcribed would then require voice conversion to that same target voice before being fed to the ASR. This study investigated the possibility of using autoencoder-based voice conversion techniques to perform this operation. The first part of the work consisted of creating an aligned parallel dataset of Mel spectrograms of utterances from the multi-speaker MASRI corpus and the Maltese speech synthesiser. In the second stage, several autoencoder architectures were trained on this parallel dataset to study the possibility of converting several human voices to one target synthetic voice. Three architectures were selected for detailed evaluation: Basic-AE, U-Net-AE and AE-DNN. Unseen data from the created dataset was used to test the architectures. Each architecture was evaluated by inspecting the predicted Mel spectrograms and the corresponding audio, by objective evaluation with Log Spectral Distortion (LSD), and by subjective evaluation with Mean Opinion Score (MOS) from a survey that tested voice similarity and intelligibility. Although direct comparison is not possible since different datasets were used, the LSD results outperformed similar work using the AE-DNN technique, while the MOS results were slightly lower than those of other similar works.
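The abstract mentions building an aligned parallel dataset of Mel spectrograms from MASRI recordings and synthesiser output. As a rough illustration of what such a preparation step might look like, the sketch below computes log-Mel spectrograms with librosa and time-aligns a human/synthetic utterance pair with dynamic time warping. The file names and parameters (sample rate, n_mels=80, hop length) are illustrative assumptions, not values taken from the dissertation.

```python
import librosa
import numpy as np

def mel_db(path, sr=22050, n_fft=1024, hop=256, n_mels=80):
    """Load audio and return a log-Mel spectrogram of shape (n_mels, frames)."""
    y, sr = librosa.load(path, sr=sr)
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                       hop_length=hop, n_mels=n_mels)
    return librosa.power_to_db(S, ref=np.max)

def align_pair(src_path, tgt_path):
    """Time-align a human utterance with its synthetic counterpart via DTW."""
    src, tgt = mel_db(src_path), mel_db(tgt_path)
    _, wp = librosa.sequence.dtw(X=src, Y=tgt, metric='euclidean')
    wp = wp[::-1]  # librosa returns the warping path end-to-start
    return src[:, wp[:, 0]], tgt[:, wp[:, 1]]

# Hypothetical file pair: one MASRI recording and the same sentence
# rendered by the speech synthesiser.
# src_mel, tgt_mel = align_pair('masri_utt.wav', 'synth_utt.wav')
```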
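The core idea of the second stage is to train an autoencoder that maps source-speaker Mel frames to target-voice Mel frames. The dissertation's Basic-AE, U-Net-AE and AE-DNN configurations are not reproduced here; the following is only a generic frame-wise autoencoder sketch in PyTorch, with hypothetical layer sizes, showing one training step on aligned (source, target) pairs.

```python
import torch
import torch.nn as nn

class BasicAE(nn.Module):
    """Generic frame-wise autoencoder: encode a Mel frame, decode it back."""
    def __init__(self, n_mels=80, bottleneck=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_mels, 64), nn.ReLU(),
            nn.Linear(64, bottleneck), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, 64), nn.ReLU(),
            nn.Linear(64, n_mels),
        )

    def forward(self, x):  # x: (batch, n_mels)
        return self.decoder(self.encoder(x))

model = BasicAE()
loss_fn = nn.MSELoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One illustrative training step on stand-in data: the loss pulls the
# reconstruction of human-speech frames towards the synthetic-voice frames.
src = torch.randn(16, 80)  # placeholder for human-speech Mel frames
tgt = torch.randn(16, 80)  # placeholder for aligned synthetic-voice frames
opt.zero_grad()
loss = loss_fn(model(src), tgt)
loss.backward()
opt.step()
```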
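For the objective evaluation, Log Spectral Distortion is commonly defined as the per-frame root-mean-square difference between two log-magnitude spectra, averaged over frames. A minimal implementation of that common definition follows; the dB convention and input shapes are assumptions, and the dissertation's exact variant may differ.

```python
import numpy as np

def log_spectral_distortion(ref_db, est_db):
    """LSD between two log-magnitude spectrograms in dB, shape (freq, frames)."""
    assert ref_db.shape == est_db.shape
    per_frame = np.sqrt(np.mean((ref_db - est_db) ** 2, axis=0))  # RMS per frame
    return float(np.mean(per_frame))  # average over frames; lower is better
```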
Description: M.Sc. (Melit.)
URI: https://www.um.edu.mt/library/oar/handle/123456789/108015
Appears in Collections: Dissertations - FacICT - 2022
Dissertations - FacICTCCE - 2022

Files in This Item:
File: 22MSPMLFT003.pdf (Restricted Access)
Size: 11.42 MB
Format: Adobe PDF


Items in OAR@UM are protected by copyright, with all rights reserved, unless otherwise indicated.