Please use this identifier to cite or link to this item: https://www.um.edu.mt/library/oar/handle/123456789/74891
Title: A diphone-based Maltese speech synthesis system
Authors: Magro, Daniel (2019)
Keywords: Vision disorders -- Malta
Literacy -- Malta
Maltese language -- Malta
Speech synthesis
Speech processing systems
Issue Date: 2019
Citation: Magro, D. (2019). A diphone-based Maltese speech synthesis system (Bachelor's dissertation).
Abstract: While there has been work in the area, at the time of writing there are no available text-to-speech (TTS) systems for Maltese, so almost the entire system had to be built from scratch. In light of this, a diphone-based concatenative speech synthesiser was chosen as the type of system to implement, because it needs very little data: less than 20 minutes of recorded speech. A simple 'Text Normalisation' component was built, which converts integers between 0 and 9,999 written as numerals to their textual form. While this is far from covering all the possible forms of Non-Standard Words (NSWs) in Maltese, its modular design allows for easy upgrading in future work. A 'Grapheme to Phoneme (G2P)' component, which converts the normalised text into the sequence of phonemes (basic sounds) that make up the text, was also created, based on an existing implementation by Crimsonwing. Three separate 'Diphone Databases' were made available to the speech synthesiser. The first is the professionally recorded English diphone database, FestVox's 'CMU US KAL Diphone'. The second and third were created as part of this work: one with diphones manually extracted from the recorded carrier phrases in Maltese, the other with diphones automatically extracted using Dynamic Time Warping (DTW). The Time-Domain Pitch-Synchronous Overlap-Add (TD-PSOLA) concatenation algorithm was implemented to string together the diphones in the sequence specified by the G2P component. On a scale of 1 to 5, evaluators scored the speech synthesised from the manually extracted diphone database, concatenated by the TD-PSOLA algorithm, 2.57 for naturalness, 2.72 for clarity and, most importantly, 3.06 for intelligibility. These scores were higher than those obtained when using the professionally recorded English diphone set.
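Illustrative note (not part of the dissertation): the abstract names Dynamic Time Warping as the method behind the automatically extracted diphone database but does not describe the implementation. The following is a minimal sketch of the textbook DTW recurrence, assuming one-dimensional feature sequences and an absolute-difference local cost. The function name `dtw_align` and the toy inputs are hypothetical; a real system would more likely compare multi-dimensional acoustic feature frames.

```python
from typing import List, Sequence, Tuple


def dtw_align(
    ref: Sequence[float], query: Sequence[float]
) -> Tuple[float, List[Tuple[int, int]]]:
    """Return the total DTW cost and warping path between two 1-D sequences.

    Assumption: the local cost is the absolute difference of two samples.
    """
    n, m = len(ref), len(query)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")

    inf = float("inf")
    # cost[i][j] is the minimal accumulated cost of aligning
    # ref[:i] with query[:j]; row 0 and column 0 are boundary cells.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(ref[i - 1] - query[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j - 1],  # advance in both sequences
                cost[i - 1][j],      # advance in ref only
                cost[i][j - 1],      # advance in query only
            )

    # Backtrack from the final cell to recover the alignment path.
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates, key=lambda c: c[0])
    path.reverse()
    return cost[n][m], path


if __name__ == "__main__":
    total, path = dtw_align([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0])
    print(total, path)
```

Under these assumptions, the returned path pairs each index of the first sequence with one or more indices of the second. A boundary marked in a reference sequence could then, in principle, be mapped to the matching position in a new recording by following the path.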
Description: B.SC. ICT (HONS) ARTIFICIAL INTELLIGENCE
URI: https://www.um.edu.mt/library/oar/handle/123456789/74891
Appears in Collections: Dissertations - FacICT - 2019
Dissertations - FacICTAI - 2019

Files in This Item:
File: Magro Daniel.pdf
Access: Restricted
Size: 2.45 MB
Format: Adobe PDF


Items in OAR@UM are protected by copyright, with all rights reserved, unless otherwise indicated.