Please use this identifier to cite or link to this item: https://www.um.edu.mt/library/oar/handle/123456789/137976
Title: Maltese spelling and error correction
Authors: Oliveri, Federico (2025)
Keywords: Maltese language -- Grammar
Neural networks (Computer science) -- Malta
Transcription
Artificial intelligence -- Malta
Issue Date: 2025
Citation: Oliveri, F. (2025). Maltese spelling and error correction (Bachelor's dissertation).
Abstract: Grammatical error correction (GEC) aims to identify and correct mistakes within a piece of text. GEC tools are commonly used in a variety of contexts, including education, government, and media. Early approaches to GEC relied on hard-and-fast grammatical rules to detect errors. Modern approaches use neural models that learn to correct these mistakes by training on a large collection of example sentences. As a result, GEC tools for languages with large speaking populations have achieved remarkable accuracy in their corrections due to the breadth of available resources. However, low-resource languages – such as Maltese – necessitate more sophisticated approaches to make up for the lack of available resources. This work aims to improve on currently available solutions for Maltese GEC. Grammatical error correction can be thought of as a translation task, in which ungrammatical text is translated into grammatical text. Following this reasoning, a neural machine translation model can be trained on a dataset of paired incorrect and correct sentences. Ideally, this dataset is constructed entirely of real-world corrections, so that it closely models mistakes made by actual speakers. However, for Maltese, no sufficiently large dataset of this kind currently exists. To remedy this, a dataset can be synthesized by introducing errors into otherwise grammatical sentences, thereby generating the desired sentence pairs. For synthetic data generation, two primary monolingual sources provide clean Maltese text: Korpus Malti, a comprehensive collection of grammatically correct Maltese texts, and a curated subset of Common Voice transcriptions. For each sentence, errors were introduced probabilistically using several token- and character-level operations. These operations include word substitution, deletion, insertion, and swapping, along with character-level modifications such as changing capitalisation or anglicising Maltese characters (e.g. ‘Ġ’ → ‘G’). This results in a dataset of sentence pairs suitable for training. While models trained on only synthetic data have been shown to be sufficient, further training on authentic data improves the accuracy of the model. Two such authentic datasets are available: Qari tal-Provi, a manually curated collection of grammatical errors and their corrections in Maltese text, and data gathered from a novel data collection tool developed by A. Busuttil, which captures transcription errors by preventing users from correcting mistakes while transcribing spoken Maltese sentences. Additional data was extracted from the Maltese Wikipedia’s revision history by identifying edits likely to be grammatical corrections.
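The synthetic data generation described in the abstract can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function names (`anglicise`, `flip_case`, `corrupt`, `make_pairs`), the error probability `p`, the uniform choice between operations, the whitespace tokenisation, and the example sentence and vocabulary are all assumptions made for illustration only.

```python
import random

# Assumed mapping from Maltese letters to their anglicised ASCII look-alikes.
ANGLICISE = {
    "ċ": "c", "ġ": "g", "ħ": "h", "ż": "z",
    "Ċ": "C", "Ġ": "G", "Ħ": "H", "Ż": "Z",
}

# Operation names are hypothetical labels for the operations the abstract lists.
OPS = ["substitute", "delete", "insert", "swap", "case", "anglicise"]


def anglicise(token):
    """Replace Maltese-specific letters with plain ASCII letters."""
    return "".join(ANGLICISE.get(ch, ch) for ch in token)


def flip_case(token):
    """Flip the case of the first character (a simple capitalisation error)."""
    return token[:1].swapcase() + token[1:]


def corrupt(sentence, vocab, p=0.15, rng=None):
    """Corrupt each whitespace-separated token with probability p."""
    rng = rng or random.Random()
    tokens = sentence.split()
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if rng.random() >= p:
            out.append(tok)  # keep the token unchanged
            i += 1
            continue
        op = rng.choice(OPS)
        if op == "substitute":
            out.append(rng.choice(vocab))
        elif op == "delete":
            pass  # drop the token
        elif op == "insert":
            out.extend([rng.choice(vocab), tok])
        elif op == "swap" and i + 1 < len(tokens):
            out.extend([tokens[i + 1], tok])
            i += 1  # the next token has already been emitted
        elif op == "case":
            out.append(flip_case(tok))
        elif op == "anglicise":
            out.append(anglicise(tok))
        else:
            out.append(tok)  # swap requested on the last token: leave as-is
        i += 1
    return " ".join(out)


def make_pairs(sentences, vocab, p=0.15, seed=0):
    """Build (noisy source, clean target) pairs for translation-style training."""
    rng = random.Random(seed)
    return [(corrupt(s, vocab, p, rng), s) for s in sentences]


if __name__ == "__main__":
    clean = ["Il-qattus jiekol il-ħut"]  # illustrative sentence, not from the corpora
    for noisy, target in make_pairs(clean, vocab=["kelb", "dar"], p=0.5, seed=1):
        print(noisy, "->", target)
```

Each pair places the corrupted sentence on the source side and the original sentence on the target side, matching the translation framing in the abstract. A fixed seed keeps the generated dataset reproducible.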
Description: B.Sc. (Hons) ICT(Melit.)
URI: https://www.um.edu.mt/library/oar/handle/123456789/137976
Appears in Collections: Dissertations - FacICT - 2025
Dissertations - FacICTAI - 2025

Files in This Item:
File: 2508ICTICT390905079901_1.PDF
Access: Restricted Access
Size: 767.25 kB
Format: Adobe PDF