Please use this identifier to cite or link to this item: https://www.um.edu.mt/library/oar/handle/123456789/137838
Title: A multilingual approach to the digitalisation of Maltese historical manuscript
Authors: Koppens, Thomas (2025)
Keywords: Manuscripts -- Malta
Cultural property -- Malta
Data sets -- Malta
Transcribing services -- Malta
Imaging systems -- Image quality
Issue Date: 2025
Citation: Koppens, T. (2025). A multilingual approach to the digitalisation of Maltese historical manuscript (Bachelor's dissertation).
Abstract: The digitalisation of historical manuscripts is crucial for preserving cultural heritage and widening access to it. Digitising historical works not only preserves the original content but also enables its widespread distribution for research, education, and public use, and makes the text machine‐readable and searchable for a wide range of digital applications. This work focuses on the development and training of Handwritten Text Recognition (HTR) models tailored for historical Maltese manuscripts. The Maltese Handwritten Manuscripts (MHM) dataset comprises a selection of scanned document pages with corresponding transcriptions, which were manually produced by specialists at the University of Malta Library. A key challenge in this work is the automatic identification of individual text lines for line‐level recognition. Because fully automatic methods proved insufficient, a semi‐automated approach combining horizontal projection profiling and manual refinement was developed to ensure accurate segmentation. Two versions of the dataset were created, one with the original transcriptions and the other with standardised and corrected text, to facilitate the training and evaluation of the models. Two HTR models, HTR‐VT and VAN, were trained on these datasets: each was first pre‐trained on a separate dataset and then fine‐tuned on the Maltese manuscript data. The effect of freezing specific layers during training was also explored, and the results show that factors such as image quality and transcription consistency significantly affected model performance. These experiments demonstrated the value of tailored preprocessing and fine‐tuning techniques in achieving robust HTR performance, particularly for low‐resource historical datasets. Post‐processing was implemented to refine the output of the HTR models and improve transcription accuracy. 
Several spelling‐correction methods, including statistical approaches and a neural model, were trained using ground‐truth annotations from an iteration of the MHM dataset and tested on the outputs from VAN. These methods aimed to address common errors in the HTR output, such as misspellings and inconsistencies, but were ultimately unable to reduce error rates. The best approach in this work obtained a Character Error Rate (CER) of 4.7% and a Word Error Rate (WER) of 13.6%. Overall, this work lays the foundation for a continued effort to automate the digitisation of cultural and historical Maltese manuscripts.
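Note on the reported metrics: CER and WER are conventionally defined as the edit (Levenshtein) distance between the recognised text and the reference transcription, divided by the length of the reference in characters or words respectively. The sketch below illustrates that conventional definition only. The function names are hypothetical, the sample strings are invented, and the dissertation's own evaluation code may differ, for example in text normalisation or tokenisation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    # prev[j] holds the distance between the first i-1 items of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # substitute r with h (or match)
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word Error Rate: word edits divided by number of reference words.

    Assumption: words are separated by whitespace.
    """
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    # Invented example strings, not taken from the MHM dataset.
    reference = "the old manuscript"
    hypothesis = "the old manuscrpt"
    print(f"CER = {cer(reference, hypothesis):.3f}")
    print(f"WER = {wer(reference, hypothesis):.3f}")
```

In this invented example a single missing character gives a CER of 1/18 but a WER of 1/3, which shows why a WER is typically several times larger than the corresponding CER, as in the figures reported above.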
Description: B.Sc. (Hons) ICT(Melit.)
URI: https://www.um.edu.mt/library/oar/handle/123456789/137838
Appears in Collections: Dissertations - FacICT - 2025
Dissertations - FacICTAI - 2025

Files in This Item:
File: 2508ICTICT390900017413_1.PDF
Description: Restricted Access
Size: 2.42 MB
Format: Adobe PDF


Items in OAR@UM are protected by copyright, with all rights reserved, unless otherwise indicated.