Event: NOMOCRAT: A Maltese corpus collection tool using an OCR
Date: Friday 12 December 2025
Time: 12:00 - 13:00
Venue: Gateway Building Room 256
There is a need for larger and more balanced corpora of Maltese text. Much of the readily available digital Maltese text is in PDF format, but extracting text from PDFs presents numerous challenges, as PDFs are designed for visual consumption and printing rather than for representing text in a data-friendly way. When text is extracted naively from a PDF, headers, footers, footnotes, and figure captions interrupt the main text of the page, since the main text is not stored separately from these other text elements. Some PDFs make things even worse by being simply scans of printed pages with no copyable text available, or by replacing Maltese characters with other characters that are disguised using special fonts.
While languages with a large amount of online text, like English, can tolerate this noise, the small amount of available Maltese text needs to be handled with precision to avoid wasting data. To address this, the NOMOCRAT project aims to create a Maltese page layout analyser and a Maltese OCR system. A layout analyser automatically detects the different text elements in a page, allowing one to focus only on the main text of the page, while an OCR system transcribes what is visually on the page rather than relying on the digital text within the PDF. In this talk, Dr Marc Tanti shall describe the progress of this project.