Please use this identifier to cite or link to this item: https://www.um.edu.mt/library/oar/handle/123456789/64245
Title: Named entity recognition for Maltese : a scenario for a low resource language
Authors: Vella, Deborah
Keywords: Natural language processing (Computer science)
Computational linguistics -- Malta
Maltese language
Issue Date: 2020
Citation: Vella, D. (2020). Named entity recognition for Maltese: a scenario for a low resource language (Bachelor's dissertation).
Abstract: Named Entity Recognition (NER) is a subtask of Natural Language Processing (NLP) in which named entities such as Person, Organisation and Location are identified and labelled in text. NER contributes substantially to Information Extraction, as it identifies the named entities from which further information, such as the relations between them, can be extracted. Generally, state-of-the-art NER systems are trained on large corpora in which named entities have already been tagged by human annotators. However, not all languages have such large corpora available. In fact, Maltese has neither NER-annotated datasets nor previously created NER models. Hence, the aim of this study is to review previous work on low-resource NER in order to tackle the challenging task of creating and evaluating the first Maltese NER system. For this research, we created a small dataset of 500 sentences extracted from a publicly available Maltese corpus and manually annotated them with Person, Location, Organisation and Miscellaneous entities using the BIO tagging scheme. To augment our dataset, we experiment with transfer learning by including datasets from other languages, namely English, Italian, Spanish and Dutch, as these can be considered relatively similar to Maltese. Our experiments evaluate two techniques: Conditional Random Fields (CRF) and, as a deep learning approach, Bidirectional Long Short-Term Memory Conditional Random Fields (BiLSTM-CRF). Our experiments also had to consider a number of scenarios, since there were no specific annotation guidelines for Maltese. Initially, tags were limited to Person, Organisation and Location, with the Miscellaneous tag introduced later for further experimentation. The tag set also had to match what is available in the selected multilingual NER datasets, so that the transferred annotations were consistent with the Maltese ones.
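As an illustration of the BIO tagging scheme mentioned above, the following is a minimal sketch and not the dissertation's own code. The example sentence, the short label names (`PER`, `ORG`, `LOC`, `MISC`) and the helper names `spans_to_bio` and `drop_label` are assumptions made for illustration. Mapping Miscellaneous tags to `O` is only one possible way to emulate the scenarios that leave that tag out; the record does not say how this was handled.

```python
from typing import List, Tuple

# (start token index, end token index exclusive, entity label)
Span = Tuple[int, int, str]


def spans_to_bio(num_tokens: int, spans: List[Span]) -> List[str]:
    """Convert entity spans into one BIO tag per token."""
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


def drop_label(tags: List[str], label: str = "MISC") -> List[str]:
    """Map every tag of the given entity label to 'O' (assumed handling)."""
    return ["O" if t.endswith(f"-{label}") else t for t in tags]


# Hypothetical sentence, used only to show the tag layout.
tokens = ["Maria", "Borg", "works", "at", "the",
          "University", "of", "Malta", "in", "Msida"]
spans = [(0, 2, "PER"), (5, 8, "ORG"), (9, 10, "LOC")]

for token, tag in zip(tokens, spans_to_bio(len(tokens), spans)):
    print(f"{token}\t{tag}")
```

Each entity opens with a `B-` tag and continues with `I-` tags, while tokens outside any entity receive `O`.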
We also experimented with the size of the multilingual corpora to analyse the impact that other languages can have on Maltese NER. This was done incrementally: the first corpus contains Maltese only, and each of the others adds one of the following numbers of sentences from each language: 200, 300, 400 and 500. These experiments resulted in a large number of setups, totalling 40 distinct experiments. The best results were obtained by three equally successful systems, all from the BiLSTM-CRF experiments. One of these systems was trained on Maltese and 300 extra sentences from the other languages, without making use of the Miscellaneous tag. The other two systems were trained on Maltese together with 400 and 500 extra sentences, respectively, from each language, excluding Dutch.
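The incremental corpus construction described in the abstract could be sketched roughly as follows. This is a hedged illustration under stated assumptions, not the author's pipeline: the function name `build_training_sets`, the random sampling strategy, the fixed seed and the placeholder sentences are all hypothetical, and the record does not say how sentences were selected from each language.

```python
import random
from typing import Dict, List, Sequence


def build_training_sets(
    maltese: Sequence[str],
    other: Dict[str, Sequence[str]],
    sizes: Sequence[int] = (200, 300, 400, 500),
    seed: int = 0,
) -> Dict[int, List[str]]:
    """Return training sets keyed by the number of extra sentences per language.

    Key 0 is the Maltese-only corpus; every other key n holds the Maltese
    sentences plus n sentences sampled from each additional language.
    """
    rng = random.Random(seed)  # assumed: random sampling with a fixed seed
    sets: Dict[int, List[str]] = {0: list(maltese)}
    for n in sizes:
        mixed = list(maltese)
        for lang in sorted(other):
            pool = list(other[lang])
            mixed.extend(rng.sample(pool, min(n, len(pool))))
        sets[n] = mixed
    return sets


# Placeholder data standing in for annotated sentences.
maltese = [f"mt-{i}" for i in range(500)]
others = {lang: [f"{lang}-{i}" for i in range(600)]
          for lang in ("en", "it", "es", "nl")}

with_dutch = build_training_sets(maltese, others)
# One scenario in the abstract leaves Dutch out of the mix.
without_dutch = build_training_sets(
    maltese, {k: v for k, v in others.items() if k != "nl"}
)
print({n: len(s) for n, s in with_dutch.items()})
print({n: len(s) for n, s in without_dutch.items()})
```

Each resulting set would then be used to train and evaluate one model configuration, for example a CRF or a BiLSTM-CRF.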
Description: B.SC. ICT (HONS) ARTIFICIAL INTELLIGENCE
URI: https://www.um.edu.mt/library/oar/handle/123456789/64245
Appears in Collections: Dissertations - FacICT - 2020
Dissertations - FacICTAI - 2020

Files in This Item:
File: 20BITAI014 - Vella Deborah.pdf (Restricted Access)
Size: 1.25 MB
Format: Adobe PDF


Items in OAR@UM are protected by copyright, with all rights reserved, unless otherwise indicated.