Please use this identifier to cite or link to this item: https://www.um.edu.mt/library/oar/handle/123456789/120555
Title: Multi-teacher distillation for pretraining a low-resource language
Authors: Chaudhary, Amit Kumar (2023)
Keywords: Maltese language -- Data processing
Natural language processing (Computer science)
Issue Date: 2023
Citation: Chaudhary, A.K. (2023). Multi-teacher distillation for pretraining a low-resource language (Master's dissertation).
Abstract: Masked language models such as BERT have brought a paradigm shift to Natural Language Processing, achieving state-of-the-art results across numerous NLP tasks. However, pretraining such models for low-resource languages remains a considerable challenge. Existing approaches either train language-specific BERT models on large collected text corpora or use massively multilingual models, which suffer a per-language performance drop when many languages are handled in a single model, termed the "curse of multilinguality". A promising solution is MergeDistill (Khanuja et al., 2021), which combines monolingual models of similar languages into a unified multilingual model using knowledge distillation and finds that the combined model outperforms each individual model. However, this approach remains unexplored for low-resource languages. Our research fills this gap by adapting MergeDistill to the low-resource language Maltese. We aim to understand the impact of pretraining masked language models for such low-resource languages through knowledge distillation from linguistically related high-resource languages. Our evaluation focuses on Maltese, involving knowledge distillation from a monolingual Maltese teacher model as well as Arabic, English, and Italian teachers. We analyze the performance differences among monolingual, massively multilingual, and language-specific multilingual models on both semantic and syntactic tasks. Our results indicate that knowledge distillation improves the efficiency of pretraining and that including related languages contributes positively to the evaluation tasks. Furthermore, the combined models retain most of their performance on the evaluation tasks despite being exposed to less pretraining data and being trained for less time than the monolingual models.
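For readers unfamiliar with the distillation objective summarised in the abstract, the following is a minimal PyTorch sketch of a multi-teacher distillation loss. It is an illustration only, not the dissertation's implementation: the function name, the equal weighting of teachers, the temperature value, and the assumption that teacher logits have already been mapped onto the student's vocabulary (as in a MergeDistill-style token mapping) are assumptions made here for clarity.

    # Sketch of a multi-teacher distillation loss for masked-token positions.
    # Assumes each teacher's logits are already projected onto the student vocabulary.
    import torch
    import torch.nn.functional as F

    def multi_teacher_distill_loss(student_logits, teacher_logits_list, temperature=2.0):
        """Average KL divergence between the student's and each teacher's
        temperature-softened output distributions (equal teacher weights assumed)."""
        losses = []
        for teacher_logits in teacher_logits_list:
            s = F.log_softmax(student_logits / temperature, dim=-1)
            t = F.softmax(teacher_logits / temperature, dim=-1)
            # Scale by T^2 so gradients keep a comparable magnitude across temperatures.
            losses.append(F.kl_div(s, t, reduction="batchmean") * temperature ** 2)
        return torch.stack(losses).mean()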
Description: M.Sc. (HLST)(Melit.)
URI: https://www.um.edu.mt/library/oar/handle/123456789/120555
Appears in Collections:Dissertations - FacICT - 2023
Dissertations - FacICTAI - 2023

Files in This Item:
File: 2418ICTCSA531005079270_1.PDF
Size: 4.46 MB
Format: Adobe PDF


Items in OAR@UM are protected by copyright, with all rights reserved, unless otherwise indicated.