Please use this identifier to cite or link to this item:
https://www.um.edu.mt/library/oar/handle/123456789/141973
| Title: | A transfer learning approach to facial image caption generation: generating captions of images of faces from Face2Text |
| Authors: | Abdilla, Shaun (2021) |
| Keywords: | Subtitles (Motion pictures, television, etc.); Generative artificial intelligence -- Malta; Convolutions (Mathematics); Neural networks (Computer science) |
| Issue Date: | 2021 |
| Citation: | Abdilla, S. (2021). A transfer learning approach to facial image caption generation: generating captions of images of faces from Face2Text (Master's dissertation). |
| Abstract: | Current caption generation models do not adequately describe the subject’s appearance when faced with images of human faces. The creation of the Face2Text dataset led us to explore the feasibility of using transfer learning from domain-relevant models to build a model for this purpose. We build an encoder-decoder Convolutional Neural Network (CNN) - Long Short-Term Memory (LSTM) pipeline model, employing an attention mechanism and VGGFace/ResNet CNNs, to compare different optimized variants and determine the suitability of the captions generated from the Face2Text dataset (illustrative sketches of the architecture and the evaluation metrics follow the record below). Comparisons are drawn through both automated metrics and human evaluation by 76 English-speaking participants. According to human evaluation, the captions generated by the VGGFace-LSTM + Attention model are closest to the ground truth. The highest METEOR score (0.4834) is obtained by the RGFA (ResNet, GloVe, Attention) model; the REFA (ResNet, Uninitialised Word Embeddings, Attention) model obtains the highest CIDEr and CIDEr-D results (1.2520 and 0.6860 respectively), whilst the best BLEU-4 result (0.2538) is shared by the RGFA and REFA models. There is less agreement between raters and only a weak correlation between human evaluation and automated metrics. Qualitatively, most captions give encouraging results, although the model struggles when faced with abnormal facial images. We were successful in our main aim of developing a facial image captioning model for Face2Text using transfer learning, with the generated captions being particularly detailed. Although the results are already fit for use in some areas, such as image retrieval and assistive description for users who are blind, they should be considered a starting point: an encouraging result and a baseline for future work. |
| Description: | M.Sc.(Melit.) |
| URI: | https://www.um.edu.mt/library/oar/handle/123456789/141973 |
| Appears in Collections: | Dissertations - FacICT - 2021; Dissertations - FacICTAI - 2021 |
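The architecture named in the abstract, a frozen pretrained CNN encoder feeding an LSTM decoder through an attention mechanism, can be pictured with a short sketch. This is not the dissertation's code: it is a minimal PyTorch illustration in which a torchvision ResNet-50 stands in for the VGGFace/ResNet encoders, and all layer sizes, variable names, and the 224x224 input are assumptions made here for concreteness.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class Encoder(nn.Module):
    """Frozen ResNet-50 backbone (transfer learning): returns a grid of
    spatial features the attention mechanism can weight at each step."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Drop the pooling and classification head; keep the conv feature maps.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        for p in self.backbone.parameters():
            p.requires_grad = False  # reuse pretrained weights unchanged

    def forward(self, images):                   # (B, 3, 224, 224)
        feats = self.backbone(images)            # (B, 2048, 7, 7)
        return feats.flatten(2).transpose(1, 2)  # (B, 49, 2048)

class AttentionDecoder(nn.Module):
    """LSTM decoder with additive (Bahdanau-style) attention over the
    49 spatial regions produced by the encoder."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512, feat_dim=2048):
        super().__init__()
        # For a GloVe variant (as in RGFA), pretrained vectors would be
        # copied into self.embed.weight before training.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.att_feat = nn.Linear(feat_dim, hidden_dim)
        self.att_hid = nn.Linear(hidden_dim, hidden_dim)
        self.att_out = nn.Linear(hidden_dim, 1)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions):          # teacher forcing
        B, T = captions.shape
        h = feats.new_zeros(B, self.lstm.hidden_size)
        c = feats.new_zeros(B, self.lstm.hidden_size)
        logits = []
        for t in range(T):
            # Score each region against the current decoder state.
            scores = self.att_out(torch.tanh(
                self.att_feat(feats) + self.att_hid(h).unsqueeze(1)))
            alpha = torch.softmax(scores, dim=1)  # (B, 49, 1)
            context = (alpha * feats).sum(dim=1)  # weighted image summary
            x = torch.cat([self.embed(captions[:, t]), context], dim=1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.fc(h))
        return torch.stack(logits, dim=1)         # (B, T, vocab_size)
```

At inference time the decoder would instead feed its own previous prediction (greedy or beam search) back in; swapping the backbone for a VGGFace network and toggling the embedding initialisation is what distinguishes the VGGFace/ResNet and GloVe/uninitialised variants compared in the study.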
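The BLEU-4, METEOR, CIDEr and CIDEr-D figures quoted in the abstract are standard captioning metrics computed against multiple reference captions per image. As an illustration only, the snippet below shows a BLEU-4 computation with NLTK; the captions are invented examples, not Face2Text data, and in practice METEOR and CIDEr would come from an evaluation package such as pycocoevalcap rather than being hand-rolled.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Each image has several reference captions; BLEU-4 compares the
# generated caption against all of them at once.
references = [
    [["a", "young", "man", "with", "short", "dark", "hair"],
     ["a", "man", "with", "dark", "hair", "and", "a", "slight", "smile"]],
]
hypotheses = [["a", "young", "man", "with", "dark", "hair"]]

# Smoothing avoids zero scores when a short caption has no 4-gram overlap.
smooth = SmoothingFunction().method1
bleu4 = corpus_bleu(references, hypotheses,
                    weights=(0.25, 0.25, 0.25, 0.25),
                    smoothing_function=smooth)
print(f"BLEU-4: {bleu4:.4f}")
```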
Files in This Item:
| File | Description | Size | Format |
|---|---|---|---|
| 2219ICTICS520005015859_1.PDF | Restricted Access | 16.82 MB | Adobe PDF |
Items in OAR@UM are protected by copyright, with all rights reserved, unless otherwise indicated.
