Please use this identifier to cite or link to this item:
https://www.um.edu.mt/library/oar/handle/123456789/141973
| Title: | A transfer learning approach to facial image caption generation: generating captions of images of faces from Face2Text |
| Authors: | Abdilla, Shaun (2021) |
| Keywords: | Subtitles (Motion pictures, television, etc.); Generative artificial intelligence -- Malta; Convolutions (Mathematics); Neural networks (Computer science) |
| Issue Date: | 2021 |
| Citation: | Abdilla, S. (2021). A transfer learning approach to facial image caption generation: generating captions of images of faces from Face2Text (Master's dissertation). |
| Abstract: | Current caption generation models do not adequately describe the subject’s appearance when faced with images of human faces. The creation of the Face2Text dataset led us to explore the feasibility of using transfer learning from domain-relevant models to build a model for this purpose. We build an encoder-decoder Convolutional Neural Network (CNN) - Long Short-Term Memory (LSTM) pipeline model, employing an attention mechanism and VGGFace/ResNet CNNs, to compare different optimized variants and determine the suitability of the captions generated from the Face2Text dataset (illustrative sketches of the architecture and the evaluation metrics follow the record below). Comparisons are drawn through both automated metrics and human evaluation by 76 English-speaking participants. According to human evaluation, the captions generated by the VGGFace-LSTM + Attention model are closest to the ground truth. The highest METEOR score (0.4834) is obtained by the RGFA (ResNet, GloVe, Attention) model; the REFA (ResNet, Uninitialised Word Embeddings, Attention) model obtains the highest CIDEr and CIDEr-D results (1.2520 and 0.6860 respectively), whilst the best BLEU-4 result (0.2538) is shared by the RGFA and REFA models. There is less agreement between raters and only a weak correlation between human evaluation and automated metrics. Qualitatively, most captions give encouraging results, although the model struggles when faced with abnormal facial images. We were successful in our main aim of developing a facial image captioning model for Face2Text using transfer learning, with the generated captions being particularly detailed. Although the results are already fit for use in some areas, such as image retrieval and assistive description for users who are blind, they should be considered a starting point: an encouraging result and a baseline for future work. |
| Description: | M.Sc.(Melit.) |
| URI: | https://www.um.edu.mt/library/oar/handle/123456789/141973 |
| Appears in Collections: | Dissertations - FacICT - 2021; Dissertations - FacICTAI - 2021 |
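The architecture named in the abstract, a frozen pretrained CNN encoder feeding an LSTM decoder through an attention mechanism, can be pictured with a short sketch. This is not the dissertation's code: it is a minimal PyTorch illustration in which a torchvision ResNet-50 stands in for the VGGFace/ResNet encoders, and all layer sizes, variable names, and the 224x224 input are assumptions made here for concreteness.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class Encoder(nn.Module):
    """Frozen ResNet-50 backbone (transfer learning): returns a grid of
    spatial features the attention mechanism can weight at each step."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Drop the pooling and classification head; keep the conv feature maps.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        for p in self.backbone.parameters():
            p.requires_grad = False  # reuse pretrained weights unchanged

    def forward(self, images):                   # (B, 3, 224, 224)
        feats = self.backbone(images)            # (B, 2048, 7, 7)
        return feats.flatten(2).transpose(1, 2)  # (B, 49, 2048)

class AttentionDecoder(nn.Module):
    """LSTM decoder with additive (Bahdanau-style) attention over the
    49 spatial regions produced by the encoder."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512, feat_dim=2048):
        super().__init__()
        # For a GloVe variant (as in RGFA), pretrained vectors would be
        # copied into self.embed.weight before training.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.att_feat = nn.Linear(feat_dim, hidden_dim)
        self.att_hid = nn.Linear(hidden_dim, hidden_dim)
        self.att_out = nn.Linear(hidden_dim, 1)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions):          # teacher forcing
        B, T = captions.shape
        h = feats.new_zeros(B, self.lstm.hidden_size)
        c = feats.new_zeros(B, self.lstm.hidden_size)
        logits = []
        for t in range(T):
            # Score each region against the current decoder state.
            scores = self.att_out(torch.tanh(
                self.att_feat(feats) + self.att_hid(h).unsqueeze(1)))
            alpha = torch.softmax(scores, dim=1)  # (B, 49, 1)
            context = (alpha * feats).sum(dim=1)  # weighted image summary
            x = torch.cat([self.embed(captions[:, t]), context], dim=1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.fc(h))
        return torch.stack(logits, dim=1)         # (B, T, vocab_size)
```

At inference time the decoder would instead feed its own previous prediction (greedy or beam search) back in; swapping the backbone for a VGGFace network and toggling the embedding initialisation is what distinguishes the VGGFace/ResNet and GloVe/uninitialised variants compared in the study.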
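The BLEU-4, METEOR, CIDEr and CIDEr-D figures quoted in the abstract are standard captioning metrics computed against multiple reference captions per image. As an illustration only, the snippet below shows a BLEU-4 computation with NLTK; the captions are invented examples, not Face2Text data, and in practice METEOR and CIDEr would come from an evaluation package such as pycocoevalcap rather than being hand-rolled.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Each image has several reference captions; BLEU-4 compares the
# generated caption against all of them at once.
references = [
    [["a", "young", "man", "with", "short", "dark", "hair"],
     ["a", "man", "with", "dark", "hair", "and", "a", "slight", "smile"]],
]
hypotheses = [["a", "young", "man", "with", "dark", "hair"]]

# Smoothing avoids zero scores when a short caption has no 4-gram overlap.
smooth = SmoothingFunction().method1
bleu4 = corpus_bleu(references, hypotheses,
                    weights=(0.25, 0.25, 0.25, 0.25),
                    smoothing_function=smooth)
print(f"BLEU-4: {bleu4:.4f}")
```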
Files in This Item:
| File | Description | Size | Format |
|---|---|---|---|
| 2219ICTICS520005015859_1.PDF | Restricted Access | 16.82 MB | Adobe PDF |
Items in OAR@UM are protected by copyright, with all rights reserved, unless otherwise indicated.
