Please use this identifier to cite or link to this item:
https://www.um.edu.mt/library/oar/handle/123456789/132780

| Title: | Understanding activity in private and public setups using 3D video content |
| Authors: | Gutev, Alexander (2024) |
| Keywords: | Pattern recognition systems; Image processing -- Digital techniques; Machine learning |
| Issue Date: | 2024 |
| Citation: | Gutev, A. (2024). Understanding activity in private and public setups using 3D video content (Doctoral dissertation). |
| Abstract: | Human action recognition (HAR), which deals with classifying human actions in video, is a core component of behaviour monitoring and has found applications in surveillance, security, sports, traffic management, medical monitoring and assisted living systems. The state of the art in human action recognition depends on machine learning techniques, which require large datasets for training. This results in long training times and high power consumption, which reduces the feasibility of adopting such methods in real‐world applications. Most of the data in video consists of background clutter that is irrelevant to human action classification. Training time can therefore be significantly reduced by limiting the data that is processed to only those regions that capture the action taking place. A viable, low‐cost strategy for identifying these regions is motion saliency detection, given that human action necessitates motion. The main contributions of this work are a novel solution for identifying regions that capture human actions, and a new HAR method that uses this solution to achieve a classification accuracy comparable to the state of the art, with a significant reduction in training and inference time. In this thesis, various solutions to motion saliency detection were explored. A new motion saliency solution was developed, as existing solutions were found to be too computationally intensive for their adoption in a HAR pipeline to yield any reduction in training time. The use of this motion saliency solution in HAR methods was then explored. The highest classification accuracy was achieved with the Model‐based Multimodal Network (MMNet) [1], a multimodal HAR method that fuses the classification results of the skeleton and colour modalities. A new HAR method, MMNet with motion saliency (MMNet‐MS), was developed based on MMNet. Whilst MMNet relies on the OpenPose [2] tool, which estimates skeleton joint coordinates, to identify the regions relevant to action classification, the proposed MMNet‐MS identifies these regions using motion saliency detection, replacing the computationally expensive OpenPose skeleton estimation step. Experimental results showed that the proposed MMNet‐MS achieves a classification accuracy of 75.91% on average, comparable to MMNet's average of 76.67%. In private settings, such as the TST fall detection dataset [3], the accuracy of the proposed MMNet‐MS, 55.69%, surpasses that of MMNet, 29.55%. A significant reduction in training and inference time is achieved: MMNet takes on average 28.63 hours to train and 27 milliseconds to classify an action in a video, while MMNet‐MS takes on average 14.92 hours to train and 9 milliseconds to classify an action. This reduction in training time leads to reduced power consumption in most cases, particularly on the NTU‐60 dataset, where MMNet consumed on average 3.43 kWh during training and MMNet‐MS consumed an average of 1.80 kWh. |
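The abstract describes cropping video to motion-salient regions as the key efficiency step. As a minimal sketch of what that could look like, the snippet below extracts a bounding box around moving pixels via simple frame differencing. The thesis's actual motion saliency algorithm is not specified in this record, so the differencing approach, the function name `motion_salient_roi` and its parameters are all illustrative assumptions, not the author's method.

```python
import cv2
import numpy as np


def motion_salient_roi(frames, thresh=25, min_area=500):
    """Return a bounding box (x, y, w, h) around motion-salient pixels.

    frames: sequence of consecutive grayscale frames (H x W, uint8).
    A frame-differencing proxy for motion saliency: pixels whose
    intensity changes across the clip are assumed to belong to the
    acting person, since human action necessitates motion.
    """
    # Accumulate absolute inter-frame differences over the clip.
    motion = np.zeros(frames[0].shape, dtype=np.float32)
    for prev, curr in zip(frames[:-1], frames[1:]):
        motion += cv2.absdiff(curr, prev).astype(np.float32)

    # Rescale to 0-255 and threshold into a binary saliency mask.
    motion = cv2.normalize(motion, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    _, mask = cv2.threshold(motion, thresh, 255, cv2.THRESH_BINARY)

    # Remove speckle noise before extracting the region.
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))

    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None  # no motion detected in this clip
    x, y = int(xs.min()), int(ys.min())
    w, h = int(xs.max()) - x + 1, int(ys.max()) - y + 1
    if w * h < min_area:
        return None  # region too small to be a person acting
    return x, y, w, h


# Example use in a HAR pipeline: crop every frame to the motion-salient
# region so the classifier never processes the background clutter.
# roi = motion_salient_roi(gray_frames)
# if roi is not None:
#     x, y, w, h = roi
#     cropped = [f[y:y + h, x:x + w] for f in frames]
```

In a pipeline like the one the abstract outlines, such a region extractor would stand in for the per-frame skeleton estimation step, which is where the reported savings in training and inference time would come from.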
| Description: | Ph.D.(Melit.) |
| URI: | https://www.um.edu.mt/library/oar/handle/123456789/132780 |
| Appears in Collections: | Dissertations - FacICT - 2024; Dissertations - FacICTCCE - 2024 |
Files in This Item:
| File | Description | Size | Format |
|---|---|---|---|
| 2501ICTCCE600000008759_1.PDF | | 28.47 MB | Adobe PDF |
Items in OAR@UM are protected by copyright, with all rights reserved, unless otherwise indicated.
