AI for partially observable stochastic games

Please use this identifier to cite or link to this item: https://www.um.edu.mt/library/oar/handle/123456789/145603

Title:	AI for partially observable stochastic games
Authors:	Cutajar, Cristina (2026)
Keywords:	Reinforcement learning; Algorithms; Card games
Issue Date:	2026
Citation:	Cutajar, C. (2026). AI for partially observable stochastic games (Master's dissertation).
Abstract:	Reinforcement Learning (RL) has shown significant success in a variety of game environments. Despite this, some game characteristics continue to challenge RL. Jaipur, a competitive two‐player turn‐based board game, contains a number of such challenging features. It is characterised by partial observability, stochasticity and a large discrete action space of 25,499 possible actions. It also combines elements of randomness, immediate and long‐term rewards with different consequences, and multiple viable strategies. This study analyses the performance of several state‐of‐the‐art techniques and algorithms that can be implemented not only to mitigate the complexities of applying RL to Jaipur, but also to improve performance and training efficiency. We propose implementing action masking, action embedding, hierarchical RL with a centralised critic, and policy cloning. Hyperparameter tuning was applied using the PBT algorithm and, to evaluate the effects of partial observability on the RL process, different levels of observability were provided in separate experiments. For each implementation, the PPO, A2C, DQN and DDQN algorithms were trained with two separate policies, to reflect Jaipur’s two competitive players. The scores obtained during training, as well as those obtained when the policies of each model played against each other over 1,000 unique games, were evaluated quantitatively against scores obtained by human players and by the other models. The results demonstrate that all algorithms and techniques achieved good scores, comparable to those achieved by humans. Action masking delivered the best overall performance, with high scores achieved at high computational efficiency across most algorithms. Action embedding obtained better scores for PPO but required the longest training times, whilst the training times for DQN and DDQN were the shortest. Hierarchical RL with a centralised critic provided greater training stability; however, the scores achieved were lower across most models and the training times were significantly longer. Both hyperparameter tuning and policy cloning proved beneficial, as the performance of the algorithms improved in fewer training steps. Meanwhile, the varying levels of observability had minimal impact on the policies’ performance, suggesting that the algorithms managed to discover strong strategies that achieved high scores even with partial observability. Furthermore, an action selection analysis of the policies’ decisions during simulated games was carried out, from which it was concluded that all the policies adopted intelligent and interesting strategies, similar to those of human players.
Description:	M.Sc.(Melit.)
URI:	https://www.um.edu.mt/library/oar/handle/123456789/145603
Appears in Collections:	Dissertations - FacICT - 2026

Files in This Item:

File	Description	Size	Format
2619ICTICT501000014918_1.PDF		24.1 MB	Adobe PDF