Please use this identifier to cite or link to this item: https://www.um.edu.mt/library/oar/handle/123456789/141967
Title: Using language models to analyse cyber threat intelligence reports
Authors: Briffa, Leanne (2025)
Keywords: Cyber intelligence (Computer security)
Artificial intelligence
Computer crimes -- Prevention
Computer security -- Technological innovations
Natural language processing (Computer science)
Anomaly detection (Computer security)
Issue Date: 2025
Citation: Briffa, L. (2025). Using language models to analyse cyber threat intelligence reports (Master’s dissertation).
Abstract: Cybersecurity defenders face increasing challenges in combating cybercrime. While Artificial Intelligence (AI) tools such as anomaly detection are widely employed, the potential of Cyber Threat Intelligence (CTI) remains underutilized. Existing approaches often use language models fine-tuned on Named Entity Recognition (NER) tasks to extract indicators of compromise (IOCs) from CTI reports, but typically target simple indicators identifiable with regular expressions, or entities intended for structured threat documentation (such as malware names) which offer limited value in Security Information and Event Management (SIEM) contexts, rather than complex, context-dependent indicators. Windows registry modifications serve as a suitable case study to investigate this potential, due to the variability of registry values, which may take several forms such as filepaths. Language models may be capable of distinguishing between generic filepaths and those occurring in a registry context, something standard regular expressions cannot achieve. This work therefore aims to assess the extent to which transformer-based language models can extract such context-rich IOCs from CTI text, in comparison to regular expression-based methods. The study begins with an analysis of existing approaches and proceeds to curate a dataset that includes registry-related entities, using a combination of CTI repository harvesting and synthetic data generation. Four transformer models (BERT, SecureBERT, and two variants continually pretrained on registry-related text) are fine-tuned for NER tasks on this dataset. An iterative dataset development process is adopted to address limitations identified in previous versions. The results show that transformer-based models are effective in extracting context-dependent entities such as registry paths, registry value filepaths, and process names.
Such entities are not easily captured through regular expressions, but structured indicators are still best extracted using rule-based methods. Among the models tested, SecureBERT consistently achieves the highest performance, with F1 scores exceeding 90%. The results also indicate that using separate, entity-specific models can be more effective than relying on a single, general-purpose NER model when addressing diverse IOC categories. Building on these insights, an automated CTI pipeline is proposed that integrates regular expressions and trained NER models to extract IOCs, alongside mapping attack patterns to MITRE ATT&CK techniques through semantic similarity. Together, these can be used to formulate queries using the Elastic Query Language for retrieving corresponding log events in Elastic SIEM. The approach leverages Elastic’s ATT&CK-mapped ruleset, where extracted IOCs populate variable fields in pre-defined detection rules, optimizing defender resources and improving alerting by focusing on known, actively exploited threats. Ultimately, this work presents an unprecedented investigation into the end-to-end use of language models for CTI processing and contributes an improved NER dataset including registry-related entities.
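The abstract's contrast between regular expressions and context-aware models can be illustrated with a minimal Python sketch. The two patterns, the function name `extract_with_regex`, and the sample sentence are hypothetical and are not taken from the dissertation; they only show that a regular expression can find registry keys and filepaths by shape, but cannot tell which filepath is the value written to a registry key.

```python
import re

# Hypothetical patterns for illustration only; not the dissertation's rules.
REGISTRY_KEY = re.compile(r"\b(?:HKLM|HKCU|HKEY_[A-Z_]+)\\[^\s,;]+")
FILE_PATH = re.compile(r"\b[A-Za-z]:\\[^\s,;]+")


def extract_with_regex(text):
    """Return registry keys and filepaths matched purely by their shape."""
    return {
        "registry_keys": REGISTRY_KEY.findall(text),
        "file_paths": FILE_PATH.findall(text),
    }


# Invented CTI-style sentence (assumption, not from any real report).
sample = (
    r"The dropper copies itself to C:\Users\Public\svc.exe and sets "
    r"HKCU\Software\Microsoft\Windows\CurrentVersion\Run to "
    r"C:\ProgramData\updater.exe for persistence. "
    r"Logs are written to C:\Temp\out.log on exit."
)

result = extract_with_regex(sample)
print(result["registry_keys"])
print(result["file_paths"])
# All three filepaths are returned in one undifferentiated list. Only
# C:\ProgramData\updater.exe is the registry value here, and deciding that
# requires reading the surrounding sentence, which is the kind of
# context-dependent labelling the abstract attributes to NER models.
```

In this sketch the registry key itself is a structured indicator that a rule captures well, while the role of each filepath is left unresolved, which mirrors the division of labour between rule-based extraction and trained NER models described in the abstract.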
Description: M.Sc.(Melit.)
URI: https://www.um.edu.mt/library/oar/handle/123456789/141967
Appears in Collections: Dissertations - FacICT - 2025
Dissertations - FacICTAI - 2025

Files in This Item:
File: 2519ICTICS520000013750_1.PDF (Restricted Access)
Size: 1.41 MB
Format: Adobe PDF


Items in OAR@UM are protected by copyright, with all rights reserved, unless otherwise indicated.