Please use this identifier to cite or link to this item: https://www.um.edu.mt/library/oar/handle/123456789/53023
Title: GO term predictions in CATH : a machine learning approach
Authors: Penza, Kenneth
Keywords: Machine learning
Protein-protein interactions
Computational biology
Bioinformatics
Issue Date: 2019
Citation: Penza, K. (2019). GO term predictions in CATH : a machine learning approach (Master's dissertation).
Abstract: Proteins perform different tasks within an organism, such as regulation and signalling. Protein function is characterised through laboratory experiments or predicted using computational methods, and is described using Gene Ontology (GO) terms. Protein sequencing is the process of determining the amino acid sequence that makes up the protein. Improvements in sequencing technology are making the process more accessible, leading to an ever-increasing growth rate of protein databases. The low throughput of laboratory experiments and the increasing rate at which proteins are deposited in protein databases have made protein function prediction (PFP) a central problem in computational biology. Domains are independent structural units that have their own structure and function. Structural protein databases categorise proteins using structural properties. CATH is a structural database that classifies protein domains using four levels of hierarchy. This research applied machine learning (ML) techniques to improve PFP. The protein function aspect investigated was molecular function. This research uses a labelled ML dataset consisting of GO terms, features extracted from protein sequences, and proportions computed from protein databases such as CATH and PFAM. The problem was tackled by defining five experiments that were executed on Homo sapiens and E. coli datasets. Model performance was measured using Fmax, computed as per the Critical Assessment of Functional Annotation (CAFA) shared task methodology. The first experiment applied automatic feature selection using four different fitness methods based on Random Forest and Support Vector Machine. The second experiment applied different neural network architectures to the datasets. The third experiment applied cross validation to the automatic feature selection process to assess how sensitive that process is to the dataset.
The fourth experiment investigated the amount of training data required by the best-performing ML model for each species identified in the first experiment. The fifth experiment investigated the application of the best-performing ML model for each species identified in the first experiment to other species. The methods selected in the first and second experiments were evaluated on the CAFA3 targets. The Random Forest (RF) with Gini node splitting criterion outperforms the best CAFA2 methods by an Fmax of 0.01 for Homo sapiens and an Fmax of 0.16 for E. coli. The cross validation of the automatic feature selection shows that E. coli models were more sensitive to changes in the dataset than Homo sapiens models. The smaller E. coli dataset explains the sensitivity observed. The training dataset size experiment shows that the models have similar performance levels with the same amount of training data. The experiment that applied species-specific models to different species confirms the intuition that models perform well on species of the same domain, and that performance decreases as evolutionary distance increases. The results show that features based on protein structure and proportions from structural protein databases permit reliable PFP.
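The Fmax measure named in the abstract can be illustrated with a minimal sketch. This is not the dissertation's code: it assumes the commonly described protein-centric CAFA definition, in which precision is averaged over proteins with at least one prediction at a given threshold, recall is averaged over all benchmark proteins, and Fmax is the best resulting F-measure across thresholds. The function name `fmax`, the dictionary layout, and the 0.01 threshold step are illustrative assumptions, and propagation of GO terms to their ancestors is omitted.

```python
from typing import Dict, Set


def fmax(predictions: Dict[str, Dict[str, float]],
         truth: Dict[str, Set[str]],
         steps: int = 100) -> float:
    """Protein-centric Fmax sketch (hypothetical helper, not the author's code).

    predictions: protein id -> {GO term -> confidence score in [0, 1]}
    truth:       protein id -> non-empty set of experimentally annotated GO terms
    steps:       number of thresholds, evaluated at 1/steps, 2/steps, ..., 1.0
    """
    best = 0.0
    for i in range(1, steps + 1):
        threshold = i / steps
        precisions = []   # one entry per protein with >= 1 predicted term
        recall_sum = 0.0  # accumulated over every benchmark protein
        for protein, true_terms in truth.items():
            scores = predictions.get(protein, {})
            predicted = {term for term, s in scores.items() if s >= threshold}
            hits = len(predicted & true_terms)
            if predicted:
                precisions.append(hits / len(predicted))
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue  # no protein has a prediction at this threshold
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(truth)
        if precision + recall > 0:
            f_measure = 2 * precision * recall / (precision + recall)
            best = max(best, f_measure)
    return best


# Toy data (made up for illustration only).
toy_predictions = {
    "P1": {"GO:1": 0.9, "GO:2": 0.4},
    "P2": {"GO:3": 0.8},
}
toy_truth = {
    "P1": {"GO:1"},
    "P2": {"GO:3", "GO:4"},
}
toy_score = fmax(toy_predictions, toy_truth)
```

On the toy data, the best trade-off occurs at thresholds between 0.4 and 0.8, where precision is 1 and recall is 0.75. Because Fmax takes the maximum over thresholds, it rewards a model for its best operating point rather than for any single fixed cut-off.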
Description: M.SC. ARTIFICIAL INTELLIGENCE
URI: https://www.um.edu.mt/library/oar/handle/123456789/53023
Appears in Collections: Dissertations - FacICT - 2019
Dissertations - FacICTAI - 2019

Files in This Item:
File: 19MAIPT011.pdf
Size: 3.84 MB
Format: Adobe PDF


Items in OAR@UM are protected by copyright, with all rights reserved, unless otherwise indicated.