Please use this identifier to cite or link to this item: https://www.um.edu.mt/library/oar/handle/123456789/92112
Title: Scaling protein motif discovery using tries in ‘Apache Spark’
Authors: Briffa, Ethan Joseph (2021)
Keywords: Spark (Electronic resource : Apache Software Foundation)
Data structures (Computer science)
Information retrieval
Sequence alignment (Bioinformatics)
Issue Date: 2021
Citation: Briffa, E.J. (2021). Scaling protein motif discovery using tries in ‘Apache Spark’ (Bachelor's dissertation).
Abstract: Bioinformatics applies computational techniques to biology. Proteins are large molecules that perform specific functions in organisms, and understanding them requires discovering motifs: fixed patterns inside protein sequences that are indicative of a protein’s structure and function. This research attempts to speed up motif discovery by comparing unknown protein sequences to known protein domains as classified in the CATH hierarchy. The approach uses the Multiple Sequence Alignments (MSAs) of proteins found in CATH Functional Families. Each MSA contains motifs whose sequence regions have been preserved through evolution, known as conserved regions. The representative sequences for the Functional Families are stored in a suffix trie, which is then searched to find potential structures. To make the search more efficient, the suffix trie is implemented on the Spark framework, which is designed to process large amounts of data. The Spark architecture offers scalability by distributing processing over a number of nodes, thereby speeding up the search. A scoring algorithm then determines the best match by ranking the output according to how closely it matches a known structural motif. A substitution matrix is also used to account for all possible variations of the conserved regions. The system is compared against a library of Hidden Markov Models. The results it produces are comparable to those of the benchmark system, indicating that the approach has potential.
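The search idea summarised in the abstract can be illustrated with a minimal single-machine sketch. It is not the dissertation's implementation: the class and function names, the family labels, and the toy substitution scores are all hypothetical, and the Spark distribution step is left out. It only shows how a suffix trie over representative sequences might be walked while scoring each residue pair with a substitution table, so that near matches to a conserved region are ranked rather than discarded.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy substitution scores (assumption); a real system would
# load a full matrix such as BLOSUM62 instead.
TOY_SCORES: Dict[Tuple[str, str], int] = {
    ("I", "L"): 2, ("L", "I"): 2,
    ("D", "E"): 2, ("E", "D"): 2,
}


def pair_score(a: str, b: str, match: int = 4, mismatch: int = -4) -> int:
    """Score one query residue against one trie residue."""
    if a == b:
        return match
    return TOY_SCORES.get((a, b), mismatch)


def new_node() -> dict:
    """A trie node: child nodes by residue, plus the families passing through."""
    return {"children": {}, "families": set()}


class SuffixTrie:
    def __init__(self) -> None:
        self.root = new_node()

    def add(self, sequence: str, family: str) -> None:
        """Insert every suffix of `sequence`, tagging each node with `family`."""
        for start in range(len(sequence)):
            node = self.root
            for residue in sequence[start:]:
                node = node["children"].setdefault(residue, new_node())
                node["families"].add(family)

    def search(self, query: str, min_score: int) -> Dict[str, int]:
        """Return the best score per family over all trie paths of
        len(query) residues whose total score is at least `min_score`."""
        best: Dict[str, int] = {}
        stack = [(self.root, 0, 0)]  # (node, depth, accumulated score)
        while stack:
            node, depth, score = stack.pop()
            if depth == len(query):
                if score >= min_score:
                    for family in node["families"]:
                        best[family] = max(best.get(family, score), score)
                continue
            for residue, child in node["children"].items():
                step = pair_score(query[depth], residue)
                stack.append((child, depth + 1, score + step))
        return best


def rank(hits: Dict[str, int]) -> List[Tuple[str, int]]:
    """Order families by descending score, then by name."""
    return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))


trie = SuffixTrie()
trie.add("MKVLDE", "famA")  # hypothetical representative sequences
trie.add("GKVIDA", "famB")
print(rank(trie.search("KVLD", min_score=10)))
# [('famA', 16), ('famB', 14)]
```

This sketch explores every path of the query's length, which is exponential in the worst case; a practical version would prune branches whose score can no longer reach the threshold. In a distributed setting, one plausible arrangement (again an assumption) is to share the trie with the workers and partition the query sequences across them.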
Description: B.Sc. IT (Hons)(Melit.)
URI: https://www.um.edu.mt/library/oar/handle/123456789/92112
Appears in Collections: Dissertations - FacICT - 2021
Dissertations - FacICTCIS - 2021

Files in This Item:
File: 21BITSD008.pdf
Description: Restricted Access
Size: 2.28 MB
Format: Adobe PDF


Items in OAR@UM are protected by copyright, with all rights reserved, unless otherwise indicated.