Please use this identifier to cite or link to this item: https://www.um.edu.mt/library/oar/handle/123456789/52960
Title: A big data approach for clustering large chemical datasets
Authors: Cassar, Jurgen
Keywords: Data mining
Machine learning
Big data
Drugs
Issue Date: 2019
Citation: Cassar, J. (2019). A big data approach for clustering large chemical datasets (Master's dissertation).
Abstract: Physically testing compounds for their biological activity with respect to a target protein is an expensive and time-consuming step in the drug discovery process. Clustering is one of the techniques that enables a more efficient method of selecting compounds for testing. This is done by grouping similar molecules together, with the advantage of testing only the compounds from the clusters that contain compounds exhibiting some activity. However, large molecular datasets are challenging to cluster efficiently. Hierarchical clustering techniques have been shown to be the most effective in separating active compounds from inactive ones; however, their time and space complexity make them impractical for large datasets. Distributing clustering algorithms may be a possible solution, with Big Data techniques enabling large-scale distribution of tasks. In this research, D-Butina, a distributed version of the Butina clustering algorithm, was implemented. The algorithm was then extended to create the DLSH-Butina algorithm, which uses an approximation method to identify neighbours. Both implementations obtain satisfactory results, with the D-Butina implementation providing speedups of 2.4 and 3.9 when using 5 and 10 distributed nodes, while DLSH-Butina achieves speedups of 4.1 and 8.4 respectively over the serial approach. Additionally, the clusters produced by the D-Butina and DLSH-Butina algorithms achieve better separation of actives within the generated clusters than Bisecting k-means.
Description: M.SC. ARTIFICIAL INTELLIGENCE
URI: https://www.um.edu.mt/library/oar/handle/123456789/52960
Appears in Collections: Dissertations - FacICT - 2019
Dissertations - FacICTAI - 2019

Files in This Item:
File | Description | Size | Format
19MAIPT004.pdf | | 2.42 MB | Adobe PDF


Items in OAR@UM are protected by copyright, with all rights reserved, unless otherwise indicated.