Please use this identifier to cite or link to this item:
https://www.um.edu.mt/library/oar/handle/123456789/52960
Title: | A big data approach for clustering large chemical datasets |
Authors: | Cassar, Jurgen |
Keywords: | Data mining; Machine learning; Big data; Drugs |
Issue Date: | 2019 |
Citation: | Cassar, J. (2019). A big data approach for clustering large chemical datasets (Master's dissertation). |
Abstract: | Physically testing compounds for their biological activity with respect to a target protein is an expensive and time-consuming step in the drug discovery process. Clustering is one of the techniques that enables a more efficient method of selecting compounds for testing. This is done by grouping similar molecules together, with the advantage that only the compounds from clusters containing compounds which exhibit some activity need to be tested. However, large molecular datasets are challenging to cluster efficiently. Hierarchical clustering techniques are shown to be the most effective in separating active compounds from inactive ones; however, their time and space complexity make them impractical for large datasets. Distribution of clustering algorithms may be a possible solution, with Big Data techniques enabling large-scale distribution of tasks. In this research, D-Butina, a distributed version of the Butina clustering algorithm, was implemented. The algorithm was extended to create the DLSH-Butina algorithm, which uses an approximation method to identify neighbours. Both implementations obtain satisfactory results: over the serial approach, the D-Butina implementation provides speedups of 2.4 and 3.9 when using 5 and 10 distributed nodes respectively, while DLSH-Butina achieves speedups of 4.1 and 8.4. Additionally, the clusters generated by the D-Butina and DLSH-Butina algorithms achieve better separation of actives than those generated by Bisecting k-means. |
Description: | M.SC. ARTIFICIAL INTELLIGENCE |
URI: | https://www.um.edu.mt/library/oar/handle/123456789/52960 |
Appears in Collections: | Dissertations - FacICT - 2019; Dissertations - FacICTAI - 2019 |
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
19MAIPT004.pdf | | 2.42 MB | Adobe PDF | View/Open |
Items in OAR@UM are protected by copyright, with all rights reserved, unless otherwise indicated.