Web document clustering using CUDA

Please use this identifier to cite or link to this item: https://www.um.edu.mt/library/oar/handle/123456789/92813

Title:	Web document clustering using CUDA
Authors:	Pirrone, Mauro (2010)
Keywords:	Document clustering File organization (Computer science) Information storage and retrieval systems Graphics processing units
Issue Date:	2010
Citation:	Pirrone, M. (2010). Web document clustering using CUDA (Bachelor's dissertation).
Abstract:	The World Wide Web can be considered as the biggest source of information and millions of online users engage in information retrieval on a daily basis. However, the rapid growth of the internet in such a short period of time is what makes web clustering a challenging task. It is in fact a subset of the larger field of data clustering, which amongst others borrows concepts from the fields of information retrieval (IR), natural language processing (NLP), and machine learning (ML). The main driving force behind this project is the use of parallel computing techniques that can effectively use computational resources to greatly improve the overall performance. The goal here is to investigate the performance of the Graphics Processing Unit (GPU), a processor capable of making billions of transactions per second, using very large data sets. This will be applied in a high dimensional space such as web document clustering whereby time and space complexity play a significant role. NVIDIA's Compute Unified Device Architecture (CUDA) is a software platform for mas sively parallel high-performance computing on the company's powerful GPUs. The appli cations for CUDA are beyond pure graphics - in this project, CUDA has been used to accelerate the k-means algorithm which is one of the most popular unsupervised clustering algorithms. Many enhancements of the original algorithm have been put forward; however, many of them are based on Single Instruction Single Data (SISD) architecture processors also known as CPUs, which partly ignored the inherent paralleled characteristic of the algorithms. In order to accelerate the k-means clustering algorithm, both data object labeling and centroids calculation were offloaded to the GPU in parallel. The algorithm has been designed with fine-grain parallelism and takes advantage of CUDA's fast shared memory. A common problem that plagues a clustering algorithm that deals with massive amount of data, is that loading the entire data set into memory is not possible. For this reason, a divide-and-conquer approach has been adopted which divides the data set into subsets that can fit in the main memory (and GPU memory) and apply the clustering algorithm separately on these subsets. The clustering algorithm has been developed in a modular fashion and can be applied to many clustering applications. We have combined the k-means algorithm in an Information Retrieval application while making use of the Vector Space Model (VSM) as a data set for the clustering algorithm. The VSM has been created in a separate module using Wikipedia as sample data and although the focus is mainly on the performance of the clustering algorithm, basic evaluation on the precision of the resulting clusters and IR system have been included. The Intel Core i7 920 and NVIDIA GTX 260 have been used to evaluate the performance of the GPU implementation over the single threaded CPU version. Experiments showed that when using data sets of small dimension, the GPU-accelerated version is 4x to 20x faster. For data sets in a very large dimensional space, such as those used in web document clustering, the GPU-accelerated version is, on average, 2x to 4x faster. Speed-up depends on various parameters, mainly on the number of clusters, the number of dimensions, and data set size. From these results, we can show the potential of the CUDA programming model as a general purpose parallel programming paradigm.
Description:	B.Sc. IT (Hons)(Melit.)
URI:	https://www.um.edu.mt/library/oar/handle/123456789/92813
Appears in Collections:	Dissertations - FacICT - 2010 Dissertations - FacICTCS - 2010-2015

Files in This Item:

File	Description	Size	Format
B.SC.IT(HONS)_Pirrone_Mauro_2010.PDF Restricted Access		10.77 MB	Adobe PDF	View/Open Request a copy

Show full item record Statistics