Title: Automated dataset generator
Authors: Camilleri, Sheryl
Keywords: Computer software -- Testing
Big data
Database management
Issue Date: 2016
Abstract: Over the years, the effort once put into maintaining a system has shifted to testing it. In fact, many IT organisations today spend more than 40% of a project’s budget on software testing (Capgemini, 2015; Tata, 2015). One way of making the testing process easier is to automatically generate synthetic data that resembles the data entered in the real environment. An automated process is valuable because real datasets can be hard to obtain due to data sensitivity and confidentiality issues. Generating synthetic data is not an easy task because each system has different requirements. For instance, some systems have a well-defined data structure and the size of the data they handle can be estimated, whereas others, such as real-time processing systems, have no control over the amount of data they receive. ADaGe provides a user interface that enables the end-user to define and tweak the dataset definition to satisfy the system’s requirements. The end-user can either generate datasets of a fixed size or stream the generated data. ADaGe can generate structured data, including basic data types such as random strings, integers and real numbers, as well as first and last names drawn from existing datasets. Additionally, it can generate unstructured data by reading from a book collection. To produce statistically more realistic data, ADaGe can also generate values according to the normal distribution. Moreover, ADaGe makes use of current technologies to scale according to the user’s requirements, which allows it to generate large datasets efficiently by distributing the work amongst several nodes on a cluster. Results show that the application can efficiently generate small to large datasets with varying data types. When compared to other open-source tools, ADaGe proved to be scalable and time-efficient. ADaGe is also capable of generating a stream of data that the user can stop with an action. With regard to the generation of values according to the normal distribution, statistical tests show that the generated data fit the distribution. Lastly, the application was successfully used to generate data for real-life scenarios.
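The two generation modes described in the abstract (a fixed-size dataset versus a stream that runs until the consumer stops it) can be illustrated with a minimal Python sketch. This is not ADaGe's implementation, which this record does not show; the column names, value ranges, distribution parameters and the helper functions `make_definition`, `stream_rows` and `fixed_rows` are all hypothetical and chosen only for illustration.

```python
import itertools
import random
import string


def make_definition(rng):
    """Hypothetical dataset definition: column name -> value generator."""
    return {
        # Random 8-letter lowercase string.
        "code": lambda: "".join(rng.choices(string.ascii_lowercase, k=8)),
        # Integer in [0, 100].
        "count": lambda: rng.randint(0, 100),
        # Real number drawn uniformly between 0.0 and 1.0.
        "ratio": lambda: rng.uniform(0.0, 1.0),
        # Normally distributed value (assumed mean 170, std dev 10).
        "height_cm": lambda: rng.gauss(170.0, 10.0),
    }


def stream_rows(definition):
    """Yield rows indefinitely; the consumer decides when to stop."""
    while True:
        yield {name: make() for name, make in definition.items()}


def fixed_rows(definition, size):
    """Take exactly `size` rows from the stream."""
    return list(itertools.islice(stream_rows(definition), size))


rng = random.Random(42)  # seeded so that runs are repeatable
definition = make_definition(rng)
rows = fixed_rows(definition, 1000)
```

Building the fixed-size mode on top of the stream keeps a single code path for row generation. A streaming consumer would simply iterate over `stream_rows(definition)` and break out of the loop when the user asks to stop.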
Description: B.SC.IT(HONS)
Appears in Collections: Dissertations - FacICT - 2016
Dissertations - FacICTCIS - 2016

Files in This Item:
Access: Restricted
Size: 2.51 MB
Format: Adobe PDF

Items in OAR@UM are protected by copyright, with all rights reserved, unless otherwise indicated.