Generating datasets through data source analysis using ADaGe

Please use this identifier to cite or link to this item: https://www.um.edu.mt/library/oar/handle/123456789/26429

Title:	Generating datasets through data source analysis using ADaGe
Authors:	Xuereb, Matthew
Keywords:	Computer algorithms; Quantitative research
Issue Date:	2017
Abstract:	Generating a synthetic dataset means simulating an existing dataset while retaining or modifying the proportions and characteristics of its real-world counterpart. Using synthetic datasets instead of real-world data has several advantages. Synthetic datasets can safeguard sensitive information by replacing it with fictitious subject matter (Wu et al., 2016). They can also reproduce conditions that cannot be found in real-world datasets by modifying field properties. Changing the distribution of values within a dataset supports extensive testing of new algorithms to identify robustness issues and errors (Ayala-Rivera, McDonagh, Cerqueus, & Murphy, 2013). This can reveal more software bugs in a product, which is essential during testing. This study aims to extend the synthetic data generation tool ADaGe (Camilleri & Bonello, 2016) to semi-automatically infer information from an existing dataset, with the aim of replicating its characteristics and properties in a synthetic dataset. The main objective of this research is to automate the process of extracting data properties from the source database and to infer not only the structure of the data, but also the patterns that can be used to generate semantically similar data. By applying exploratory data analysis (EDA) techniques, relevant information on the relationships between different fields can be acquired. In contrast to existing data generation tools, which generally generate data on a column-by-column basis, this will enable us to replicate the relationships between the various attributes of a table. Results show that by identifying the field types in a table and gathering the relevant statistics, we can acquire the information needed to replicate the relationships between different attributes. Percentage frequency distribution statistics retained relationships between categorical fields, while binning proved effective in preserving the distribution in quantitative fields. Regular expressions were used to define values for text fields which are known to follow a pre-specified pattern.
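Illustrative sketch (not taken from the dissertation): the abstract's two statistical ideas, percentage frequency distributions for categorical fields and binning for quantitative fields, might look roughly like the Python below. All function names here are hypothetical and do not reflect ADaGe's actual implementation; equal-width bins and uniform sampling within a bin are assumptions made for brevity.

```python
import random
from collections import Counter


def frequency_profile(values):
    """Return each distinct value's share of the column as a fraction of 1."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def sample_categorical(profile, n, rng):
    """Draw n values so that their expected shares match the profile."""
    values = list(profile)
    weights = [profile[value] for value in values]
    return rng.choices(values, weights=weights, k=n)


def bin_profile(values, bin_count):
    """Split a numeric column into equal-width bins: (low, high, share)."""
    low, high = min(values), max(values)
    # Fall back to a width of 1.0 when every value is identical.
    width = (high - low) / bin_count or 1.0
    counts = [0] * bin_count
    for value in values:
        # The maximum value is clamped into the last bin.
        index = min(int((value - low) / width), bin_count - 1)
        counts[index] += 1
    total = len(values)
    return [
        (low + i * width, low + (i + 1) * width, counts[i] / total)
        for i in range(bin_count)
    ]


def sample_quantitative(bins, n, rng):
    """Pick a bin by its share, then draw uniformly inside that bin."""
    weights = [share for _, _, share in bins]
    chosen = rng.choices(bins, weights=weights, k=n)
    return [rng.uniform(bin_low, bin_high) for bin_low, bin_high, _ in chosen]
```

Profiling tuples of values, for example `frequency_profile(list(zip(country, city)))`, rather than each column separately is one simple way to keep the relationship between two categorical fields, since only combinations seen in the source can be sampled.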
Description:	B.SC.IT(HONS)
URI:	https://www.um.edu.mt/library/oar//handle/123456789/26429
Appears in Collections:	Dissertations - FacICT - 2017

Files in This Item:

File	Description	Size	Format
17BITSD031.pdf (Restricted Access)		2.5 MB	Adobe PDF
