Enhancing Coverage and Robustness of Database Generators

Rajkumar, S

dc.contributor.advisor	Haritsa, Jayant R
dc.contributor.author	Rajkumar, S
dc.date.accessioned	2021-11-29T05:17:14Z
dc.date.available	2021-11-29T05:17:14Z
dc.date.submitted	2021
dc.identifier.uri	https://etd.iisc.ac.in/handle/2005/5528
dc.description.abstract	Generating synthetic databases that capture essential data characteristics of client databases is a common requirement for enterprise database vendors. This need stems from a variety of use cases, such as application testing and assessing performance impacts of planned engine upgrades. A rich body of literature exists in this area, spanning from the early techniques that simply generated data ab initio to the contemporary ones that use a predefined client query workload to guide the data generation. In the latter category, the specific aim is to ensure volumetric similarity -- that is, assuming a common choice of query execution plans at the client and vendor sites, the output row cardinalities of individual operators in these plans are similar in the original and synthetic databases. Hydra is a recently proposed data regeneration framework that provides volumetric similarity. It also provides a mechanism to generate data dynamically during query execution, using a minuscule database summary. Notwithstanding its desirable characteristics, Hydra has the following critical limitations: (a) limited scope of SQL operators in the input query workload, (b) poor scalability with respect to the number of queries in the input workload, and (c) poor volumetric similarity on unseen queries. The data generation algorithm internally uses a linear programming (LP) solver that throttles the workload scalability. This not only limits the training (seen) workload size but also reduces the accuracy for test (unseen) queries. Robustness towards test queries is further adversely affected by design choices such as a lack of preference among candidate synthetic databases, and artificial skew in the generated data. In this work, we present an enhanced version of Hydra, called High-Fidelity Hydra (HF-Hydra), which attempts to address the above limitations.
To start with, we expand the SQL operator coverage to also include the LIKE operator, and, in certain restricted settings, projection-based operators such as GROUP BY and DISTINCT. To sidestep the challenge of workload scalability, HF-Hydra outputs not one, but a suite of database summaries such that they collectively cover the entire input workload. The division of the workload into the associated sub-workloads is governed by heuristics that aim to balance robustness with LP solvability. For generating richer database summaries, HF-Hydra additionally exploits metadata statistics maintained by the database engine. Further, the database query optimizer is leveraged to make the choice among the various candidate databases. The data generation is also augmented to provide greater diversity in the represented values. Finally, when a test query is fired, HF-Hydra directs it to the database summary that is expected to provide the highest volumetric similarity. We have experimentally evaluated HF-Hydra on a customized set of queries based on the TPC-DS decision-support benchmark framework. We first evaluated the specialized case where each training query has its own summary, and here HF-Hydra achieves perfect volumetric similarity. Further, each summary construction took just under a second, and the summary sizes were on the order of a few tens of kilobytes. Also, our dynamic generation technique produced gigabytes of data in just a few seconds. For the general setting of a limited set of summaries representing the training query workload, the data generated by HF-Hydra was compared with that from Hydra. We observed that HF-Hydra delivers more than forty percent better accuracy for outputs from filter nodes in the plans, while also achieving an improvement of about twenty percent with regard to join nodes.
Further, the degradation in volumetric similarity is minor as compared to the one-summary scenario, while the summary production is significantly more efficient due to reduced overheads on the LP solver. In summary, HF-Hydra takes a substantive step forward with regard to creating expressive, robust, and scalable data regeneration frameworks with immediate relevance to testing deployments.	en_US
dc.language.iso	en_US	en_US
dc.rights	I grant Indian Institute of Science the right to archive and to make available my thesis or dissertation in whole or in part in all forms of media, now or hereafter known. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation	en_US
dc.subject	Big Data Management	en_US
dc.subject	Data Summarization	en_US
dc.subject	Data Warehouse	en_US
dc.subject	OLAP Workload	en_US
dc.subject	DBMS Testing	en_US
dc.subject.classification	Research Subject Categories::TECHNOLOGY	en_US
dc.title	Enhancing Coverage and Robustness of Database Generators	en_US
dc.type	Thesis	en_US
dc.degree.name	MTech (Res)	en_US
dc.degree.level	Masters	en_US
dc.degree.grantor	Indian Institute of Science	en_US
dc.degree.discipline	Engineering	en_US

Files in this item

Name: Enhancing Coverage and Robustness ...
Size: 2.436Mb
Format: PDF
Description: Thesis full text

This item appears in the following Collection(s)

Computer Science and Automation (CSA) [545]
