Show simple item record

dc.contributor.advisor    Haritsa, Jayant R
dc.contributor.author    Sanghi, Anupam
dc.date.accessioned    2022-12-21T06:23:27Z
dc.date.available    2022-12-21T06:23:27Z
dc.date.submitted    2022
dc.identifier.uri    https://etd.iisc.ac.in/handle/2005/5959
dc.description.abstract    Database software vendors often need to generate synthetic databases for a variety of applications, including (a) testing database engines and applications, (b) data masking, (c) benchmarking, (d) creating what-if scenarios, and (e) assessing the performance impact of planned engine upgrades. The synthetic databases are targeted toward capturing the desired schematic properties (e.g., keys, referential constraints, functional dependencies, domain constraints), as well as the statistical data profiles (e.g., value distributions, column correlations, data skew, output volumes) hosted on these schemas. Several data generation frameworks have been proposed for OLAP over the past three decades. The early efforts focused on ab initio generation based on standard mathematical distributions. Subsequently, there was a shift to database-dependent regeneration, which aims to create a database with statistical properties similar to those of a specific client database. However, these mechanisms could not satisfactorily mimic the customer's query-processing environment. The contemporary school of thought generates workload-aware data: it takes query execution plans from the customer workloads as input and guarantees volumetric similarity, i.e., the intermediate row cardinalities obtained at the client and vendor sites are very similar when matching query plans are executed. This similarity helps to preserve the multi-dimensional layout and flow of the data, a prerequisite for achieving similar performance on the client's workload. However, even in this category, the existing frameworks are hampered by limitations such as the inability to (a) provide a comprehensive algorithm to handle queries based on the core relational algebra operators, namely Select, Project, and Join; (b) scale to big data volumes; (c) scale to large input workloads; and (d) provide high accuracy on unseen queries.

In this work, motivated by the above lacunae, we present Hydra, a data regeneration tool that materially addresses these challenges by adding functionality, dynamism, scale, and robustness. Firstly, extended workload coverage is provided through a comprehensive solution for modeling select-project-join relational algebra operators. Specifically, the constraints are represented as a linear feasibility problem in which each variable represents the volume of a partitioned region of the data space. Our partitioning scheme for filter constraints permits the regions to be non-convex and ensures the minimum number of regions, thereby greatly reducing the problem complexity as compared to the rectangular grid-partitioning advocated in the prior literature. Similarly, our projection subspace division and projection isolation strategies address the critical challenge of capturing unions, as opposed to summations, when incorporating projection constraints. Finally, by creating referential constraints over denormalized equivalents of the tables, Hydra delivers a comprehensive solution that also handles join constraints. Secondly, a unique feature of our data regeneration approach is that it delivers a database summary as the output, rather than the static data itself. This summary is of negligible size and depends only on the query workload, not on the database scale; it can be used to generate data dynamically during query execution. Therefore, the enormous time and space overheads incurred by prior techniques in generating and storing the data before initiating analysis are eliminated. Our experience is that the summaries for complex big data client scenarios comprising over a hundred queries are constructed within just a few minutes, requiring only a few MBs of storage. Thirdly, to improve accuracy toward unseen queries, Hydra additionally exploits metadata statistics maintained by the database engine. Specifically, it adds an objective function to the linear program to pick a solution with an improved inter-region tuple distribution. Further, a uniform distribution of tuples within regions is modeled to obtain a spread of values. These techniques facilitate the careful selection of a desirable database from the candidate synthetic databases, and also provide metadata compliance.

The proposed ideas have been evaluated on the TPC-DS synthetic benchmark, as well as on real-world benchmarks based on the Census and IMDB databases. Further, the Hydra framework has been prototyped in a Java-based tool that provides a visual and interactive demonstration of the data regeneration pipeline. The tool has been warmly received by both academic and industrial communities.    en_US
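The abstract's core formulation, in which cardinality constraints from query plans become a linear feasibility problem over region volumes, can be illustrated with a toy sketch. This is not Hydra's code; the scenario (one table, two predicates, four regions) and the use of `scipy.optimize.linprog` are purely illustrative assumptions:

```python
# Illustrative sketch (not Hydra's implementation): the cardinality
# constraints derived from a query plan form a linear feasibility
# problem in which each variable x_i is the tuple count ("volume")
# of one region of the partitioned data space.
import numpy as np
from scipy.optimize import linprog

# Hypothetical scenario: a table of 1000 rows, partitioned into 4
# regions by two workload filter predicates A and B:
#   x0: satisfies A only, x1: A and B, x2: B only, x3: neither.
# Observed plan cardinalities: |sigma_A(R)| = 600, |sigma_B(R)| = 300.
A_eq = np.array([
    [1, 1, 1, 1],   # all region volumes sum to the table size
    [1, 1, 0, 0],   # regions whose tuples satisfy predicate A
    [0, 1, 1, 0],   # regions whose tuples satisfy predicate B
])
b_eq = np.array([1000, 600, 300])

# A zero objective asks only for feasibility; per the abstract, Hydra
# additionally adds an objective (not modeled here) that steers the
# solver toward a tuple distribution consistent with engine metadata.
res = linprog(c=np.zeros(4), A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * 4, method="highs")
volumes = res.x  # one synthetic tuple count per region
```

Any non-negative solution of this system yields a synthetic database whose intermediate cardinalities match the client's for these two predicates, which is the volumetric-similarity guarantee described above.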
dc.language.iso    en_US    en_US
dc.rights    I grant Indian Institute of Science the right to archive and to make available my thesis or dissertation in whole or in part in all forms of media, now or hereafter known. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.    en_US
dc.subject    Database Testing    en_US
dc.subject    Data Generation    en_US
dc.subject    OLAP Systems    en_US
dc.subject    Query Processing    en_US
dc.subject    Synthetic Database Generation    en_US
dc.subject.classification    Research Subject Categories::TECHNOLOGY::Information technology::Computer science    en_US
dc.title    HYDRA: A Dynamic Approach to Database Regeneration    en_US
dc.type    Thesis    en_US
dc.degree.name    PhD    en_US
dc.degree.level    Doctoral    en_US
dc.degree.grantor    Indian Institute of Science    en_US
dc.degree.discipline    Engineering    en_US

