HYDRA: A Dynamic Approach to Database Regeneration
Abstract
Database software vendors often need to generate synthetic databases for a variety of applications, including (a) testing database engines and applications, (b) data masking, (c) benchmarking, (d) creating what-if scenarios, and (e) assessing the performance impact of planned engine upgrades. These synthetic databases aim to capture the desired schematic properties (e.g., keys, referential constraints, functional dependencies, domain constraints), as well as the statistical data profiles (e.g., value distributions, column correlations, data skew, output volumes) hosted on these schemas.
Several data generation frameworks have been proposed for OLAP over the past three decades. The early efforts focused on ab initio generation based on standard mathematical distributions. Subsequently, there was a shift to database-dependent regeneration, which aims to create a database whose statistical properties are similar to those of a specific client database. However, these mechanisms could not mimic the customer query-processing environments satisfactorily. The contemporary school of thought is workload-aware data generation, which takes query execution plans from the customer workload as input and guarantees volumetric similarity. That is, the intermediate row cardinalities obtained at the client and vendor sites are very similar when matching query plans are executed. This similarity helps to preserve the multi-dimensional layout and flow of the data, a prerequisite for achieving similar performance on the client’s workload. However, even in this category, the existing frameworks are hampered by limitations such as the inability to (a) provide a comprehensive algorithm for handling queries based on the core relational algebra operators, namely, Select, Project, and Join; (b) scale to big data volumes; (c) scale to large input workloads; and (d) provide high accuracy on unseen queries.
In this work, motivated by the above lacunae, we present Hydra, a data regeneration tool that materially addresses the above challenges by adding functionality, dynamism, scale, and robustness. Firstly, extended workload coverage is provided through a comprehensive solution for modeling select-project-join relational algebra operators. Specifically, the constraints are represented as a linear feasibility problem, in which each variable represents the volume of a partitioned region of the data space. Our partitioning scheme for filter constraints permits the regions to be non-convex and ensures the minimum number of regions, thereby substantially reducing the problem complexity as compared to the rectangular grid-partitioning advocated in the prior literature. Similarly, our projection subspace division and projection isolation strategies address the critical challenge of capturing unions, as opposed to summations, when incorporating projection constraints. Finally, by creating referential constraints over denormalized equivalents of the tables, Hydra delivers a comprehensive solution that also handles join constraints.
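As a rough illustration of the filter-constraint formulation, the sketch below groups the points of a toy two-column domain by the set of filters they satisfy, so that each distinct "signature" becomes one region and one variable. The domain, the two filters, their row counts, and all function names are hypothetical, and the grouping is a simplified stand-in rather than Hydra's actual partitioning algorithm.

```python
from collections import defaultdict
from itertools import product

# Hypothetical toy domain: two integer columns, each with values 0..9.
DOMAIN = list(product(range(10), range(10)))
TOTAL_ROWS = 1000  # assumed table size


def f1(a, b):
    return a < 5


def f2(a, b):
    return a >= 2 and b >= 3


# Hypothetical filters with the output row counts observed at the client.
FILTERS = {"f1": (f1, 600), "f2": (f2, 500)}


def partition_by_signature(domain, filters):
    """Group points that satisfy exactly the same subset of filters."""
    regions = defaultdict(list)
    for point in domain:
        signature = frozenset(
            name for name, (pred, _) in filters.items() if pred(*point)
        )
        regions[signature].append(point)
    return dict(regions)


def build_constraints(regions, filters, total_rows):
    """Return one (region signatures, required sum) pair per linear equality."""
    constraints = [(list(regions), total_rows)]  # all regions sum to table size
    for name, (_, count) in filters.items():
        members = [sig for sig in regions if name in sig]
        constraints.append((members, count))
    return constraints


def is_feasible(volumes, constraints):
    """Check that non-negative region volumes satisfy every equality."""
    if any(v < 0 for v in volumes.values()):
        return False
    return all(sum(volumes[s] for s in sigs) == rhs for sigs, rhs in constraints)
```

In this toy case the two filters induce four regions, whereas a rectangular grid cut at a = 2, a = 5, and b = 3 would produce six cells. The region satisfying only `f1` is L-shaped, which is an instance of a non-convex region. A linear-programming solver would normally find the volumes; here `is_feasible` only verifies a candidate assignment.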
Secondly, a unique feature of our data regeneration approach is that it delivers a database summary as the output rather than the static data itself. This summary is of negligible size and depends only on the query workload and not on the database scale. It can be used for dynamically generating data during query execution. Therefore, the enormous time and space overheads incurred by prior techniques in generating and storing the data before initiating analysis are eliminated. Our experience is that the summaries for complex Big Data client scenarios comprising over a hundred queries are constructed within just a few minutes, requiring only a few megabytes of storage.
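The idea of generating data from a summary at query time can be sketched as follows. The summary layout (one representative tuple plus a row count per entry), the sample values, and the `scan` function are assumptions made for illustration and are not taken from the Hydra implementation.

```python
from typing import Iterator, List, NamedTuple


class SummaryRow(NamedTuple):
    """One summary entry: a representative tuple and the rows it stands for."""
    values: tuple
    count: int


# Hypothetical summary of a two-column table. Its size grows with the number
# of regions induced by the workload, not with the number of rows produced.
SUMMARY: List[SummaryRow] = [
    SummaryRow(values=(0, 0), count=300),
    SummaryRow(values=(2, 3), count=300),
    SummaryRow(values=(5, 0), count=200),
    SummaryRow(values=(5, 3), count=200),
]


def scan(summary: List[SummaryRow]) -> Iterator[tuple]:
    """Lazily yield rows on demand instead of materializing the table."""
    for entry in summary:
        for _ in range(entry.count):
            yield entry.values


# Usage: evaluate a filter over the generated stream without storing the rows.
rows_with_a_below_5 = sum(1 for a, b in scan(SUMMARY) if a < 5)
```

Because `scan` is a generator, a consumer such as a filter operator pulls rows one at a time, so memory use stays proportional to the summary rather than to the regenerated table.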
Thirdly, to improve accuracy on unseen queries, Hydra additionally exploits metadata statistics maintained by the database engine. Specifically, it adds an objective function to the linear program to pick a solution with improved inter-region tuple distribution. Further, a uniform distribution of tuples within regions is modeled to obtain a spread of values. These techniques facilitate the careful selection of a desirable database from the candidate synthetic databases, and also provide metadata compliance.
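The intra-region spread can be pictured with a small sketch. It assumes a one-dimensional region given as a half-open integer interval and uses deterministic even spacing as a simple stand-in for a uniform distribution; the function name and parameters are hypothetical.

```python
def spread_uniform(lo: int, hi: int, count: int) -> list:
    """Place `count` values evenly across the integer interval [lo, hi)."""
    if hi <= lo:
        raise ValueError("region must be non-empty")
    width = hi - lo
    # The i-th value advances in proportion to i / count, so values cover
    # the interval instead of repeating a single representative point.
    return [lo + (i * width) // count for i in range(count)]
```

For example, `spread_uniform(0, 5, 10)` assigns two rows to each of the five values in the interval, so a previously unseen range predicate inside the region would see a proportional share of the rows.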
The proposed ideas have been evaluated on the TPC-DS synthetic benchmark, as well as real-world benchmarks based on the Census and IMDB databases. Further, the Hydra framework has been prototyped in a Java-based tool that provides a visual and interactive demonstration of the data regeneration pipeline. The tool has been warmly received by both academic and industrial communities.