Revisiting Statistical Techniques for Result Cardinality Estimation
Abstract
Relational Database Management Systems (RDBMSs) constitute the backbone of today's information-rich society, providing a congenial environment for handling enterprise data over its entire life cycle of generation, storage, maintenance, and processing. The Structured Query Language (SQL) is the standard interface for querying the information present in RDBMS-based storage. Knowing the expected size of an SQL query result, measured in terms of the output row-cardinality, prior to execution can benefit both the database engine and the user in several ways. The use-cases include assessing query feasibility, approximate query answering, query progress monitoring, and resource allocation strategies. In the context of our work, we define cardinality estimation as the estimation of the result size (i.e., the number of rows in the output) of a given SQL query.
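As a concrete illustration, consider the toy example below; the table and query are ours, not drawn from the paper or any benchmark. The result cardinality is simply the number of rows the query would return, and an estimator must predict this count without actually executing the query.

    # Illustrative toy example: "result cardinality" is the number of rows
    # a query returns. The table and predicate here are hypothetical.
    rows = [("alice", 34), ("bob", 27), ("carol", 41), ("dave", 27)]

    # SELECT COUNT(*) FROM people WHERE age > 30
    true_cardinality = sum(1 for _, age in rows if age > 30)
    print(true_cardinality)  # 2; an estimator must predict this before execution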
Unfortunately, the histogram- and sampling-based techniques commonly used in industrial database engines for cardinality estimation are often woefully inaccurate in practice. This lacuna has motivated a recent slew of papers advocating the use of machine-learning/deep-learning techniques for cardinality estimation. However, these new approaches have their own limitations, including significant training effort, an inability to handle dynamic data updates, and poor generalization to unseen queries.
In this work, we take a fresh look at traditional random sampling and investigate whether it can be made to work satisfactorily when augmented with lightweight data structures.
Specifically, we present GridSam, a Grid-based Dynamic Sampling technique that augments random sampling with histograms, incorporating both algorithmic and platform innovations.
From the algorithmic perspective, GridSam first creates a multi-dimensional grid overlay by partitioning the data-space on “critical” attributes, and then performs dynamic sampling from the confined query-specific region of the grid to capture correlations. A greedy methodology, targeted towards reducing the occurrence of the Zero Sample Problem (where the sample contains no rows satisfying the query, forcing a trivial zero estimate), is used to determine the set of “critical” attributes that serve as the grid dimensions. Further, insights from the Index-based Join Sampling (IBJS) technique are leveraged to direct the sampling in multi-table queries. Finally, learned indexes are incorporated to reduce the index-probing cost of join sampling during the estimation process.
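The sketch below conveys the core single-table idea under simplifying assumptions (equi-width bucketing and range predicates over two critical attributes). All names and parameters are our illustration, not GridSam's actual implementation, which additionally handles multi-table queries via IBJS-style index probes.

    # Minimal sketch of grid-confined dynamic sampling, assuming equi-width
    # bucketing per critical attribute. Illustrative only; not GridSam's code.
    import random
    from collections import defaultdict

    def build_grid(rows, dims, buckets=10):
        """Assign each row to a grid cell keyed by its bucket on each critical attribute."""
        los = [min(r[d] for r in rows) for d in dims]
        his = [max(r[d] for r in rows) for d in dims]
        def cell(point):
            return tuple(
                min(buckets - 1, max(0, int((point[d] - lo) / (hi - lo + 1e-9) * buckets)))
                for d, lo, hi in zip(dims, los, his))
        grid = defaultdict(list)
        for r in rows:
            grid[cell(r)].append(r)
        return grid, cell

    def estimate(grid, cell_fn, predicate, query_lo, query_hi, sample_size=100):
        """Sample only from grid cells overlapping the query box, then scale up."""
        lo_cell, hi_cell = cell_fn(query_lo), cell_fn(query_hi)
        region = [r for c, rs in grid.items()
                  if all(l <= x <= h for x, l, h in zip(c, lo_cell, hi_cell))
                  for r in rs]
        if not region:
            return 0.0
        sample = random.sample(region, min(sample_size, len(region)))
        hit_rate = sum(predicate(r) for r in sample) / len(sample)
        return hit_rate * len(region)  # scale sample selectivity to the region size

    # Usage: a 2-D range query over hypothetical data.
    data = [(random.gauss(50, 15), random.gauss(50, 15)) for _ in range(10000)]
    grid, cell_fn = build_grid(data, dims=(0, 1))
    pred = lambda r: 40 <= r[0] <= 60 and 40 <= r[1] <= 60
    print(estimate(grid, cell_fn, pred, query_lo=(40, 40), query_hi=(60, 60)))

Confining the sample to the query's grid region is what mitigates the Zero Sample Problem: a small sample drawn from the whole table may easily miss the few qualifying rows, whereas a sample drawn only from the overlapping cells is far more likely to contain them.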
From the platform perspective, GridSam leverages the massive parallelism offered by current GPU architectures to provide fast grid setup times. This parallelism is also extended to the run-time estimation process.
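To see why the grid setup parallelizes so well, note that each row's cell assignment is independent of every other row's. The vectorized sketch below (our illustration, not the paper's code) computes all cell ids in a single pass; the same element-wise arithmetic maps directly onto a GPU kernel, e.g., via CuPy or hand-written CUDA.

    # Grid-cell assignment is embarrassingly parallel: every row's cell id is
    # computed independently of the others. Illustrative NumPy version; on a
    # GPU, the same arithmetic would run as one kernel over all rows.
    import numpy as np

    def cell_ids(data, buckets=10):
        """data: (n_rows, n_dims) array holding the critical attributes."""
        lo, hi = data.min(axis=0), data.max(axis=0)
        b = ((data - lo) / (hi - lo + 1e-9) * buckets).astype(np.int64)
        b = np.clip(b, 0, buckets - 1)  # clamp boundary values into range
        # Flatten per-dimension bucket indices into a single linear cell id.
        return np.ravel_multi_index(b.T, dims=(buckets,) * data.shape[1])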
A detailed performance study on benchmark environments indicates that GridSam computes cardinality estimates with accuracy competitive with contemporary learning-based techniques. Moreover, it does so while achieving an orders-of-magnitude reduction in setup time. Further, its estimation time is in the same ballpark as that of both traditional and learning-based techniques. Finally, a collateral benefit of GridSam's simple and highly parallelizable design is that, unlike learned estimators, it is amenable to dynamic data environments with frequent data updates.