Revisiting Statistical Techniques for Result Cardinality Estimation

Shah, Dhrumilkumar

dc.contributor.advisor	Haritsa, Jayant
dc.contributor.author	Shah, Dhrumilkumar
dc.date.accessioned	2021-06-16T10:24:35Z
dc.date.available	2021-06-16T10:24:35Z
dc.date.submitted	2021
dc.identifier.uri	https://etd.iisc.ac.in/handle/2005/5163
dc.description.abstract	The Relational Database Management Systems (RDBMS) constitute the backbone of today's information-rich society, providing a congenial environment for handling enterprise data during its entire life cycle of generation, storage, maintenance, and processing. The Structured Query Language (SQL) is the standard interface to query the information present in RDBMS-based storage. Knowing the expected size of the SQL query result, measured in terms of the output row-cardinality, prior to execution can benefit both the RDBMS system and the user in several ways. The use-cases include assessing query feasibility, approximate query answering, query progress monitoring, and resource allocation strategies. In the context of our work, we define cardinality estimation as the estimation of the result size (number of rows in the output) of the given SQL query. Unfortunately, the histogram and sampling-based techniques commonly used in industrial database engines for cardinality estimation are often woefully inaccurate in practice. This lacuna has motivated a recent slew of papers advocating the use of machine-learning/deep-learning techniques for cardinality estimation. However, these new approaches have their own limitations regarding significant training effort, inability to handle dynamic data-updates, and generalization to unseen queries. In this work, we take a relook at the traditional random sampling and investigate whether they can be made to work satisfactorily when augmented with lightweight data structures. Specifically, we present GridSam – a Grid-based Dynamic Sampling technique, which essentially augments random sampling with histograms, incorporating both algorithmic and platform innovations. From the algorithmic perspective, GridSam first creates a multi-dimensional grid overlay by partitioning the data-space on “critical” attributes, and then performs dynamic sampling from the confined query-specific region of the grid to capture correlations. A greedy methodology targeted towards reducing the Zero Sample Problem occurrence is used to determine the set of “critical” attributes as the grid dimensions. Further, insights from the Index-based Join Sampling (IBJS) technique are leveraged to direct the sampling in multi-table queries. Finally, learned-indexes are incorporated to reduce the index-probing cost for join sampling during the estimation process. From the platform perspective, GridSam leverages the massive parallelism offered by current GPU architectures to provide fast grid setup times. This parallelism is also extended to the run-time estimation process. A detailed performance study on benchmark environments indicates that GridSam computes cardinality estimates with accuracies competitive to contemporary learning-based techniques. Moreover, it does so while achieving an orders-of-magnitude reduction in setup time. Further, the estimation time is in the same ballpark as both traditional and learning-based techniques. Finally, a collateral benefit of GridSam's simple and highly parallelizable design is that, unlike learned estimators, it is amenable to dynamic data environments with frequent data-updates.	en_US
dc.language.iso	en_US	en_US
dc.rights	I grant Indian Institute of Science the right to archive and to make available my thesis or dissertation in whole or in part in all forms of media, now hereafter known. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation	en_US
dc.subject	Cardinality Estimation	en_US
dc.subject	Relational Database Management System	en_US
dc.subject	Cardinality Estimation in Database Systems	en_US
dc.subject.classification	Research Subject Categories::TECHNOLOGY::Information technology::Computer science	en_US
dc.title	Revisiting Statistical Techniques for Result Cardinality Estimation	en_US
dc.type	Thesis	en_US
dc.degree.name	MTech (Res)	en_US
dc.degree.level	Masters	en_US
dc.degree.grantor	Indian Institute of Science	en_US
dc.degree.discipline	Engineering	en_US

Files in this item

Name:: dhrumil_thesis.pdf
Size:: 1.774Mb
Format:: PDF
Description:: Thesis full text

View/Open

This item appears in the following Collection(s)

Computer Science and Automation (CSA) [545]

Show simple item record