Hoeffding Inequalities for Online Aggregation
Keywords: Database, aggregation, Hoeffding inequality
Abstract: The online aggregation (OA) system recently proposed by Hellerstein et al. extends current relational database management systems to permit interactive exploration of large, complex datasets. To explore the dataset, the user submits a sequence of "aggregation queries". Computation of an aggregation query requires application of a complex sequence of relational selection and join operators to the base tables maintained by the system to create an output table. Each record in the output table is then mapped to a real number, and summary statistics are computed from the resulting number set. An OA system reads each base table in random order and computes a running estimate of the query result, along with running confidence intervals. Such intervals are a key component of an OA system and indicate to the user the estimated proximity of each running aggregate to the corresponding final result. Haas and Hellerstein previously developed a large-sample confidence interval methodology based on central limit theorems (CLT's) for "cross-product averages". However, such intervals cannot be used at the beginning of processing because there are not enough sampled records to apply the CLT, or at the later stages of processing because the set of running statistics required to compute the confidence interval becomes too large. In this paper, we show how new and existing Hoeffding-type inequalities can be used to avoid these problems and provide conservative running confidence intervals for OA query processing.