BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data

Reference: Agarwal, Sameer, et al. "BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data." Proceedings of the 8th ACM European Conference on Computer Systems (EuroSys). ACM, 2013.
Abstract (from the paper): In this paper, we present BlinkDB, a massively parallel, sampling-based approximate query engine for running ad-hoc, interactive SQL queries on large volumes of data. The key insight that BlinkDB builds on is that one can often make reasonable decisions in the absence of perfect answers. For example, reliably detecting a malfunctioning server using a distributed collection of system logs does not require analyzing every request processed by the system. Based on this insight, BlinkDB allows one to trade off query accuracy for response time, enabling interactive queries over massive data by running queries on data samples and presenting results annotated with meaningful error bars. To achieve this, BlinkDB uses two key ideas that differentiate it from previous work in this area: (1) an adaptive optimization framework that builds and maintains a set of multi-dimensional, multi-resolution samples from original data over time, and (2) a dynamic sample selection strategy that selects an appropriately sized sample based on a query's accuracy and/or response time requirements. We have built an open-source version of BlinkDB and validated its effectiveness using the well-known TPC-H benchmark as well as a real-world analytic workload derived from Conviva Inc. Our experiments on a 100-node cluster show that BlinkDB can answer a wide range of queries from a real-world query trace on up to 17 TB of data in less than 2 seconds (over 100x faster than Hive), within an error of 2-10%.
2. Background
- Over the past two decades a large number of approximation techniques have been proposed, which allow for fast processing of large amounts of data by trading result accuracy for response time and space, including sampling, sketches, and on-line aggregation.
- Existing sampling and sketch methods exhibit low space and time complexity, but typically make strong assumptions about the query workload; on-line aggregation (OLA) makes fewer assumptions about the workload, at the expense of highly variable performance.
2.1 Workload Taxonomy
From low flexibility / high efficiency to high flexibility / low efficiency:
- Predictable Queries
- Predictable Query Predicates
- Predictable QCSs (Query Column Sets: the sets of columns used for filtering and grouping)
- Unpredictable Queries
2.2 Query Patterns in a Production Cluster
- QCSs cover most queries and are stable over time.
3. System Overview
3.1 Supported Queries
- Either specify error bound with confidence level or response time limit.
- BlinkDB does not support arbitrary joins or nested SQL queries. It does support joining a large, sampled fact table with other tables, provided those tables are small enough to fit in the main memory of any single node in the cluster.
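Concretely, the paper extends SQL so that a query carries one of the two constraints. The examples below follow the paper's own samples (the Sessions table and its columns come from the paper's running example):

```sql
-- Bound the relative error at a given confidence level:
SELECT COUNT(*)
FROM Sessions
WHERE Genre = 'western'
GROUP BY OS
ERROR WITHIN 10% AT CONFIDENCE 95%

-- Or bound the response time instead:
SELECT COUNT(*)
FROM Sessions
WHERE Genre = 'western'
GROUP BY OS
WITHIN 5 SECONDS
```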
4. Sample Creation
4.1 Stratified Sample
Uniform samples do not work well for queries that filter or group on rare subsets of the table. The standard remedy is stratified sampling, which ensures that rare subgroups are sufficiently represented.
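A minimal sketch of the difference (the table, group sizes, and cap below are invented for illustration; this is not BlinkDB's implementation):

```python
import random
from collections import defaultdict

# Hypothetical table of (city, revenue) rows; "Oslo" is a rare group
# that a small uniform sample can easily miss entirely.
rows = [("NYC", random.random()) for _ in range(9_900)] + \
       [("Oslo", random.random()) for _ in range(100)]

def uniform_sample(rows, n):
    # A uniform sample of 200 rows contains only ~2 Oslo rows on average,
    # and sometimes none at all.
    return random.sample(rows, n)

def stratified_sample(rows, cap):
    """Cap-K stratified sample: keep up to `cap` rows from every group,
    so rare groups are always represented."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    sample = []
    for members in groups.values():
        sample.extend(random.sample(members, min(cap, len(members))))
    return sample

uni = uniform_sample(rows, 200)
strat = stratified_sample(rows, cap=100)  # 100 NYC rows + all 100 Oslo rows
```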
Optimizing a stratified sample for a single query
- For error bound or time bound, it essentially needs to find the optimal sample size n.
- To avoid missing or under-representing rare groups, the sample assigns an equal target size to each group rather than sampling proportionally.
- Given a per-group cap of size K (the maximum number of rows kept per group), standard aggregate operators have standard error inversely proportional to the square root of the per-group sample size, so accuracy improves predictably as K grows.
Optimizing a set of stratified samples for all queries sharing a QCS
- Different queries on the same QCS may call for different sample sizes n. Because a smaller sample can be obtained as a subset of a larger one, BlinkDB maintains a single large sample per QCS and reads only as much of it as each query needs, avoiding the storage overhead of keeping one sample per size.
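One way to realize this nesting (a sketch under my own assumptions, not BlinkDB's on-disk layout): store the sample in random order, so that any prefix of it is itself a valid uniform sample.

```python
import random

def build_master_sample(rows, max_n, seed=0):
    """Draw the largest needed sample once and store it in random order."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return shuffled[:max_n]

def subsample(master, n):
    # Because the master sample is stored in random order, its first n
    # rows are themselves a uniform random sample of size n -- no extra
    # storage is needed for smaller sample sizes.
    return master[:n]

master = build_master_sample(range(100_000), max_n=10_000)
small = subsample(master, 1_000)
```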
4.2 Optimization Framework
5. BlinkDB Runtime
5.1 Selecting the Sample
- If some stored samples' column sets are supersets of the query's columns, choose among them the one with the smallest number of columns. If none exists, pick the sample with the highest selectivity for the query (the ratio of rows selected by the query to rows read).
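The superset rule can be sketched as follows (function and variable names are my own, for illustration only):

```python
def pick_sample(query_cols, sample_qcss):
    """Prefer the stored sample whose QCS covers all queried columns
    with the fewest extra columns."""
    query_cols = set(query_cols)
    covering = [qcs for qcs in sample_qcss if query_cols <= set(qcs)]
    if covering:
        return min(covering, key=len)
    # No covering QCS: fall back to the selectivity-based choice
    # among partially matching samples (not shown here).
    return None

best = pick_sample(["city"], [("city", "os"), ("city",), ("os",)])
```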
5.2 Selecting the Right Sample/Size
- Error Profile: to meet an error constraint, BlinkDB estimates the query's selectivity, sample variance (for SUM/AVG), and input data distribution by running the query on several small subsets of the sample, then uses standard closed-form statistical error estimates to compute the number of rows required to meet the constraint.
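For AVG, the closed form is the usual normal-approximation bound: the standard error of the mean is roughly σ/√n, so n ≥ (z·σ/ε)² rows suffice for a confidence-interval half-width of ε. A sketch (the pilot values are invented; this is the generic formula, not BlinkDB's exact code):

```python
import math
import statistics
from statistics import NormalDist

def rows_needed_for_avg(pilot_values, error_bound, confidence=0.95):
    """Closed-form sample-size estimate for an AVG query: choose n so
    that z * sigma / sqrt(n) <= error_bound."""
    sigma = statistics.stdev(pilot_values)          # variance from a small pilot run
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # ~1.96 at 95% confidence
    return math.ceil((z * sigma / error_bound) ** 2)

n = rows_needed_for_avg([0.0, 2.0] * 50, error_bound=0.5)
```

Note how halving the error bound roughly quadruples the required sample size, which is why small relative errors are affordable while exact answers are not.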
- Latency Profile: BlinkDB predicts the number of rows that can be processed within the time bound by assuming latency scales linearly with input size, as is commonly observed for I/O-bound queries in parallel distributed execution environments. To avoid non-linearities that arise when running on very small in-memory samples, BlinkDB first runs a few smaller samples until performance appears to grow linearly, then estimates the linear scaling constants of the model.
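A sketch of that calibration step, assuming a handful of (rows, seconds) pilot measurements (the numbers below are invented):

```python
def fit_latency_model(pilot_runs):
    """Ordinary least squares for latency ~ a * rows + b, fit on a few
    (rows, seconds) measurements taken on small samples."""
    n = len(pilot_runs)
    sx = sum(r for r, _ in pilot_runs)
    sy = sum(t for _, t in pilot_runs)
    sxx = sum(r * r for r, _ in pilot_runs)
    sxy = sum(r * t for r, t in pilot_runs)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

def rows_for_time_bound(a, b, time_bound):
    # Invert the linear model: the largest input size the bound allows.
    return max(0, int((time_bound - b) / a))

# Invented pilot measurements: (rows processed, seconds observed).
a, b = fit_latency_model([(1_000_000, 0.5), (2_000_000, 0.9), (4_000_000, 1.7)])
max_rows = rows_for_time_bound(a, b, time_bound=2.0)
```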
5.3 An Example
5.4 Bias Correction
- Because samples are stratified, popular subgroups are sampled at a lower effective rate than rare ones (their rows are capped at K). BlinkDB therefore records the effective sampling rate of each group in a hidden column of the sample table's schema, and reweights each subgroup by it to produce an unbiased result.
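The reweighting amounts to a Horvitz-Thompson style estimator: divide each sampled value by its group's effective sampling rate. A sketch with invented rates (the paper stores these rates in the hidden column; the names here are illustrative):

```python
def unbiased_sum(sample_rows, rates):
    """Unbiased SUM estimate: weight each sampled value by the inverse
    of its group's effective sampling rate."""
    return sum(value / rates[group] for group, value in sample_rows)

# Hypothetical rates: the popular group "a" was capped and kept only 10%
# of its rows; the rare group "b" kept all of its rows.
rates = {"a": 0.1, "b": 1.0}
sample = [("a", 5.0), ("a", 3.0), ("b", 2.0)]
total = unbiased_sum(sample, rates)  # 5/0.1 + 3/0.1 + 2/1.0 = 82.0
```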
6. Implementation
- BlinkDB is built on top of Hive, supports both Hadoop MapReduce and Spark (via Shark) as the execution layer, and uses HDFS as the storage layer.
- BlinkDB addresses correlation among the results of successive queries by periodically replacing the set of samples in use; the same mechanism also lets the samples track changes in the workload over time.
7. Evaluation
7.1 Evaluation Setting
- 100-node EC2 cluster
- Workloads: a query trace from Conviva Inc. and TPC-H
7.2 BlinkDB vs. No Sampling
- Compared against Hive on Hadoop MapReduce and Hive on Spark (Shark), each with and without caching.
7.3 Multi-Dimensional Stratified Sampling
- Compares uniform sampling, single-dimensional stratified samples (stratified on a single column), and multi-dimensional stratified samples (QCSs of up to three columns).
7.4 Scaling Up
8. Related Work
- Sampling Approaches
- Online Aggregation: the main disadvantage of OLA systems is that they must stream data in random order, which imposes significant I/O overhead.
- Materialized Views, Data Cubes, Wavelets, Synopses, Sketches, Histograms.