## Some Good Papers in SIGMOD 2018 (Research Sessions 9-15)

In this post, I write my reading report for research papers in SIGMOD 2018. I will go through all the papers I feel interesting and write their basic ideas, methods and my personal evaluations. Please let me know if you find anything inappropriate.

## Session 9: Similarity Queries & Estimation

• ### Lightweight Cardinality Estimation in LSM-based Systems

• Summary
1. This paper presents a light-weight way of collecting statistics that alleviates the high cost of building synopses by incorporating the statistics accumulation into the common LSM-based database storage layer lifecycle events, e.g., flush, merge.
2. They support three kinds of synopses: equi-width histograms, equi-height histograms and wavelets, only on primary keys (PK) or on secondary keys (SK). To address the issue of anti-matter records (e.g., deletion), they construct a separate “anti”-synopsis for anto-matter records and compute the difference of these two synopses when there is a query. As for the distributed computation, every LSM-framework event creates a local synopsis that is sent over the network to the master node.
3. The mergeability of synopses is important in the distributed setting, however, only equi-width histograms can be combined (while wavelets allow merging with loss of accuracy). Since this paper primarily focuses on using the statistics for query optimization, where a silght mis-estimation could lead to significant errors, the authors choose to keep all statistics, even with mergeable ones as separate entries. To mitigate the cost of querying synopses, the system periodically merges appropriate synopses.
4. They used a experimental framework from a paper in SIGMOD 1996 which supports generating synthesized data of many distributions. Aside from those, they also used a real-life dataset of web sever log entries during World Cup 1998. They used four types of range queries: fixed-length, half open, random and point. They measured the overhead and accuracy in the experiments.
1. This paper is well written with enough background and smooth logic flow. However, the pseudocode of algorithms are not well formatted.
• ### Column Sketches: A Scan Accelerator for Rapid and Robust Predicate Evaluation

• Summary
1. This paper presents the Column Sketch, a data structure for accelerating scans which uses lossy compression to improve performance regardless of selectivity, data-value distribution, and data clustering.
2. Traditional indexes (e.g., B-trees) don’t perform well with high selectivity; lightweight indexes (e.g., ZoneMap, Column Imprints and Feature Based Data Skipping), which use summary statistics over groups of data to enable data skipping, provide no help when data does not exhibit clustering properties; early pruning methods (e.g., Byte-Slicing, Bit-Slicing and Approximate and Refine) which partition data, decompose the predicate into conjunctions of disjoint sub-predicates and use them to skip blocks of data. Thus they propose the column sketch for queries with moderate selectivity over unclustered data.
3. The column sketch consists of two parts, a compression map that maps the values in the base data to their assigned codes in the sketched column, and a sketched column that sores the output of compression map. During query, we read the sketched column first and only look at the base data when the predicate on the sketched data is true.
4. The objective of compression map is to: (1) assign frequently seen values their own unique code; (2) assign non-unique codes similar number of values; (3) preserve order when necessary; (4) handle unseen values in the domain without re-encoding; (5) exploit frequently queried values (optional). For numerical compression maps, they use equi-depth histograms with reserved codes for frequent values; for categorical compression maps, they use dictionary encoding with a limit of number of unique codes.
5. They use SIMD instructions when doing predicate evaluation over column sketches.
6. They compared against an optimized sequential scan, BitWeaving/V, Column Imprints and a B-tree index. To eliminate the effects of NUMA, each of the experiments in run on a single socket. They measure the performance in terms of cycles per tuple.
1. This paper mentions an interesting fact: currently scans outperform B-trees for query selectivities as low as 1%.
2. This paper is well written with clear logic flows. It is important to first clarify your design goals then describe the implementation details to let readers better understand your techniques.
3. This paper finds a scenario (or a problem setting) where the current approaches don’t work well (i.e., queries with moderate selectivity over unclustered data), then proposes the targeted methods to address it.
• ### Overlap Set Similarity Joins with Theoretical Guarantees

• Summary
1. This paper presents the solution to the set similarity join problem with overlap constraints. It divides the sets into small and large ones and processes them separately. They propose some optimization heuristics for small sets since there are been existing methods for large sets.
2. The size boundary between small and large sets is crucial to the efficiency, they further propose a cost-based method to select the size boundary.
1. This paper is well written and structured.

## Session 10: Analytical Queries

• ### Efficient k-Regret Query Algorithm with Restriction-free Bound for any Dimensionality

• Summary
1. This paper presents an algorithm to solve the k-regret query problem, which is a integration of top-k and skyline query.
2. They proposes the algorithm SPHERE, which is a variation of the $\epsilon$-kernel algorithm.
1. This paper is well written as a theory paper. However, I don’t have the patience to carefully read it.
• ### A Rating-Ranking Method for Crowdsourced Top-k Computation

• Summary
1. This paper presents a rating-ranking method for crowdsourced top-k computation, which asks either rating or ranking questions to the crowd.
2. Rating questions are used to get a rough score for each object, from which the objects with much smaller scores can be pruned; ranking questions are used to refine the scores.
1. This paper is well written and structured. It uses some statistics and probability methods to estimate the top-k objects, and the combination of rating and ranking is novel.

## Session 13: Machine Learning & Knowledge-base Construction

• ### SketchML: Accelerating Distributed Machine Learning with Data Sketches

• Summary
1. This paper presents SketchML, which compresses the sparse and nonuniform-distributed gradient values to better support distributed machine learning (assuming that the communication cost of gradients is nontrivial, i.e. communication-intensive workloads).
2. For a sparse gradient vector consisting of key-value pairs, they use a sketch-based algorithm (quantile sketch with loss of accuracy) to compress values and a delta-binary encoding method (without loss of accuracy) to compress keys.
3. They first convert the gradient values to the bucket indexes by quantile sketch, then use a MinMaxSketch to store the bucket indexes. This is based on the assumption that SGD can still converge with quantification error and underestimated gradients. They also propose separation of positive/negative gradients to avoid reversed gradient (where gradient is actually overestimated) and adaptive learning rate and grouped MinMaxSketch to address vanishing gradient.
4. They use datasets from KDD CUP 2010, KDD Cup 2012 to compare against Adam SGD and ZipML on Logistic Regression, SVM and Linear Regression.
1. This paper is well written and structured. However, some parts are a little bit wordy (e.g., why they didn’t use frequency sketch).
2. They also pointed out the limitations of their methods in the paper, I think it is good.
• ### MISTIQUE: A System to Store and Query Model Intermediates for Model Diagnosis

• Summary
1. This paper presents MISTIQUE, a system that works with traditional ML pipelines and deep neural networks to efficiently capture, store and query model intermediates for diagnosis.
2. To store the model intermediates, they use quantization (including lower precision float representation, k-bit quantization and threshold-based quantization), exact and approximate de-duplication (e.g., sharing the intermediate results between pipelines/models) and adaptive materialization (only materializing frequently-queried intermediates), to reduce the storage footprint.
3. They propose a cost model to decide whether to store a intermediate and whether to execute a query by running a model or reading an intermediate.
4. They used dataset and pipelines from Kaggle Zestimate competition for traditional ML models, and CIFAR10 with VGG16 and another simple CNN model for DNN models.
1. This paper is well written and structured.
2. The topic of storing intermediate results is very interesting and also important, especially for model diagnosis and AutoML. By smart use of intermediate results, we can save lots of computation and also use them to better guide the model search.
• ### A General and Efficient Querying Method for Learning to Hash

• Summary
1. This paper presents a new fine-grained similarity indicator, quantization distance (QD), to replace the Hamming Rank (HR) used in learning to hash for the approximate nearest neighbors (ANN) search problem. They further develop two efficient querying methods based on QD.
2. The quantization distance is defined by the sum of the product of XOR of binary codes and absolute value of the projected vector for each dimension. Intuitively, it measures the minimum change required to the projected vector such that it can be quantized to bucket.
1. This paper is well written and structured.
2. Learning to hash is an interesting topic and the proposed indicator QD is more reasonable than coarse-grained Hamming Rank. It reminds me that sometimes we can replace some intermediate metric (or indicator) to achieve better performance and help algorithm design.
• ### DimBoost: Boosting Gradient Boosting Decision Tree to Higher Dimensions

• Summary
1. This paper presents DimBoost, a system for training gradient boosting decision tree (GBDT) on high-dimensionality data.
2. DimBoost uses the parameter server (PS) architecture, where several machines together store a parameter to prevent the single-point bottleneck, and provide interfaces for workers to push and pull parameters. Each worker holds a local copy of the parameter, and periodically pushes parameter updates to the PS.
3. To speed up histogram construction, they propose a sparsity-aware algorithm, and construct it in batch with layer-wise parallelism.
4. To speed up finding split, they compress the gradient histogram (with loss of precision), use round-robin based scheduling to assign the task of splitting active nodes to workers, and they use two-phase (firstly each server finds the local optimal split and send it to the assigned worker, and secondly the worker finds the global optimal from local optimal splits) to reduce the communication overhead.
5. DimBoost is implemented in Java and deployed on Yarn. They use Netty to manage the message passing between physical machines.
1. This paper is well written and structured.
2. It is good to cooperate with companies because they have huge volumes of real-world data and real-world problems to be solved.
• ### Auto-Detect: Data-Driven Error Detection in Tables

• Summary
1. This paper presents Auto-Detect, a statistics-based technique that leverages co-occurrence statistics from large corpora for error detection.
2. They convert values to patterns using some generalization languages, and detect the single-column error by checking the point-wise mutual information (PMI). By aggregating the scores (PMIs) of different generalization languages together, it give the final prediction for whether it is a error.
3. To generate the training data, they use distant-supervision to augment the labeled data to get a large training dataset. And they develop an approximate static-threshold aggregation algorithm to find thresholds for each generalization algorithm. If any of the generalization languages give a score lower than its threshold, we predict it as incompatible. To reduce the memory footprint, they use sketch (and thus control the memory budget).
1. This paper is well written and structured.
2. The generalization languages still have been to composed by manual. And we have to run the whole framework for each combination of columns.
3. If I understand correctly, for each column, they need to find another paired column, I didn’t find anything in the paper about how to find this paired column.
4. I am curious about the running time of their algorithms, seems that it would be slow. However, there are no related experiments in the paper.

## Session 14: Approximate Query Processing

• ### VerdictDB: Universalizing Approximate Query Processing

• Summary
1. This paper presents VerdictDB, a middleware rewriting analytical queries to compute an approximate answer and error estimates.
2. VerdictDB creates samples offline and rewrites the query to make it execute on the samples, including uniform, hashed, stratified and irregular samples.
3. They propose the variational sub-sampling to enable faster error estimation while retaining the same asymptotic properties.
1. This paper is well written and structured.
2. I don’t think middleware is the right way to go for AQP, it provides good generalization and flexibility, however, it also makes low-level optimizations difficult.
3. This paper talks about offline sampling, while online sampling is also an important topic.
• ### AQP++: Connecting Approximate Query Processing With Aggregate Precomputation for Interactive Analytics

• Summary
1. This paper presents AQP++, which connects sampling-based approximate query processing (AQP) and aggregate precomputation (AggPre) and achieves a better trade-off among preprocessing cost, query response time, and answer quality. The basic idea is that it uses precomputed exact answers as the starting point, and then estimates the difference between the precomputed answer and the real answer by sampling. Therefore, AQP++ subsumes AQP and AggPre.
2. Since AQP++ uses AQP to estimate the difference, if we compare the variances of them estimations, when the correlation between the original query and the difference query is big, AQP++ returns a more accurate result.
3. To build the pre-computed aggregation, they propose blocked prefix cube which computes a small portion of the cells in the traditional prefix cubes. The problem of selecting blocks can be regarded as a optimization problem that minimizes the expected query error. They propose an adaptive hill climbing approach to address this problem.
4. To decide which pre-computed aggregation to use, they use sub-sampling to estimate the confidence interval of the difference query, and choose the one with the smallest confidence interval.
5. They ran their system on TPCD-Skew, BigBench (a synthetic dataset from the Big Data Benchmark) and TLCTrip (from the NYC Taxi and Limousine Commission).
1. This paper is well written and structured.
2. It is always interesting and useful to combine two seemingly orthogonal (but actually connected in some way) ideas together to achieve better overall performance in the system.
• ### Random Sampling over Joins Revisited

• Summary
1. This paper presents a general join sampling framework that can be combined with any join size upper bound method, which can process general multi-way joins (acyclic or cyclic, with or without selection predicates).
2. It implements the sampling over a chain of joins by starting from a single root tuple, and joins it with all tuples in the next relation, and with some probability to reject the whole sample (and restart), or get a uniform random tuple from the joined results and continue processing the next relation. These probabilities can be computed by the upper bound of join sizes. By using this method, it returns each join result with equal probability (uniform and independent).
3. They used TPC-H dataset and a social graph data of twitter friendship links and user profiles. They used KStest to verify their samples are indeed uniform.
1. This paper is well written and structured.

## Session 15: Database for Emerging Hardware

• ### Efficient Top-K Query Processing on Massively Parallel Hardware

• Summary
1. This paper presents several algorithms for top-k problem on GPU, including a new algorithm based on bitonic sort.
2. The algorithm includes three steps: (1) local sort, it generates sorted sequences of size $k$ using partial binotic sort; (2) merge, it bitonically merge two sorted sequences of size $k$; (3) rebuild, it sorts the sequence with the greater $k$ elements and discard the subsequence with the smaller $k$ elements. After each merge and rebuild, the size of the problem is halved, and it recursively applies these two steps until $k$ elements left.
3. They further propose several detailed optimizations about implementing these top-k algorithms on GPU.
4. They also present a cost model that predicts the performance of these algorithms with respect to $k$, allowing a query optimizer to choose the best top-$k$ implementation for a particular query.