Some Good Papers in SIGMOD 2018 (Research Sessions 915)
In this post, I write my reading report for research papers in SIGMOD 2018. I will go through all the papers I feel interesting and write their basic ideas, methods and my personal evaluations. Please let me know if you find anything inappropriate.
Session 9: Similarity Queries & Estimation

Lightweight Cardinality Estimation in LSMbased Systems
 Summary
 This paper presents a lightweight way of collecting statistics that alleviates the high cost of building synopses by incorporating the statistics accumulation into the common LSMbased database storage layer lifecycle events, e.g., flush, merge.
 They support three kinds of synopses: equiwidth histograms, equiheight histograms and wavelets, only on primary keys (PK) or on secondary keys (SK). To address the issue of antimatter records (e.g., deletion), they construct a separate “anti”synopsis for antomatter records and compute the difference of these two synopses when there is a query. As for the distributed computation, every LSMframework event creates a local synopsis that is sent over the network to the master node.
 The mergeability of synopses is important in the distributed setting, however, only equiwidth histograms can be combined (while wavelets allow merging with loss of accuracy). Since this paper primarily focuses on using the statistics for query optimization, where a silght misestimation could lead to significant errors, the authors choose to keep all statistics, even with mergeable ones as separate entries. To mitigate the cost of querying synopses, the system periodically merges appropriate synopses.
 They used a experimental framework from a paper in SIGMOD 1996 which supports generating synthesized data of many distributions. Aside from those, they also used a reallife dataset of web sever log entries during World Cup 1998. They used four types of range queries: fixedlength, half open, random and point. They measured the overhead and accuracy in the experiments.
 Comments
 This paper is well written with enough background and smooth logic flow. However, the pseudocode of algorithms are not well formatted.
 Summary

Column Sketches: A Scan Accelerator for Rapid and Robust Predicate Evaluation
 Summary
 This paper presents the Column Sketch, a data structure for accelerating scans which uses lossy compression to improve performance regardless of selectivity, datavalue distribution, and data clustering.
 Traditional indexes (e.g., Btrees) don’t perform well with high selectivity; lightweight indexes (e.g., ZoneMap, Column Imprints and Feature Based Data Skipping), which use summary statistics over groups of data to enable data skipping, provide no help when data does not exhibit clustering properties; early pruning methods (e.g., ByteSlicing, BitSlicing and Approximate and Refine) which partition data, decompose the predicate into conjunctions of disjoint subpredicates and use them to skip blocks of data. Thus they propose the column sketch for queries with moderate selectivity over unclustered data.
 The column sketch consists of two parts, a compression map that maps the values in the base data to their assigned codes in the sketched column, and a sketched column that sores the output of compression map. During query, we read the sketched column first and only look at the base data when the predicate on the sketched data is true.
 The objective of compression map is to: (1) assign frequently seen values their own unique code; (2) assign nonunique codes similar number of values; (3) preserve order when necessary; (4) handle unseen values in the domain without reencoding; (5) exploit frequently queried values (optional). For numerical compression maps, they use equidepth histograms with reserved codes for frequent values; for categorical compression maps, they use dictionary encoding with a limit of number of unique codes.
 They use SIMD instructions when doing predicate evaluation over column sketches.
 They compared against an optimized sequential scan, BitWeaving/V, Column Imprints and a Btree index. To eliminate the effects of NUMA, each of the experiments in run on a single socket. They measure the performance in terms of cycles per tuple.
 Comments
 This paper mentions an interesting fact: currently scans outperform Btrees for query selectivities as low as 1%.
 This paper is well written with clear logic flows. It is important to first clarify your design goals then describe the implementation details to let readers better understand your techniques.
 This paper finds a scenario (or a problem setting) where the current approaches don’t work well (i.e., queries with moderate selectivity over unclustered data), then proposes the targeted methods to address it.
 Summary

Overlap Set Similarity Joins with Theoretical Guarantees
 Summary
 This paper presents the solution to the set similarity join problem with overlap constraints. It divides the sets into small and large ones and processes them separately. They propose some optimization heuristics for small sets since there are been existing methods for large sets.
 The size boundary between small and large sets is crucial to the efficiency, they further propose a costbased method to select the size boundary.
 Comments
 This paper is well written and structured.
 Summary
Session 10: Analytical Queries

Efficient kRegret Query Algorithm with Restrictionfree Bound for any Dimensionality
 Summary
 This paper presents an algorithm to solve the kregret query problem, which is a integration of topk and skyline query.
 They proposes the algorithm SPHERE, which is a variation of the kernel algorithm.
 Comments
 This paper is well written as a theory paper. However, I don’t have the patience to carefully read it.
 Summary

A RatingRanking Method for Crowdsourced Topk Computation
 Summary
 This paper presents a ratingranking method for crowdsourced topk computation, which asks either rating or ranking questions to the crowd.
 Rating questions are used to get a rough score for each object, from which the objects with much smaller scores can be pruned; ranking questions are used to refine the scores.
 Comments
 This paper is well written and structured. It uses some statistics and probability methods to estimate the topk objects, and the combination of rating and ranking is novel.
 Summary
Session 13: Machine Learning & Knowledgebase Construction

SketchML: Accelerating Distributed Machine Learning with Data Sketches
 Summary
 This paper presents SketchML, which compresses the sparse and nonuniformdistributed gradient values to better support distributed machine learning (assuming that the communication cost of gradients is nontrivial, i.e. communicationintensive workloads).
 For a sparse gradient vector consisting of keyvalue pairs, they use a sketchbased algorithm (quantile sketch with loss of accuracy) to compress values and a deltabinary encoding method (without loss of accuracy) to compress keys.
 They first convert the gradient values to the bucket indexes by quantile sketch, then use a MinMaxSketch to store the bucket indexes. This is based on the assumption that SGD can still converge with quantification error and underestimated gradients. They also propose separation of positive/negative gradients to avoid reversed gradient (where gradient is actually overestimated) and adaptive learning rate and grouped MinMaxSketch to address vanishing gradient.
 They use datasets from KDD CUP 2010, KDD Cup 2012 to compare against Adam SGD and ZipML on Logistic Regression, SVM and Linear Regression.
 Comments
 This paper is well written and structured. However, some parts are a little bit wordy (e.g., why they didn’t use frequency sketch).
 They also pointed out the limitations of their methods in the paper, I think it is good.
 Summary

MISTIQUE: A System to Store and Query Model Intermediates for Model Diagnosis
 Summary
 This paper presents MISTIQUE, a system that works with traditional ML pipelines and deep neural networks to efficiently capture, store and query model intermediates for diagnosis.
 To store the model intermediates, they use quantization (including lower precision float representation, kbit quantization and thresholdbased quantization), exact and approximate deduplication (e.g., sharing the intermediate results between pipelines/models) and adaptive materialization (only materializing frequentlyqueried intermediates), to reduce the storage footprint.
 They propose a cost model to decide whether to store a intermediate and whether to execute a query by running a model or reading an intermediate.
 They used dataset and pipelines from Kaggle Zestimate competition for traditional ML models, and CIFAR10 with VGG16 and another simple CNN model for DNN models.
 Comments
 This paper is well written and structured.
 The topic of storing intermediate results is very interesting and also important, especially for model diagnosis and AutoML. By smart use of intermediate results, we can save lots of computation and also use them to better guide the model search.
 Summary

A General and Efficient Querying Method for Learning to Hash
 Summary
 This paper presents a new finegrained similarity indicator, quantization distance (QD), to replace the Hamming Rank (HR) used in learning to hash for the approximate nearest neighbors (ANN) search problem. They further develop two efficient querying methods based on QD.
 The quantization distance is defined by the sum of the product of XOR of binary codes and absolute value of the projected vector for each dimension. Intuitively, it measures the minimum change required to the projected vector such that it can be quantized to bucket.
 Comments
 This paper is well written and structured.
 Learning to hash is an interesting topic and the proposed indicator QD is more reasonable than coarsegrained Hamming Rank. It reminds me that sometimes we can replace some intermediate metric (or indicator) to achieve better performance and help algorithm design.
 Summary

DimBoost: Boosting Gradient Boosting Decision Tree to Higher Dimensions
 Summary
 This paper presents DimBoost, a system for training gradient boosting decision tree (GBDT) on highdimensionality data.
 DimBoost uses the parameter server (PS) architecture, where several machines together store a parameter to prevent the singlepoint bottleneck, and provide interfaces for workers to push and pull parameters. Each worker holds a local copy of the parameter, and periodically pushes parameter updates to the PS.
 To speed up histogram construction, they propose a sparsityaware algorithm, and construct it in batch with layerwise parallelism.
 To speed up finding split, they compress the gradient histogram (with loss of precision), use roundrobin based scheduling to assign the task of splitting active nodes to workers, and they use twophase (firstly each server finds the local optimal split and send it to the assigned worker, and secondly the worker finds the global optimal from local optimal splits) to reduce the communication overhead.
 DimBoost is implemented in Java and deployed on Yarn. They use Netty to manage the message passing between physical machines.
 Comments
 This paper is well written and structured.
 It is good to cooperate with companies because they have huge volumes of realworld data and realworld problems to be solved.
 Summary

AutoDetect: DataDriven Error Detection in Tables
 Summary
 This paper presents AutoDetect, a statisticsbased technique that leverages cooccurrence statistics from large corpora for error detection.
 They convert values to patterns using some generalization languages, and detect the singlecolumn error by checking the pointwise mutual information (PMI). By aggregating the scores (PMIs) of different generalization languages together, it give the final prediction for whether it is a error.
 To generate the training data, they use distantsupervision to augment the labeled data to get a large training dataset. And they develop an approximate staticthreshold aggregation algorithm to find thresholds for each generalization algorithm. If any of the generalization languages give a score lower than its threshold, we predict it as incompatible. To reduce the memory footprint, they use sketch (and thus control the memory budget).
 Comments
 This paper is well written and structured.
 The generalization languages still have been to composed by manual. And we have to run the whole framework for each combination of columns.
 If I understand correctly, for each column, they need to find another paired column, I didn’t find anything in the paper about how to find this paired column.
 I am curious about the running time of their algorithms, seems that it would be slow. However, there are no related experiments in the paper.
 Summary
Session 14: Approximate Query Processing

VerdictDB: Universalizing Approximate Query Processing
 Summary
 This paper presents VerdictDB, a middleware rewriting analytical queries to compute an approximate answer and error estimates.
 VerdictDB creates samples offline and rewrites the query to make it execute on the samples, including uniform, hashed, stratified and irregular samples.
 They propose the variational subsampling to enable faster error estimation while retaining the same asymptotic properties.
 Comments
 This paper is well written and structured.
 I don’t think middleware is the right way to go for AQP, it provides good generalization and flexibility, however, it also makes lowlevel optimizations difficult.
 This paper talks about offline sampling, while online sampling is also an important topic.
 Summary

AQP++: Connecting Approximate Query Processing With Aggregate Precomputation for Interactive Analytics
 Summary
 This paper presents AQP++, which connects samplingbased approximate query processing (AQP) and aggregate precomputation (AggPre) and achieves a better tradeoff among preprocessing cost, query response time, and answer quality. The basic idea is that it uses precomputed exact answers as the starting point, and then estimates the difference between the precomputed answer and the real answer by sampling. Therefore, AQP++ subsumes AQP and AggPre.
 Since AQP++ uses AQP to estimate the difference, if we compare the variances of them estimations, when the correlation between the original query and the difference query is big, AQP++ returns a more accurate result.
 To build the precomputed aggregation, they propose blocked prefix cube which computes a small portion of the cells in the traditional prefix cubes. The problem of selecting blocks can be regarded as a optimization problem that minimizes the expected query error. They propose an adaptive hill climbing approach to address this problem.
 To decide which precomputed aggregation to use, they use subsampling to estimate the confidence interval of the difference query, and choose the one with the smallest confidence interval.
 They ran their system on TPCDSkew, BigBench (a synthetic dataset from the Big Data Benchmark) and TLCTrip (from the NYC Taxi and Limousine Commission).
 Comments
 This paper is well written and structured.
 It is always interesting and useful to combine two seemingly orthogonal (but actually connected in some way) ideas together to achieve better overall performance in the system.
 Summary

Random Sampling over Joins Revisited
 Summary
 This paper presents a general join sampling framework that can be combined with any join size upper bound method, which can process general multiway joins (acyclic or cyclic, with or without selection predicates).
 It implements the sampling over a chain of joins by starting from a single root tuple, and joins it with all tuples in the next relation, and with some probability to reject the whole sample (and restart), or get a uniform random tuple from the joined results and continue processing the next relation. These probabilities can be computed by the upper bound of join sizes. By using this method, it returns each join result with equal probability (uniform and independent).
 They used TPCH dataset and a social graph data of twitter friendship links and user profiles. They used KStest to verify their samples are indeed uniform.
 Comments
 This paper is well written and structured.
 Summary
Session 15: Database for Emerging Hardware

Efficient TopK Query Processing on Massively Parallel Hardware
 Summary
 This paper presents several algorithms for topk problem on GPU, including a new algorithm based on bitonic sort.
 The algorithm includes three steps: (1) local sort, it generates sorted sequences of size using partial binotic sort; (2) merge, it bitonically merge two sorted sequences of size ; (3) rebuild, it sorts the sequence with the greater elements and discard the subsequence with the smaller elements. After each merge and rebuild, the size of the problem is halved, and it recursively applies these two steps until elements left.
 They further propose several detailed optimizations about implementing these topk algorithms on GPU.
 They also present a cost model that predicts the performance of these algorithms with respect to , allowing a query optimizer to choose the best top implementation for a particular query.
 Comments
 This paper is well written and structured. It gives a short but useful description of the background knowledge (GPU data access, Sorting on the GPU) with intuitive examples.
 This paper is a mustread for people want to do DBrelated research on GPU.
 Summary
Written on October 15, 2018