Some Good Papers in SIGMOD 2018 (Research Sessions 1-8)

In this post, I write up my reading report for research papers in SIGMOD 2018. I go through the papers I find interesting and summarize their basic ideas and methods, along with my personal evaluations. Please let me know if you find anything inappropriate.

Session 1: Data Integration & Cleaning

  • Deep Learning for Entity Matching: A Design Space Exploration

    • Summary
      1. This paper proposes a categorization of DL solutions for EM: (1) attribute embedding; (2) attribute similarity representation, consisting of attribute summarization and attribute comparison; (3) a classifier (a minimal sketch of this pipeline appears at the end of this entry).
      2. This paper considers three types of EM problems: structured, textual and dirty.
      3. DL is competitive with the state of the art on structured instances (albeit with far longer training times), while it outperforms the state of the art on textual and dirty instances.
    • Comments
      1. This paper is well-written and provides a good summary of DL solutions in EM. It proposes the design space for DL in EM and summarizes the problem types in EM, i.e., both the problem and solutions are well-defined.
      2. This paper is a good example of how to apply DL in database research: categorizing the problem types, summarizing design space, applying DL and doing empirical evaluations.
      3. It is really important to provide heuristics for the design choices, discussions of why the methods are better, and goals and takeaways for the experiments.
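    • Sketch
      A minimal sketch of the three-part pipeline from Summary item 1 (attribute embedding, attribute similarity representation, classifier), written in plain NumPy purely for illustration; the random embedding matrix, the mean-pooling summarizer, the absolute-difference comparison and the tiny two-layer classifier are my own simplifications, not the paper's actual models.

        import numpy as np

        rng = np.random.default_rng(0)
        VOCAB, DIM, NUM_ATTRS = 1000, 32, 3

        # (1) Attribute embedding: a shared token embedding matrix (random here).
        E = rng.normal(size=(VOCAB, DIM))

        def summarize(token_ids):
            # (2a) Attribute summarization: a simple average of token embeddings;
            # the paper also considers RNN- and attention-based summarizers.
            return E[token_ids].mean(axis=0)

        def compare(u, v):
            # (2b) Attribute comparison: element-wise absolute difference (one of many choices).
            return np.abs(u - v)

        def similarity_features(left, right):
            # left/right: one array of token ids per attribute of each entity.
            return np.concatenate([compare(summarize(a), summarize(b))
                                   for a, b in zip(left, right)])

        # (3) Classifier: a tiny two-layer network producing a match probability.
        W1, b1 = rng.normal(size=(NUM_ATTRS * DIM, 64)), np.zeros(64)
        W2, b2 = rng.normal(size=(64, 1)), np.zeros(1)

        def match_probability(left, right):
            x = similarity_features(left, right)
            h = np.maximum(x @ W1 + b1, 0.0)                 # ReLU
            return float((1.0 / (1.0 + np.exp(-(h @ W2 + b2))))[0])

        left = [rng.integers(0, VOCAB, size=5) for _ in range(NUM_ATTRS)]
        right = [rng.integers(0, VOCAB, size=7) for _ in range(NUM_ATTRS)]
        print(match_probability(left, right))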

Session 2: Usability & Security/Privacy

  • The Data Interaction Game

    • Summary
      1. Users actually interact with DBMSs, during which they learn and modify how to express their information needs; this is rarely captured in current query interfaces, i.e., user feedback is not well utilized and user strategies are assumed to be fixed. Therefore this paper models the interaction as a game with identical interest between two rational agents whose goal is to establish a common language for representing information needs in the form of queries. It further proposes a reinforcement learning method to explore this model.
      2. This paper first uses a reinforcement learning algorithm (Roth and Erev's model) to imitate and model user behavior (justified by empirical analysis), and then uses the same reinforcement learning algorithm to train the interactions between users and DBMSs (a minimal sketch of the Roth-Erev rule appears at the end of this entry).
    • Comments
      1. This paper is quite dry to read, probably because it contains a lot of theoretical material and few illustrative examples. I will probably come back to it once I have a better knowledge of reinforcement learning.
      2. One thing I don't understand: they model user behavior with a reinforcement learning algorithm, and then they train the interactions with the same algorithm. Since the user behavior is already modeled, it doesn't make sense to me that reinforcement learning can also model the real-world interactions. Although I understand that collecting user feedback for training the interactions is difficult, this algorithm still seems to make overly strong assumptions and does not look very practical.
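    • Sketch
      To make Summary item 2 a bit more concrete, here is a minimal sketch of the basic Roth-Erev reinforcement rule the paper builds on: each agent keeps a propensity per action, chooses actions with probability proportional to the propensities, and reinforces the chosen action by the received payoff. The payoff values, the forgetting parameter and the toy loop below are placeholders of my own, not the paper's actual setup.

        import random

        class RothErevAgent:
            def __init__(self, num_actions, forgetting=0.1, init=1.0):
                self.q = [init] * num_actions   # one propensity per action
                self.forgetting = forgetting    # recency/forgetting parameter (assumed value)

            def choose(self):
                # Sample an action with probability proportional to its propensity.
                return random.choices(range(len(self.q)), weights=self.q)[0]

            def reinforce(self, action, payoff):
                # Decay all propensities slightly, then add the payoff to the chosen action.
                self.q = [(1 - self.forgetting) * p for p in self.q]
                self.q[action] += payoff

        # Toy loop: a "user" picks among 3 query strategies and is rewarded when the
        # (hypothetical) DBMS returns relevant results; strategy 2 is made the best one.
        agent = RothErevAgent(num_actions=3)
        for _ in range(200):
            a = agent.choose()
            agent.reinforce(a, payoff=1.0 if a == 2 else 0.1)
        print(agent.q)   # the propensity of strategy 2 should dominate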

Session 3: Transactions & Indexing

  • Carousel: Low-Latency Transaction Processing for Globally-Distributed Data

    • Summary
      1. Most storage systems support geo-distributed and multi-partitioned transactions by layering transaction management, such as concurrency control and two-phase commit (2PC), on top of a consensus protocol, which incurs high latency to commit a transaction because the layered protocols execute sequentially. This paper addresses this by targeting 2-round Fixed-set Interactive (2FI) transactions (where each transaction consists of a round of reads followed by a round of writes, with read and write keys known at the start of the transaction) to enable transaction processing to overlap with 2PC and state replication.
      2. By knowing the read and write sets (the assumption of 2FI transactions), the prepare request for 2PC can be sent along with the read request, so the total number of wide-area network roundtrips (WANRTs) observed by the client is at most two (read/prepare + write/commit); a minimal sketch of this flow appears at the end of this entry.
      3. This paper achieves a fast path for preparing (parallelizing 2PC and consensus) by sending prepare requests to all participants rather than only to participant leaders, which requires certain conditions discussed in the paper. This saves the time the leader would otherwise spend replicating the prepare decision to its followers.
      4. This paper uses two workloads, Retwis and YCSB+T, for comparison against TAPIR.
    • Comments
      1. This is a well-written paper: it gives a very detailed design overview, including assumptions, restrictions and architecture. I think this is really important for system papers.
      2. This paper assumes a restricted transaction model (i.e., 2FI transactions) to parallelize transaction processing, consensus and replication. It points out the drawbacks of this model, i.e., no support for dependent reads and writes, and proposes a solution itself: using reconnaissance transactions (dividing the original transaction into two dependent transactions). I think pointing out the drawbacks of your assumptions directly and providing solutions is a good way to justify them.
      3. When justifying anything (e.g., design assumptions), this paper always uses experimental results or cites other papers; I think it is necessary to justify your assumptions in such a way.
      4. It is a good practice to describe the basic version of the system, and then propose some optimizations on top of it to make everything more elegant.
      5. It is important to discuss failure-tolerance, although it is not necessary to implement it.
      6. It is a good practice to describe how you implement your system, especially for system papers.
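    • Sketch
      A minimal sketch of the client-side flow from Summary item 2: because a 2FI transaction declares its read and write keys up front, the prepare can piggyback on the read round, so the client sees at most two wide-area round trips. The class and method names are my own shorthand for a toy single-copy setting, not the paper's actual RPC interface, and consensus-based replication (which the paper overlaps with these same rounds) is omitted.

        class Partition:
            """Toy single-copy partition; the real system also replicates the prepare and
            commit records via consensus, overlapped with the same two rounds."""
            def __init__(self, data):
                self.data, self.prepared = dict(data), set()

            def read_and_prepare(self, read_keys, write_keys):
                # One message does double duty: serve the reads and vote on the 2PC prepare.
                ok = not any(k in self.prepared for k in write_keys)
                if ok:
                    self.prepared.update(write_keys)
                return {k: self.data[k] for k in read_keys}, ok

            def write_and_commit(self, decision, writes):
                if decision == "commit":
                    self.data.update(writes)
                self.prepared.difference_update(writes)

        def run_2fi(owner, read_keys, write_keys, compute_writes):
            """Client side of a 2FI transaction: exactly two wide-area round trips,
            because the read and write keys are declared at the start (the 2FI assumption)."""
            involved = {owner[k] for k in read_keys | write_keys}
            # Round trip 1: read + prepare, sent together to every involved partition.
            reads, votes = {}, []
            for p in involved:
                r, ok = p.read_and_prepare({k for k in read_keys if owner[k] is p},
                                           {k for k in write_keys if owner[k] is p})
                reads.update(r)
                votes.append(ok)
            decision = "commit" if all(votes) else "abort"
            # Round trip 2: write values (computed from the reads) + commit/abort decision.
            writes = compute_writes(reads) if decision == "commit" else {}
            for p in involved:
                p.write_and_commit(decision, {k: v for k, v in writes.items() if owner[k] is p})
            return decision, reads

        owner = {"x": Partition({"x": 1}), "y": Partition({"y": 2})}
        print(run_2fi(owner, read_keys={"x", "y"}, write_keys={"x"},
                      compute_writes=lambda r: {"x": r["x"] + r["y"]}))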
  • FASTER: A Concurrent Key-Value Store with In-Place Updates

    • Summary
      1. This paper uses epoch-based synchronization combined with trigger actions to facilitate lazy propagation of global changes to all threads. This generalization of the threading model helps simplify the scalable concurrency design.
      2. This paper designs a concurrent, latch-free, resizable, cache-friendly hash index using cache-aligned arrays, atomic operations for deleting keys, tentative bits for latch-free two-phase inserts, the afore-mentioned epoch protection for resizing, and atomic operations for checkpointing.
      3. By combining the hash index with a simple in-memory record allocator such as jemalloc, the authors state that they can easily build an in-memory key-value store.
      4. The epoch-based framework also manages the offloading of log records to secondary storage in a latch-free manner; head and tail offsets control the circular in-memory buffer.
      5. This paper proposes the HybridLog, which combines in-place updates (in memory) with a log-structured organization (on disk) and consists of three regions: a stable region (on secondary storage, append-only), a read-only region (in memory, read-copy-update) and a mutable region (in memory, in-place update). The HybridLog can also be used for checkpointing and recovery. A minimal sketch of the region dispatch appears at the end of this entry.
      6. This paper uses one workload, YCSB-A, for comparison against Masstree, the Intel TBB concurrent hashmap, RocksDB and Redis.
    • Comments
      1. For a system paper, it is important to describe your design goals, user interfaces and architecture to justify that your system is a well-developed system.
      2. One thing I don’t understand: “A thread has guaranteed access to the memory location of a record, as long as it does not refresh its epoch”. Does this mean that this thread’s operations will be coordinated by the epoch framework, so it has the guaranteed access?
      3. This paper is written in an incremental way: describing the basic framework, presenting a basic component, then optimizing this component.
      4. Cache behavior is important for system papers, especially for systems sensitive to I/O latency.
      5. A good design philosophy: make the common case faster.
      6. In-place updates are critical for building a fast in-memory key-value store.
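    • Sketch
      The three HybridLog regions from Summary item 5 boil down to comparing a record's logical address against two offsets. A minimal sketch of that dispatch, with my own function and variable names, ignoring epochs, concurrency and the actual I/O and index updates:

        def hybridlog_region(address, head_offset, read_only_offset):
            """Classify a logical address into one of the three HybridLog regions.
            Offsets grow monotonically: [0, head) is on disk, [head, read_only) is the
            in-memory read-only region, [read_only, tail) is the in-memory mutable region."""
            if address >= read_only_offset:
                return "mutable: update in place in memory"
            if address >= head_offset:
                return "read-only: read-copy-update (append a new version at the tail)"
            return "stable: on disk, append-only; serve via an asynchronous read"

        # Example: addresses below 1000 are flushed to disk, [1000, 4000) is read-only,
        # and everything from 4000 up to the tail is mutable.
        for addr in (4500, 2000, 100):
            print(addr, "->", hybridlog_region(addr, head_offset=1000, read_only_offset=4000))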
  • Workload-Aware CPU Performance Scaling for Transactional Database Systems

    • Summary
      1. This paper proposes POLARIS, an online workload-aware scheduling and frequency scaling algorithm which controls both transaction execution order and processor frequency to minimize CPU power consumption while observing per-workload latency targets.
      2. The control of processor frequency is implemented by DVFS (dynamic voltage and frequency scaling), which is standardized as the Advanced Configuration and Power Interface (ACPI). ACPI defines P-States for different voltage and frequency operating points, and C-States for different idle levels. Linux provides a generic CPU power control module cpufreq.
      3. POLARIS aims to find the smallest processor frequency such that all transactions finish before their deadlines. It does so by estimating transaction execution times using statistical methods (i.e., keeping track of the execution times of past transactions). A minimal sketch of this decision appears at the end of this entry.
      4. This paper gives a theoretical analysis of POLARIS, Yao-Demers-Schenker (YDS) and Optimal Available (OA).
      5. This paper implements the prototype within Shore-MT and uses two workloads: TPC-C and TPC-E for comparison against OS baselines.
    • Comments
      1. This paper researches around an interesting problem where few previous studies exist, and it is also important since power consumption is directly related with the maintenance cost of clusters.
      2. The theoretical part is really interesting, in my opinion, there is nothing better than a combination of good theory and useful system.
      3. They have a training phase for POLARIS to gather information about execution times, but I wonder whether this undermines the online nature of POLARIS.
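    • Sketch
      Summary item 3 is essentially "pick the lowest P-state that still meets every deadline". A minimal sketch of that decision under an earliest-deadline-first order, where the frequency list and the per-transaction cycle estimates are made-up numbers and all of POLARIS's actual scheduling machinery is left out:

        def lowest_feasible_frequency(transactions, frequencies, now=0.0):
            """transactions: list of (estimated_cycles, deadline_seconds) pairs.
            frequencies: available CPU frequencies in cycles/second (the P-states).
            Returns the smallest frequency at which an EDF schedule meets all deadlines,
            or None if even the highest frequency is insufficient."""
            jobs = sorted(transactions, key=lambda t: t[1])      # earliest deadline first
            for f in sorted(frequencies):                        # try the slowest P-state first
                finish, feasible = now, True
                for cycles, deadline in jobs:
                    finish += cycles / f                         # estimated completion time
                    if finish > deadline:
                        feasible = False
                        break
                if feasible:
                    return f
            return None

        # Three hypothetical transactions (cycles, deadline) and four P-states.
        txns = [(2e8, 0.5), (1e8, 0.2), (4e8, 1.0)]
        print(lowest_feasible_frequency(txns, frequencies=[1.2e9, 1.8e9, 2.4e9, 3.0e9]))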

Session 4: Query Processing

  • How to Architect a Query Compiler, Revisited

    • Summary
      1. This paper uses the Futamura projections to link interpreters and compilers through specialization, which is also the guiding principle in the design of query compilers. Partially evaluating an interpreter with respect to a source program produces a compiled version of that program; this is known as the first Futamura projection (a minimal sketch appears at the end of this entry).
      2. For code generation, the query engine has several options for where to place it: (1) pure template expansion: each operator is specialized as a string with placeholders for parameters; (2) programmatic specialization: push the specialization into the structures that make up the query engine; (3) optimized programmatic specialization: push code generation down to the level of primitive types and operations; (4) lightweight modular staging (LMS): an intermediate representation similar to LLVM's.
      3. LB2 (the system built in this paper) uses an abstract class Record as the entry point and an abstract class Buffer as the storage abstraction to support both row and column layouts; it also abstracts the data structures and indexes. LB2 hoists memory allocation and other expensive operations from frequently executed paths to less frequent paths, e.g., pre-allocating memory in advance for hash joins and aggregate queries. LB2 enables parallelism by modifying the internal logic of operators through ParOp.
      4. This paper compares against Postgres, Hyper and DBLAB. It uses TPC-H as the workload.
    • Comments
      1. This paper is well written; however, since I am not familiar with query compilers, I don't understand some parts, e.g., how parallelism works in practice. It is worth reading again if needed.
      2. Although this paper proposes a quite novel way of building query compilers, almost all real-world systems (both in industry and academia) still stick to the old paradigm. It will be interesting to see whether new DBMSs adopt such a design.
      3. I don't know whether code generation is only used inside the database community. I guess the answer is probably yes, since code generation here is simply a technique to convert domain-specific languages (DSLs) into a low-level language for performance's sake.
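    • Sketch
      Summary item 1 (the first Futamura projection) can be illustrated with a toy arithmetic interpreter: partially evaluating the interpreter with respect to a fixed expression yields a "compiled" Python function for that expression. This is my own toy illustration of the principle; LB2 itself generates code via lightweight modular staging in Scala.

        # A tiny expression language: ("lit", n), ("var", name), ("add", a, b), ("mul", a, b)

        def interpret(expr, env):
            """The interpreter: walks the expression tree on every evaluation."""
            tag = expr[0]
            if tag == "lit": return expr[1]
            if tag == "var": return env[expr[1]]
            if tag == "add": return interpret(expr[1], env) + interpret(expr[2], env)
            if tag == "mul": return interpret(expr[1], env) * interpret(expr[2], env)
            raise ValueError(tag)

        def specialize(expr):
            """Partially evaluate the interpreter w.r.t. a fixed expr: emit residual Python
            source (the "compiled" program) and build it once -- the first Futamura projection."""
            def gen(e):
                tag = e[0]
                if tag == "lit": return repr(e[1])
                if tag == "var": return f"env[{e[1]!r}]"
                if tag == "add": return f"({gen(e[1])} + {gen(e[2])})"
                if tag == "mul": return f"({gen(e[1])} * {gen(e[2])})"
                raise ValueError(tag)
            return eval(f"lambda env: {gen(expr)}")   # the residual, interpretation-free program

        query = ("add", ("mul", ("var", "x"), ("lit", 3)), ("lit", 1))   # x * 3 + 1
        compiled = specialize(query)
        print(interpret(query, {"x": 5}), compiled({"x": 5}))            # 16 16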
  • SuRF: Practical Range Query Filtering with Fast Succinct Tries

    • Summary
      1. SuRF is a data structure for approximate membership tests which supports both single-key lookups and range queries. The core data structure in SuRF is the Fast Succinct Trie (FST), which encodes the upper levels with a fast bitmap-based encoding scheme (LOUDS-Dense, speed-efficient) and the lower levels with a space-efficient scheme (LOUDS-Sparse). The Level-Ordered Unary Degree Sequence (LOUDS) traverses the nodes in breadth-first order and encodes each node's degree using unary code (a minimal sketch appears at the end of this entry).
      2. Based on the observation that all bit-sequences require either rank or select support but not both, SuRF further optimizes rank, select and local search. They use sampling with pre-computed values to optimize rank and select, and 128-bit SIMD instructions to perform the label search (which is faster than binary search). They also use prefetching so that relevant addresses in other sequences can be computed for future use.
      3. Since FST itself is a full trie-based index structure, SuRF truncates it (i.e., removes the lower levels and replaces them with suffix bits extracted from the keys) to balance a low false positive rate against small memory usage.
      4. This paper compares against B-trees (and their variants), other succinct tries and Bloom filters. It uses YCSB and synthesized datasets of integer and string keys as the workloads. They also integrate SuRF into RocksDB and run an evaluation on time-series data.
    • Comments
      1. In general, this paper is well written and it gives comprehensive evaluations from both algorithm’s and system’s perspective.
      2. Although the idea of using fast succinct tries is not novel, combining two different encodings together is a good way to balance between speed and space.
      3. I think this paper is influential (and got the best paper award) because it solves a practical problem: range queries in approximate membership tests. That said, its practicality still needs to be proven, since real-world workloads are far more complicated.
      4. It is important to show your techniques have applications when you write the paper, and it will be even better when you can integrate your stuff into a real world widely-used system.
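    • Sketch
      A minimal sketch of the LOUDS idea from Summary item 1: traverse the trie breadth-first and write each node's degree in unary (d ones followed by a zero). The toy trie below is my own example; SuRF's actual LOUDS-Dense/LOUDS-Sparse layouts additionally store labels, child indicators and suffix bits alongside these bits.

        from collections import deque

        def louds(root):
            """root: a trie as nested dicts mapping child label -> subtree.
            Returns the LOUDS bit string: each node's degree in unary, in BFS order."""
            bits, queue = [], deque([root])
            while queue:
                node = queue.popleft()
                children = sorted(node.items())          # deterministic label order
                bits.append("1" * len(children) + "0")   # unary-encoded degree of this node
                queue.extend(child for _, child in children)
            return "".join(bits)

        # Toy trie over the keys "to", "tea", "ten":
        trie = {"t": {"o": {}, "e": {"a": {}, "n": {}}}}
        print(louds(trie))   # degrees in BFS order 1,2,2,0,0,0 -> "10" "110" "110" "0" "0" "0"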

Session 5: Graph Data Management

  • TopPPR: Top-k Personalized PageRank Queries with Precision Guarantees on Large Graphs

    • Summary
      1. The top-k PPR (Personalized PageRank) query is an important building block for web search and social networks, such as Twitter's Who-To-Follow recommendation service. However, previous studies cannot guarantee precision and performance with small overheads at the same time. TopPPR provides a method that (1) is \(\rho\)-precise with at least \(1 - 1/n\) probability, (2) doesn't require preprocessing, and (3) is computationally efficient.
      2. TopPPR first performs a forward search from the source node and then conducts random walks from the nodes with non-zero forward residues; after that, it applies backward search from some target nodes and combines the results with the random walks to estimate PPR values for top-k derivation (a minimal sketch of the forward-push-plus-random-walk part appears at the end of this entry).
      3. TopPPR adopts the filter-refinement paradigm for top-k processing. In the filter step, it computes a rough estimation of each node's PPR, based on which it identifies a candidate node set C; in the refinement step, it iteratively refines the PPR estimation for each node in C until it derives the top-k results with high confidence. Specifically, after the forward search, it utilizes the Bernstein inequality during sampling to obtain a confidence bound, and then employs backward search on the potential nodes to reduce the variation of the confidence bound.
      4. TopPPR employs \(\sqrt{1 - \alpha}\)-walk to improve over random walk with restart, which can update the estimation of more nodes and produce lower variance.
    • Comments
      1. This paper is very well written. It has a good motivation (e.g., a real-world application at Twitter), and it points out the drawbacks of previous studies. It then formulates the problem, describes the basic techniques for PPR computation and explains the state-of-the-art solutions. It then presents the challenges and ideas, and provides an analysis and discussion of the proposed algorithm. The evaluation part is also comprehensive.
      2. Combining randomized methods (e.g., sampling) and deterministic methods (e.g., breadth-first search) adaptively is a fruitful direction.
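    • Sketch
      A minimal sketch of the two ingredients in Summary item 2: a forward (local) push from the source that leaves residues, followed by \(\alpha\)-terminating random walks started from the residues to finish the PPR estimate. This is the generic forward-push-plus-Monte-Carlo combination, not TopPPR's full algorithm (no backward search, no Bernstein-based refinement), and the graph and parameters are a toy example of my own.

        import random
        from collections import defaultdict

        def forward_push(graph, source, alpha, eps):
            """graph: node -> list of out-neighbors. Returns (reserve, residue) maps."""
            reserve, residue = defaultdict(float), defaultdict(float)
            residue[source] = 1.0
            frontier = [source]
            while frontier:
                v = frontier.pop()
                if residue[v] <= eps * max(len(graph[v]), 1):
                    continue
                r, residue[v] = residue[v], 0.0
                reserve[v] += alpha * r                       # settled PPR mass
                share = (1 - alpha) * r / max(len(graph[v]), 1)
                for u in graph[v]:
                    residue[u] += share                       # mass pushed to neighbors
                    frontier.append(u)
            return reserve, residue

        def ppr_estimate(graph, source, alpha=0.2, eps=1e-4, num_walks=20000):
            reserve, residue = forward_push(graph, source, alpha, eps)
            estimate, total = dict(reserve), sum(residue.values())
            if total == 0:
                return estimate
            starts, weights = list(residue), list(residue.values())
            for _ in range(num_walks):
                v = random.choices(starts, weights)[0]        # start proportional to residue
                while graph[v] and random.random() > alpha:
                    v = random.choice(graph[v])               # alpha-terminating random walk
                estimate[v] = estimate.get(v, 0.0) + total / num_walks
            return estimate

        g = {0: [1, 2], 1: [2], 2: [0], 3: [0]}
        print(sorted(ppr_estimate(g, source=0).items(), key=lambda kv: -kv[1]))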

Session 6: Storage & Indexing

  • The Case for Learned Index Structures

    • Summary
      1. This paper observes that B-trees can be seen as a model mapping a key to the position of a record, and utilizes deep-learning to train learned indexes based on this observation. The experimental results show that learned indexes have significant advantages (both in space and speed) over traditional indexes.
      2. For learned indexes, this paper proposes a recursive model index, which is similar to the mixture-of-experts idea in the machine learning community and learns the distribution of the data level by level. It can also include traditional indexes (e.g., B-trees) as nodes in the model tree if the distribution is difficult to learn. During search, it queries the recursive model index and uses either binary or biased quaternary search (a minimal sketch appears at the end of this entry).
      3. Besides range indexes (i.e., B-trees), learned indexes can also work well as point indexes (i.e., hash maps) and existence indexes (e.g., Bloom filters).
    • Comments
      1. As the paper says, the idea of replacing core components of a DBMS with learned models has far-reaching implications for future system designs. Probably led by this paper, there has been a trend of research at the intersection of machine learning and systems, the so-called machine learning for systems and systems for machine learning. I think this is a very promising direction in both systems and machine learning research, with huge real-world impact.
      2. Although this paper doesn't contain much complicated theory or system design, it brings up a fundamental question: can machine learning and systems benefit each other? A great paper not only makes good technical contributions, but also guides the future direction of the whole community.
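    • Sketch
      A minimal sketch of a two-stage recursive model index (Summary item 2): a root linear model routes a key to one of several second-stage linear models, each predicting a position, and the final step is a binary search restricted to that model's observed error bound. Plain least-squares lines stand in for the paper's neural networks, and the synthetic key set, the number of leaf models and the routing formula are my own choices.

        import bisect
        import numpy as np

        class TwoStageRMI:
            def __init__(self, keys, num_leaf_models=16):
                self.keys = np.sort(np.asarray(keys, dtype=float))
                self.n, self.m = len(self.keys), num_leaf_models
                positions = np.arange(self.n, dtype=float)
                # Stage 1: a single linear model over all keys routes each key to a leaf model.
                self.root = np.polyfit(self.keys, positions, 1)
                leaf_ids = self._route(self.keys)
                # Stage 2: one linear model per partition, plus its maximum absolute error.
                self.leaves, self.errors = [], []
                for j in range(self.m):
                    mask = leaf_ids == j
                    coeffs = np.polyfit(self.keys[mask], positions[mask], 1) if mask.sum() > 1 else self.root
                    resid = np.abs(np.polyval(coeffs, self.keys[mask]) - positions[mask])
                    self.leaves.append(coeffs)
                    self.errors.append(int(np.ceil(resid.max())) if mask.any() else self.n)

            def _route(self, keys):
                pred = np.polyval(self.root, keys) * self.m / self.n
                return np.clip(pred, 0, self.m - 1).astype(int)

            def lookup(self, key):
                j = int(self._route(np.array([key]))[0])
                pred = int(np.clip(np.polyval(self.leaves[j], key), 0, self.n - 1))
                err = self.errors[j]
                lo, hi = max(0, pred - err), min(self.n, pred + err + 1)
                i = bisect.bisect_left(self.keys, key, lo, hi)   # "last mile" search in the error window
                return i if i < self.n and self.keys[i] == key else None

        keys = np.sort(np.random.lognormal(size=100_000))
        rmi = TwoStageRMI(keys)
        print(rmi.lookup(keys[12345]))   # -> 12345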
  • Dostoevsky: Better Space-Time Trade-Offs for LSM-Tree Based Key-Value Stores via Adaptive Removal of Superfluous Merging

    • Summary
      1. This paper investigates the space-time trade-off in LSM-trees by introducing Lazy Leveling (which removes merge operations from all levels of the LSM-tree but the largest) and the Fluid LSM-tree (which generalizes the LSM-tree design by parameterizing the merges). They put everything together in the key-value store Dostoevsky, implemented on top of RocksDB, and show that it strictly dominates state-of-the-art designs in terms of performance and storage space.
      2. LSM-tree optimizes for write-heavy workloads, and it organizes runs into L conceptual levels of exponentially increasing sizes. All designs today use either one of two merge policies: tiering or leveling (e.g., Cassandra and RocksDB use tiering and leveling by default, respectively). With tiering, we merge runs within a level only when the level reaches capacity; with leveling, we merge runs within a level whenever a new run comes in.
      3. There is a trade-off between update cost and the costs of lookups and space-amplification. Leveling has strictly better lookup costs and space-amplification and strictly worse update cost than tiering. Furthermore, point lookup cost, long range lookup cost, and space-amplification derive mostly from the largest level, while update cost derives equally from across all levels.
      4. They present lazy leveling, which applies leveling at the largest level and tiering at all other levels. It improves the cost complexity of updates, maintains the same complexity for point lookups, long range lookups, and space-amplification, and provides competitive costs for short range lookups.
      5. They propose the Fluid LSM-tree, a generalization of the LSM-tree that enables switching between and combining merge policies. It does this by controlling the frequency of merge operations separately for the largest level and for all other levels (a minimal sketch appears at the end of this entry).
    • Comments
      1. This paper gives a good introduction of LSM-tree and summarizes its properties and operations in great detail.
      2. This paper is well written and organizes its content as a series of bullet points, making the illustration clear and the transitions smooth.
      3. For the evaluation section, each paragraph can start with the key observation (e.g., Dostoevsky dominates existing systems).
      4. This paper reminds me that traditional data structures can actually be parameterized to allow for flexibility and even better trade-offs.
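    • Sketch
      Summary item 5 can be made concrete with the Fluid LSM-tree's two knobs: a bound of K runs at each of the smaller levels and Z runs at the largest level, for size ratio T. Leveling, tiering and lazy leveling then fall out as special cases. The bounds below follow the paper's description; the little helper itself is my own.

        def max_runs_per_level(policy, num_levels, size_ratio):
            """Return the bound on the number of runs allowed at each level, smallest level
            first: K at levels 1..L-1 and Z at the largest level L (Fluid LSM-tree view)."""
            T = size_ratio
            if policy == "leveling":          # merge on every incoming run, at every level
                K, Z = 1, 1
            elif policy == "tiering":         # merge a level only once it gathers T runs
                K, Z = T - 1, T - 1
            elif policy == "lazy_leveling":   # tiering everywhere except the largest level
                K, Z = T - 1, 1
            else:
                raise ValueError(policy)
            return [K] * (num_levels - 1) + [Z]

        for p in ("leveling", "tiering", "lazy_leveling"):
            print(p, max_runs_per_level(p, num_levels=5, size_ratio=4))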
  • HOT: A Height Optimized Trie Index for Main-Memory Database Systems

    • Summary
      1. This paper presents the Height Optimized Trie (HOT), a fast and space-efficient in-memory index whose core idea is to dynamically vary the number of bits considered at each node, which enables a consistently high fanout and thereby good cache efficiency. They also carefully engineer the layout of each node for compactness and fast search using SIMD instructions.
      2. To use space more efficiently, HOT combines the nodes of a binary trie into compound nodes, featuring a data-dependent span and a fixed maximum fanout. During insertion, there are four cases: a normal insert only modifies an existing node, whereas leaf-node pushdown creates a new node; overflows are handled using either a parent pull-up or intermediate node creation.
      3. As for the node layout, the size of the partial keys and the representation of the bit positions (single-mask or multi-mask) can be adapted to fit the data distribution. Partial keys are used to parallelize the lookup with SIMD instructions.
      4. HOT uses a combination of copy-on-write and CAS for lookups. For modifications, it detects the set of affected nodes and acquires a lock on each of them. It also marks nodes as obsolete instead of directly reclaiming their memory, and uses an epoch-based memory reclamation strategy.
      5. This paper compares against ART, Masstree and the STX B+-tree. The workload is extended from YCSB. It shows that HOT is about 2x more space-efficient, generally outperforms its state-of-the-art competitors in terms of lookup and scan performance, and features the same linear scalability.
    • Comments
      1. The paper follows a general-to-specific style of writing: it first presents the overall data structure, then fills in the missing details. With such a style, it is much easier for readers to grasp the major idea and understand the algorithm.
  • The Data Calculator: Data Structure Design and Cost Synthesis from First Principles and Learned Cost Models

    • Summary
      1. This paper presents the Data Calculator, an interactive and semi-automated design engine for data structures. It offers a set of fine-grained design primitives that capture the first principles of data layout design: how data structure nodes lay data out, and how they are positioned relative to each other. It also supports computation of performance using learned cost models.
      2. The Data Calculator first proposes a set of design primitives as fundamental design choices with different domains. It then introduces elements as a full specification of a single data structure node, which defines the data and the access methods used to access the node's data.
      3. The Data Calculator computes the cost (latency) of running a given workload on a given hardware for a particular data structure specification by analyzing the data access primitives and using learned cost models to synthesize the cost of complex operations. These cost models are trained and fitted for combinations of data and hardware profiles.
      4. The Data Calculator supports what-if design (comparing different design specifications) and auto-completion (benchmarking all candidate elements).
    • Comments
      1. This paper provides lots of well-depicted figures with colorful elements, structured layout and illustrative text. This is a good practice when the concepts are difficult to explain in plain text.
  • A Comparative Study of Secondary Indexing Techniques in LSM-based NoSQL Databases

    • Summary
      1. This paper presents a taxonomy of NoSQL secondary indexes, Embedded Indexes (i.e., lightweight filters embedded inside the primary table) and Stand-Alone Indexes (i.e., separate data structures). They built a system LevelDB++ on top of LevelDB to measure two embedded indexes and three state-of-the-art stand-alone indexes.
      2. The experimental study and theoretical evaluation show that none of these indexing techniques dominate the others: the embedded indexes offer superior write throughput and are more space efficient, whereas the stand-alone indexes achieve faster query response times. Thus the optimal choice of secondary index depends on the application workload.
  • Efficient Selection of Geospatial Data on Maps for Interactive and Visualized Exploration

    • Summary
      1. This paper proposes four features that a map rendering system should support: representativeness, visibility constraint, zooming consistency and panning consistency. They further formulate the Interactive Spatial Object Selection (ISOS) problem and devise a greedy algorithm to address it. They also propose a sampling strategy and a pre-fetching strategy to improve the efficiency of their algorithm.
    • Comments
      1. This paper is well written and easy to read. Also, the authors formulate the problem in a precise and easy-to-understand way, and the solutions are presented with good explanation and illustrative examples.
      2. I think the problem discussed in this paper is pretty important, and those constraints are representative for interactive map exploration. The authors did a good job by formulating the complex problem into a well-formed “mathematical” problem and used good techniques to address it. They also provided interesting optimizations and solid proof.
      3. It is important to prove that your proposed metric is useful, especially by experimental evaluations.

Session 7: Tuning, Monitoring & Query Optimization

  • Query-based Workload Forecasting for Self-Driving Database Management Systems

    • Summary
      1. This paper presents QueryBot 5000, which forecasts the expected arrival rate of queries in the future. It first pre-processes the queries by converting them into templates, maps these templates to the most similar group of previous queries based on their semantics (e.g., the accessed tables) using clustering techniques, and trains models on the large clusters to predict their arrival rates (a minimal sketch appears at the end of this entry).
      2. The features for clustering are based on the arrival rate history. They combine an ensemble of linear regression and an RNN with kernel regression (which is good at predicting spikes) to predict the arrival rate.
    • Comments
      1. This paper is well written and easy to read.
      2. The whole method is essentially a two-step clustering (by query template, then by the history of arrival rates) followed by a simple model that predicts the future arrival rate from past arrival rates. Basically, all the query-specific features (e.g., the logical and physical features mentioned in the paper) are discarded, so it essentially predicts the future from the past without using any DB-related context information. I think it would probably be better to embed more DB information into the feature space.
      3. Also, the whole framework is not end-to-end; there are three components trained or tuned individually. An end-to-end approach may further improve the performance.
      4. If I understand correctly, the system doesn't work for new queries (never seen before by the system). Basically, this system can only find arrival patterns of more or less fixed workloads.
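    • Sketch
      A minimal sketch of the pipeline in Summary item 1: strip constants from each query to get its template, bucket arrivals per template into fixed intervals, and fit a simple linear trend over the history to forecast the next interval. The regexes, the one-minute interval and the plain linear fit are my own simplifications; QueryBot 5000 additionally clusters templates and predicts with an ensemble that includes an RNN and kernel regression.

        import re
        from collections import Counter

        import numpy as np

        def templatize(sql):
            """Replace constants with placeholders so queries of the same shape collapse."""
            sql = re.sub(r"'[^']*'", "?", sql)           # string literals
            sql = re.sub(r"\b\d+(\.\d+)?\b", "?", sql)   # numeric literals
            return re.sub(r"\s+", " ", sql).strip().lower()

        def arrival_history(log, interval=60):
            """log: list of (timestamp_seconds, sql). Returns {template: [count per interval]}."""
            counts = Counter((templatize(sql), int(ts // interval)) for ts, sql in log)
            horizon = max(bucket for _, bucket in counts) + 1
            history = {}
            for (tpl, bucket), c in counts.items():
                history.setdefault(tpl, [0] * horizon)[bucket] = c
            return history

        def forecast_next(series):
            """Linear trend over past arrival rates; a stand-in for the paper's ensemble."""
            slope, intercept = np.polyfit(np.arange(len(series)), series, 1)
            return max(0.0, slope * len(series) + intercept)

        log = [(t, "SELECT * FROM orders WHERE id = %d" % (t % 7)) for t in range(0, 600, 2)]
        for tpl, series in arrival_history(log).items():
            print(tpl, series, "-> next interval:", round(forecast_next(series), 1))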
  • On the Calculation of Optimality Ranges for Relational Query Execution Plans

    • Summary
      1. This paper analyzes the optimality range, which is the range of cardinalities of an intermediate result within which the current plan remains optimal.
      2. To compute the optimality range, they propose Parametric Cost Functions (where the cardinalities of some intermediate results are parameters) and maintain the optimality ranges in a data structure called the Optimal Plans Container. The basic idea is to find the intersection points with all other candidate plans (a minimal sketch appears at the end of this entry).
      3. To keep the candidate space small, they adopt Bellman's principle of optimality (an optimal solution can always be constructed from optimal sub-solutions) to enumerate only plans consisting of sub-plans that are optimal somewhere. They further prune plans whose optimality falls outside the current optimality range. They also derive theoretical worst-case bounds for the number of enumerated pipelines.
      4. Optimality ranges can be used in the following cases: (1) execution plan caching: once the cardinality falls outside its optimality range, the cached plan is no longer optimal and is evicted from the cache; (2) parametric queries: store a range in which a plan is optimal instead of a single cost point (i.e., one configuration of parameters); (3) mid-query re-optimization: deciding whether the optimizer should be invoked again, whereas current approaches use simple heuristics or consider only a subset of the alternatives.
      5. The workloads are TPC-H, Join Order Benchmark (JOB) and a generated one. However, the improvement for mid-query re-optimization is not very impressive (around 20%).
    • Comments
      1. Although the authors state that they consider all alternative plans, they actually make some assumptions about the possible search space.
      2. Optimality ranges are an interesting and important topic in query optimization. I think we may be able to design a better query optimizer with small computational overhead by using these optimality ranges.
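    • Sketch
      In the simplest case, Summary items 1 and 2 reduce to intersecting parametric cost functions that are linear in one intermediate-result cardinality: a plan's optimality range is the interval of cardinalities where its cost line lies below all others. A minimal sketch with made-up cost coefficients (the paper handles general PCFs, pruning and the Optimal Plans Container on top of this):

        def optimality_range(plan, others, c_max=1e9):
            """Each plan is (name, a, b) with cost(c) = a * c + b, where c is the cardinality
            parameter. Returns the interval [lo, hi] of cardinalities in [0, c_max] where
            `plan` is the cheapest, or None if it is never optimal there."""
            _, a, b = plan
            lo, hi = 0.0, c_max
            for _, a2, b2 in others:
                if a == a2:
                    if b2 < b:
                        return None          # parallel cost lines, the other is always cheaper
                    continue
                crossing = (b2 - b) / (a - a2)
                if a < a2:
                    lo = max(lo, crossing)   # our flatter line wins above the crossing point
                else:
                    hi = min(hi, crossing)   # our steeper line wins below the crossing point
            return (lo, hi) if lo <= hi else None

        plans = [("hash_join", 1.2, 5000.0), ("index_nl_join", 4.0, 100.0), ("merge_join", 1.5, 3000.0)]
        for p in plans:
            print(p[0], optimality_range(p, [q for q in plans if q is not p]))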
  • Adaptive Optimization of Very Large Join Queries

    • Summary
      1. This paper presents an adaptive optimization framework that scales to queries with thousands of joins. It uses the search space linearization technique to find near-optimal execution plans for large classes of queries.
      2. PostgreSQL uses dynamic programming to find the optimal join order for queries with fewer than 12 relations and switches to a genetic algorithm for larger queries, while DB2 uses dynamic programming and switches to a greedy strategy when queries become large.
      3. For small queries (fewer than 14 relations, or where the number of connected subgraphs is within the chosen budget of 10,000), the framework uses DPHyp. For medium queries (up to 100 relations), the framework first uses IKKBZ to linearize the search space into a linear ordering of relations and restricts the DP algorithm to consider only connected subchains of this linear ordering. For large queries, it adopts the idea of Iterative DP: first constructing an execution plan using a greedy algorithm (Greedy Operator Ordering, sketched at the end of this entry), and then improving that plan by running a more expensive optimization algorithm (the one for medium queries) on subplans of size up to k (where k can be changed to control the optimization time).
      4. This paper uses standard benchmarks: TPC-H, TPC-DS, LDBC BI, Join Order Benchmark, SQLite test suite. They also generate synthetic queries for scalability experiments.
    • Comments
      1. This paper is well written and easy to read. It also offers a pretty useful related work section that covers almost all the methods for finding join orders, and it essentially provides a hybrid approach that uses different methods for different cases.
      2. It provides a section of implementation details, which is pretty important for system papers.
      3. I think this paper is not only a comprehensive summary of previous studies on join ordering, but also provides a practical way to handle joins in different cases.
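    • Sketch
      Summary item 3 mentions Greedy Operator Ordering as the starting point for large queries: repeatedly join the pair of subplans whose join result is smallest. A minimal sketch under a simple independence-based cardinality model; the relation sizes and selectivities are invented, and the real framework then re-optimizes subplans of size up to k with the DP/IKKBZ machinery.

        from itertools import combinations

        def greedy_operator_ordering(cardinalities, selectivities):
            """cardinalities: {relation: row count}; selectivities: {(r1, r2): join selectivity}.
            Builds a join tree by always joining the pair with the smallest estimated result."""
            def sel(x, y):   # combined selectivity between two subplans (1.0 if no predicate)
                s = 1.0
                for r1 in x["rels"]:
                    for r2 in y["rels"]:
                        s *= selectivities.get((r1, r2), selectivities.get((r2, r1), 1.0))
                return s

            plans = [{"tree": r, "rels": {r}, "card": c} for r, c in cardinalities.items()]
            while len(plans) > 1:
                a, b = min(combinations(plans, 2),
                           key=lambda p: p[0]["card"] * p[1]["card"] * sel(p[0], p[1]))
                joined = {"tree": (a["tree"], b["tree"]),
                          "rels": a["rels"] | b["rels"],
                          "card": a["card"] * b["card"] * sel(a, b)}
                plans = [p for p in plans if p is not a and p is not b] + [joined]
            return plans[0]

        cards = {"A": 1000, "B": 10, "C": 100, "D": 10000}
        sels = {("A", "B"): 0.01, ("B", "C"): 0.05, ("C", "D"): 0.001, ("A", "D"): 0.02}
        result = greedy_operator_ordering(cards, sels)
        print(result["tree"], round(result["card"], 1))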
  • Improving Join Reorderability with Compensation Operators

    • Summary
      1. This paper presents a novel approach to the join reordering problem for queries involving inner joins, single-sided outer joins, and/or antijoins, i.e., it provides a more comprehensive enumeration of possible join orders.
      2. There are two state-of-the-art approaches to the join reordering problem: (1) the Transformation-Based Approach (TBA), which enumerates all valid join reorderings using the associativity and commutativity properties of the join operators; (2) the Compensation-Based Approach (CBA), which permits certain invalid join reorderings as long as they can be compensated to become valid. This paper proposes an algorithm based on the compensation-based approach, introducing two new compensation operators that allow more join reordering rules.
      3. This paper presents an enumeration algorithm based on the reordering algorithm, and optimizes it to enable reuse of query subplans by finding equivalences while considering the dependencies of query nodes.
      4. This paper uses TPC-H as its workload and implements the algorithm in PostgreSQL.
    • Comments
      1. This paper essentially provides a way to extend the search space of possible join orders and explains how to explore this search space.
      2. The idea of compensation is a really interesting topic in many areas: by introducing compensation operations, we can apply some “invalid” transformations to gain flexibility and performance improvements.

Session 8: Spatial Data & Streams

  • DITA: Distributed In-Memory Trajectory Analytics

    • Summary
      1. This paper presents DITA, a distributed in-memory trajectory analytics system that supports trajectory similarity search and join (both threshold-based and KNN-based) with numerous similarity functions. The core idea is a filter-verification framework that employs a trie-based method to efficiently search for similar trajectories. To distribute the computation, a partitioning method is proposed based on the trie indexing mechanism, and a cost model is constructed to balance the workloads. The whole system is built on Spark.
    • Comments
      1. This is my paper and I think it is good. It supports huge volumes of trajectories and is significantly better than the state of the art.
  • Sketching Linear Classifiers over Data Streams

    • Summary
      1. This paper presents the Weight-Median Sketch for learning linear classifiers over data streams that supports approximate retrieval of the most heavily-weighted features. In other words, it does dimension reduction in the streaming setting.
      2. It essentially adopts the idea of the count-sketch to store the gradients (for training the classifier) in a sketch-like structure. For the “active set”, they simply introduce a heap to store the heavy hitters (i.e., heavy weights). A minimal sketch appears at the end of this entry.
      3. They show three applications: streaming explanation (outlier detection), network monitoring (identifying differences between concurrent traffic streams), streaming pointwise mutual information (finding correlation between events).
    • Comments
      1. This paper is well written. It sort of “transfers” an idea from one sub-area to another, which I think is interesting.
      2. It is important to show some real world applications in the paper.
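    • Sketch
      A minimal sketch of the core idea in Summary item 2: keep the classifier's weights in a Count-Sketch (gradient updates are hashed, with random signs, into a small table; a weight is read back as the median over the table's rows) plus a small map of the heaviest features. This is the generic Count-Sketch recipe wired to SGD on the logistic loss, not the paper's tuned Weight-Median Sketch; the depth, width, learning rate and toy data are arbitrary choices of mine.

        import hashlib
        import heapq
        import math

        class SketchedLinearClassifier:
            def __init__(self, depth=5, width=256, top_k=10, lr=0.1):
                self.depth, self.width, self.top_k, self.lr = depth, width, top_k, lr
                self.table = [[0.0] * width for _ in range(depth)]
                self.heavy = {}                              # feature -> last estimated weight

            def _cells(self, feature):
                # One (row, column, sign) triple per sketch row, derived from a hash.
                for row in range(self.depth):
                    digest = hashlib.blake2b(f"{row}:{feature}".encode(), digest_size=8).digest()
                    v = int.from_bytes(digest, "big")
                    yield row, (v >> 1) % self.width, 1.0 if v & 1 else -1.0

            def weight(self, feature):
                vals = sorted(sign * self.table[row][col] for row, col, sign in self._cells(feature))
                return vals[len(vals) // 2]                  # median estimate of the weight

            def update(self, features, label):
                """features: {feature: value}, label in {-1, +1}; one SGD step on logistic loss."""
                margin = label * sum(v * self.weight(f) for f, v in features.items())
                grad_scale = -label / (1.0 + math.exp(margin))
                for f, v in features.items():
                    delta = -self.lr * grad_scale * v
                    for row, col, sign in self._cells(f):
                        self.table[row][col] += sign * delta # hashed, signed accumulation
                    self.heavy[f] = self.weight(f)
                if len(self.heavy) > 4 * self.top_k:         # keep only the heavy hitters
                    self.heavy = dict(heapq.nlargest(self.top_k, self.heavy.items(),
                                                     key=lambda kv: abs(kv[1])))

            def top_features(self):
                return heapq.nlargest(self.top_k, self.heavy.items(), key=lambda kv: abs(kv[1]))

        clf = SketchedLinearClassifier()
        for i in range(2000):
            x = {"bias": 1.0, f"word_{i % 50}": 1.0}
            y = +1 if i % 50 < 5 else -1                     # a few "words" are predictive
            clf.update(x, y)
        print(clf.top_features())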
Written on June 15, 2018