Some Good Papers in SIGMOD 2018 (Research Sessions 18)
In this post, I write my reading report for research papers in SIGMOD 2018. I will go through all the papers I feel interesting and write their basic ideas, methods and my personal evaluations. Please let me know if you find anything inappropriate.
Session 1: Data Integration & Cleaning

Deep Learning for Entity Matching: A Design Space Exploration
 Summary
 This paper proposes a categorization of DL solutions in EM: (1) attribute embedding; (2) attribute similarity representation: attribute summarization and attribute comparison; (3) classifier.
 This paper considers three types of EM problems: structured, textual and dirty.
 DL is competitive with stateoftheart on structured instances with far longer training time, while it outperforms stateoftheart on textual and dirty instances.
 Comments
 This paper is wellwritten and provides a good summary of DL solutions in EM. It proposes the design space for DL in EM and summarizes the problem types in EM, i.e., both the problem and solutions are welldefined.
 This paper is a good example of how to apply DL in database research: categorizing the problem types, summarizing design space, applying DL and doing empirical evaluations.
 It is really important to provide heuristics for the design choices, discussions for why the methods are better, goals and takeaways for the experiments.
 Summary
Session 2: Usability & Security/Privacy

The Data Interaction Game
 Summary
 Users actually interact with DBMSs during which they learn and modify how to express their information needs, which is rarely captured in current query interfaces, i.e., the user feedbacks are not well utilized or user strategies are assumed fixed. Therefore this paper models this as a game with identical interest between two rational agents whose goal is to establish a common language for representing information needs in form of queries. It further proposes a reinforcement learning method to explore this model.
 This paper firstly uses a reinforcement learning algorithm (Roth and Erev’s model) to imitate and model the user behavior (justified by empirical analysis), then uses the same reinforcement algorithm to train the interactions between users and DBMSs.
 Comments
 This paper is too dry to read, probably because it contains too much theoretical stuff and few illustrative examples. Probably I will read back when I have an better knowledge of reinforcement learning.
 One thing I don’t understand is that they model the user behaviors with a reinforcement learning algorithm, and then they train the interactions with the same algorithm. Since the user behavior is already modeled, it doesn’t make sense for me that reinforcement learning can model the interactions in real world. Although I understand that collecting user feedbacks for training the interactions are difficult, this algorithm still seems making too strong assumptions and not very practical.
 Summary
Session 3: Transactions & Indexing

Carousel: LowLatency Transaction Processing for GloballyDistributed Data
 Summary
 Most storage systems support geodistributed and multipartitioned transactions by layering transaction management, such as concurrency control and twophase commit (2PC), on top of a consensus protocol, which incurs high latency to commit a transaction because it sequentially executes the layered protocols. This paper addresses this by targeting 2round Fixedset Interactive (2FI) transactions (where each transaction consists of a round of reads followed by a round of writes with read and write keys that are known at the start of the transaction) to enable transaction processing to overlap with 2PC and state replication.
 By knowing the read and write sets (the assumption of 2FI transactions), the prepare request for 2PC can be sent along with the read request for reading data, therefore the total number of widearea network roundtrips (WANRTs) observed by the client is at most two (read/prepare + write/commit).
 This paper achieves the fast path for preparing (parallelizing 2PC and consensus) by sending prepare requests to all participants rather than participant leaders, which require some conditions as discussed in the paper. This actually saves the time that the leader replicates the prepare decision to its followers.
 This paper uses two workloads: Retwis and YCSB+T, for comparison against TAPIR.
 Comments
 This paper is a wellwritten paper, it gives a very detailed design overview, including assumptions, restrictions and architecture. I think this is really important for system papers.
 This paper assumes a restricted transaction (i.e., 2FI transaction) model to parallelize transaction processing, consensus and replication. It points out the drawbacks of this model, i.e., no support for dependent reads and writes, and proposes the solution itself: using reconnaissance transactions (divide the original transaction into two dependent transactions). I think it is a good way to justify your assumptions by pointing out the drawbacks directly and providing solutions.
 When justifying anything (e.g., design assumptions), this paper always uses experimental results or cite other papers, I think it is necessary to justify your assumptions in such a way.
 It is a good practice to describe the basic version of the system, and then propose some optimizations on top of it to make everything more elegant.
 It is important to discuss failuretolerance, although it is not necessary to implement it.
 It is a good practice to describe how you implement your system, especially for system papers.
 Summary

FASTER: A Concurrent KeyValue Store with InPlace Updates
 Summary
 This paper uses a epochbased synchronization combined with trigger actions to facilitate lazy propagation of global changes to all threads. This generalization of threading model helps simplifying the scalable currency design.
 This paper designs a concurrent latchfree resizable cachefriendly hash index by cachealigned arrays, atomic operations for deleting keys, tentative bits for latchfree twophase insert, aforementioned epoch protection to performing resizing, and atomic operations for checkpointing.
 By combing the hash index with a simple inmemory record allocator such as jemalloc, this paper states that they can build a inmemory keyvalue store easily.
 Epochbased framework also manages the loading of log records to secondary storage in a latchfree manner, they use head and tail offsets to control the circular inmemory buffer.
 This paper proposes the HybridLog, which combines inplace updates (in memory) and logstructured organization (ondisk), and consists of three regions: stable region (on secondary storage and appendonly), readonly region (inmemory and readcopyupdate) and mutable (inmemory and inplace update). The HybridLog can also be used for checkpointing and recovery.
 This paper uses one workload: YCSBA, for comparison against Masstree, Intel TBB concurrent hashmap, RocksDB and Redis.
 Comments
 For a system paper, it is important to describe your design goals, user interfaces and architecture to justify that your system is a welldeveloped system.
 One thing I don’t understand: “A thread has guaranteed access to the memory location of a record, as long as it does not refresh its epoch”. Does this mean that this thread’s operations will be coordinated by the epoch framework, so it has the guaranteed access?
 This paper is written in a incremental way: describing the basic framework, presenting a basic component, optimizing this component.
 Cachebehavior is important for system papers, especially for systems sensitive to IO latency.
 A good design philosophy: make the common case faster.
 Inplace updates are critical for building a fast inmemory keyvalue store.
 Summary

WorkloadAware CPU Performance Scaling for Transactional Database Systems
 Summary
 This paper proposes an online workloadaware scheduling and frequency scaling algorithm POLARIS, which controls both transaction execution order and processor frequency to minimize CPU power assumption while observing perworkload latency targets.
 The control of processor frequency is implemented by DVFS (dynamic voltage and frequency scaling), which is standardized as the Advanced Configuration and Power Interface (ACPI). ACPI defines PStates for different voltage and frequency operating points, and CStates for different idle levels. Linux provides a generic CPU power control module cpufreq.
 POLARIS aims to find the smallest processor frequency such that all transactions will finish running before their deadlines. It implements by estimating the execution time of transaction using statistical methods (i.e., keeping track of execution time of transactions in the past).
 This paper gives a theoretical analysis of POLARIS, YaoDemersSchenker (YDS) and Optimal Available (OA).
 This paper implements the prototype within ShoreMT and uses two workloads: TPCC and TPCE for comparison against OS baselines.
 Comments
 This paper researches around an interesting problem where few previous studies exist, and it is also important since power consumption is directly related with the maintenance cost of clusters.
 The theoretical part is really interesting, in my opinion, there is nothing better than a combination of good theory and useful system.
 They have a train phase for POLARIS to gain information about execution time, but I wonder if this obviates the online algorithm nature of POLARIS.
 Summary
Session 4: Query Processing

How to Architect a Query Compiler, Revisited
 Summary
 This paper uses Futamura projections to link interpreters and compilers through specialization, which is also the guiding principle in the design of query compilers. Partially evaluating an interpreter with respect to a source program produces a compiled version of that program, is known as the first Futamura projection.
 For code generation, the query engine have different options to place it, (1) pure template expansions: each operator is specialized as a string with placeholders for parameters; (2) programmatic specialization: push the specialization into the structures that make up the query engine; (3) optimized programmatic specialization: push code generation to the level of primitive types and operations; (4) lightweight modular staging (LMS): an intermediate representation similar with LLVM.
 LB2 (the system built in this paper) uses an abstract class Record as the entry point and an abstract class Buffer as the storage to support both row and column layout; it also abstracts the data structures, indexes. LB2 also hoists memory allocation and other expensive operations from frequently executed paths to less frequent paths, e.g., preallocating memory in advance for hash join and aggregate queries. LB2 enabled parallelism by modifying the internal logic of operators through ParOp.
 This paper compares against Postgres, Hyper and DBLAB. It uses TPCH as the workload.
 Comments
 This paper is well written, however I am not familiar with query compiler, I don’t understand some parts, e.g., how parallelism works in practice. It is worth reading it again if needed.
 Although this paper proposes a quite novel way for compiling query, almost all real world systems (both in industry and academia) still stick to the old paradigm. It will be interesting to see if new DBMSs adopt such design.
 I don’t know if code generation is only used inside database community. I guess the answer is probably yes since code generation is simply a technique to convert domainspecific languages (DSLs) to lowlevel language for performance’s sake.
 Summary

SuRF: Practical Range Query Filtering with Fast Succinct Tries
 Summary
 SuRF is a data structure for approximate membership test, which supports both singlekey lookups and range queries. The core data structure in SuRF is the Fast Succinct Tries (FST), which encodes upper levels with a fast bitmapbased encoding scheme (LOUDSDense, speedefficient) and encodes lower levels with a space efficient LOUDSSparse schema. LevelOrdered Unary Degree Sequence (LOUDS) traverses the nodes in a breadthfirst order and encodes each node’s degree using the unary code.
 Based on the observation that all bitsequences require either rank or select support but not both, SuRF further optimizes rank, select and local search. They use sampling with precomputed values to optimize rank and select, and 128bit SIMD instructions to perform the label search (which is faster than binary search). They also use prefetching so that relevant addresses in other sequences can be computed for future use.
 Since FST itself is a triebased index structure, SuRF must truncate it (i.e., remove lower levels and replace them with suffix bits extracted from the key) to balance between a low false positive rate with small memory usage.
 This paper compares against Btree (and its variants), other succinct tries and Bloom filter. It uses YCSB and a synthesized dataset of integer and string keys as the workloads. They also integrate SuRF into RocksDB and run a evaluation of timeseries data.
 Comments
 In general, this paper is well written and it gives comprehensive evaluations from both algorithm’s and system’s perspective.
 Although the idea of using fast succinct tries is not novel, combining two different encodings together is a good way to balance between speed and space.
 I think this paper is influential (and got the best paper award) because it states that it solves a practical problem: range query of membership test. Although the the practicalness still needs to be proved since real world workloads are far more complicated.
 It is important to show your techniques have applications when you write the paper, and it will be even better when you can integrate your stuff into a real world widelyused system.
 Summary
Session 5: Graph Data Management

TopPPR: Topk Personalized PageRank Queries with Precision Guarantees on Large Graphs
 Summary
 Topk PPR (Personalized PageRank) query is an important building block for web search and social networks, such as Twitter’s WhoToFollow recommendation service. However, previous studies cannot guarantee the precision and performance with small overheads at the same time. TopPPR provides a method that (1) is precise with at least 1  1/n probability, (2) doesn’t require preprocessing, and (3) is computationally efficient.
 TopPPR first performs forward search from the source node, and then conducts random walks from those nodes with nonzero forward residues; after that, it applies backward search from some target nodes and combines the results with the random walks to estimate PPR values for topk derivation.
 TopPPR adopts the filterrefinement paradigm for topk processing. In the filter step, it computes a rough estimation of each node’s PPR, based on which it identifies a candidate node set C; and then in the refinement step, it iteratively refines the PPR estimation for each node in C, until it derives the topk results with high confidence. Specifically, after forward search, during sampling, it utilizes Bernstein inequality to find a confidence bound, then it employs backward search on those potential nodes to reduce the variation of the confidence bound.
 TopPPR employs walk to improve over random walk with restart, which can update the estimation of more nodes and produce lower variance.
 Comments
 This paper is very well written. It has a good motivation (e.g., real world application in Twitter), and it points the drawbacks of previous studies. Then it formulates the problem, and describes the basic techniques for PPR computation and explains the stateoftheart solutions. Then it mentions the challenges, ideas and provides an analysis and discussion of the proposed algorithm. The evaluation part is also comprehensive.
 It is a fruitful direction combining random methods (e.g., sampling) and deterministic methods (e.g., breadthfirst search) adaptively.
 Summary
Session 6: Storage & Indexing

The Case for Learned Index Structures
 Summary
 This paper observes that Btrees can be seen as a model mapping a key to the position of a record, and utilizes deeplearning to train learned indexes based on this observation. The experimental results show that learned indexes have significant advantages (both in space and speed) over traditional indexes.
 For learned indexes, this paper proposes a recursive model index which is similar with mixture of experts in machine learning community and learns the distribution of data level by level. Further, it can also include traditional indexes (e.g., Btrees) as a node in the model tree if the distribution is difficult to learn. During search, it queries the recursive model index and uses either binary or biased quaternary search.
 Besides range index (i.e., Btrees), learned indexes can also work well on point index (i.e., hash maps) and existence index (e.g., bloom filter).
 Comments
 As the paper says, the idea of replacing core components of a DBMS through learned models have far reaching implications for future system designs. Probably led by this paper, there has been a trend for researching the intersection of machine learning and systems, which are so called machine learning for systems and systems for machine learning. I think this area is a very promising direction in both system and machine learning research with huge real world impact.
 Although this paper doesn’t contain much complicated theory or system design, it brings up a fundamental problem: can machine learning and system benefit each other? Therefore, a great paper is not only just making good technical contributions, but also guiding the future direction for the whole community.
 Summary

Dostoevsky: Better SpaceTime TradeOffs for LSMTree Based KeyValue Stores via Adaptive Removal of Superfluous Merging
 Summary
 This paper investigates the the spacetime tradeoff in LSMtree by introducing Lazy Leveling (which removes merge operations from all levels of LSMtree but the largest) and Fluid LSMtree (which generalizes the entire LSMtree design by parameterizing the merges). They put everything together to design the keyvalue store Dostoevsky and implemented it on top of RocksDB, and they show that it strictly dominates stateoftheart designs in terms of performance and storage space.
 LSMtree optimizes for writeheavy workloads, and it organizes runs into L conceptual levels of exponentially increasing sizes. All designs today use either one of two merge policies: tiering or leveling (e.g., Cassandra and RocksDB use tiering and leveling by default, respectively). With tiering, we merge runs within a level only when the level reaches capacity; with leveling, we merge runs within a level whenever a new run comes in.
 There is a tradeoff between update cost and the costs of lookups and spaceamplification. Leveling has strictly better lookup costs and spaceamplification and strictly worse update cost than tiering. Furthermore, point lookup cost, long range lookup cost, and spaceamplification derive mostly from the largest level, while update cost derives equally from across all levels.
 They represent lazy leveling, which applies leveling at the largest level and tiering at all other levels. It improves the cost complexity of updates, maintains the same complexity for point lookups, long range lookups, and spaceamplification, and provide competitive for short range lookups.
 They propose fluid LSMtree, a generalization of LSMtree that enables switching and combining merge policies. It does this by controlling the frequency of merge operations separately for the largest level and for all other levels.
 Comments
 This paper gives a good introduction of LSMtree and summarizes its properties and operations in great detail.
 This paper is well written and organizes its content in array of bullet points, making the illustration clear to understand and the transition smooth.
 For the evaluation section, we can start each paragraph with the key observation (e.g., Dostoevsky dominates existing systems).
 This paper reminds me that for those traditional data structures, they can actually be parameterized to allow for flexibility and even better tradeoff.
 Summary

HOT: A Height Optimized Trie Index for MainMemory Database Systems
 Summary
 This paper presents the Height Optimized Trie (HOT), a fast and spaceefficient inmemory index whose core idea is to dynamically vary the number of bits considered at each node, which enables a consistently high fanout and thereby good cache efficiency. They also carefully engineer the layout of each node for compactness and fast search using SIMD instructions.
 To use the space more efficiently, HOT combines the nodes of a binary trie into compound nodes, and it features a datadependent span and a fixed maximum fanout. During insertion, there are four cases: a normal insert only modifies an existing node, whereas leafnode pushdown creates a new node. Overflows are either handled using a parent pull up or intermediate node creation.
 As for the node layout, the size of the partial keys and the representation of the bit positions (singlemask or multimask) can be adapted to fit the data distribution. Partial keys are used to parallel the lookup with SIMD instructions.
 HOT uses the combination of copyonwrite and CAS for lookup. For modification, it detects the set of affected nodes and acquire a lock for each of them. It also marks nodes as obsolete instead of directly reclaiming the nodes’ memory and HOT uses a epochbased memory reclamation strategy.
 This paper compares with ART, Masstree, STX B+tree. The workload is extended based on YCSB. It shows it is 2x space efficient, generally outperforms its stateoftheart competitors in terms of lookup and scan performance and it features the same linear scalability.
 Comments
 The paper follows a generaltospecific style of writing, it firstly presents the overall data structure, then fills missing details. By adopting such style, it is much easier for readers to grasp the major idea and understand the algorithm.
 Summary

The Data Calculator: Data Structure Design and Cost Synthesis from First Principles and Learned Cost Models
 Summary
 This paper presents the Data Calculator, an interactive and semiautomated design engine for data structures. It offers a set of finegrained design primitives that capture the first principles of data layout design: how data structure nodes lay data out, and how they are positioned relative to each other. It also supports computation of performance using learned cost models.
 The Data Calculator firstly proposes a set of design primitives as fundamental design choices with different domains. Then it introduces elements as a full specification of a single data structure node which defines the data and access methods used to access the node’s data.
 The Data Calculator computes the cost (latency) of running a given workload on a given hardware for a particular data structure specification by analyzing the data access primitives and using learned cost models to synthesize the cost of complex operations. These cost models are trained and fitted for combinations of data and hardware profiles.
 The Data Calculator supports whatif design (comparing different design specifications) and autocompletion (benchmarking all candidate elements).
 Comments
 This paper provides lots of welldepicted figures with colorful elements, structured layout and illustrative text. This is a good practice when the concepts are difficult to explain in plain text.
 Summary

A Comparative Study of Secondary Indexing Techniques in LSMbased NoSQL Databases
 Summary
 This paper presents a taxonomy of NoSQL secondary indexes, Embedded Indexes (i.e., lightweight filters embedded inside the primary table) and StandAlone Indexes (i.e., separate data structures). They built a system LevelDB++ on top of LevelDB to measure two embedded indexes and three stateoftheart standalone indexes.
 The experimental study and theoretical evaluation show that none of these indexing techniques dominate the others: the embedded indexes offer superior write throughput and are more space efficient, whereas the standalone indexes achieve faster query response times. Thus the optimal choice of secondary index depends on the application workload.
 Summary

Efficient Selection of Geospatial Data on Maps for Interactive and Visualized Exploration
 Summary
 This paper proposes four features that a map rendering system shall support: representativeness, visibility constraint, zooming consistency and panning consistency. They further propose the problem of Interactive Spatial Object Selection (ISOS) problem and devise a greedy algorithm to address it. They also propose to use sampling strategy and prefetching strategy to improve the efficiency of their algorithm.
 Comments
 This paper is well written and easy to read. Also, the authors formulate the problem in a precise and easytounderstand way, and the solutions are presented with good explanation and illustrative examples.
 I think the problem discussed in this paper is pretty important, and those constraints are representative for interactive map exploration. The authors did a good job by formulating the complex problem into a wellformed “mathematical” problem and used good techniques to address it. They also provided interesting optimizations and solid proof.
 It is important to prove that your proposed metric is useful, especially by experimental evaluations.
 Summary
Session 7: Tuning, Monitoring & Query Optimization

Querybased Workload Forecasting for SelfDriving Database Management Systems
 Summary
 This paper presents QueryBot 5000 that forecasts the expected arrival rate of queries in the future. It firstly preprocesses the queries by converting them into templates, maps these templates to the most similar group of previous queries based on its semantics (e.g., the accessed tables) with clustering techniques, and train those big clusters to predict the arrival rates.
 The features for clustering are based on the arrival rate history. They combine the ensemble of linear regression and RNN and kernel regression (which is good at predicting spikes) to predict the arrival rate.
 Comments
 This paper is well written and easy to read.
 The whole method essentially is a twostep clustering (by query template then by the history of arrival rate) and then a simple model predicting the future arrival rate based on the past arrival rates. Basically all the queryspecific (e.g., the logical and physical features mentioned in the paper) are all discarded, therefore it essentially predicts the future based on the past without using any DB related context information. I think probably it is better to embed more DB information into the feature space.
 Also, the whole framework is not endtoend, there are three components trained or tuned individually. An endtoend approach may further improve the performance.
 If I understand correctly, the system doesn’t work for new query (never seen before by this system). Basically this system can only find arrival patterns of some kind of fixed workloads.
 Summary

On the Calculation of Optimality Ranges for Relational Query Execution Plans
 Summary
 This paper analyzes the optimality range, which is the range of cardinality of an intermediate result where the current plan remains optimal. To compute the optimal range, they propose Parametric Cost Function (where the cardinalities of some intermediate results are parameters) and compute the intersection points of all possible plans.
 To compute the optimal range, they propose Parametric Cost Function (where the cardinalities of some intermediate results are parameters) and maintains these optimality ranges by a data structure Optimal Plans Container. The basic idea is to find the intersection points with other candidate plans.
 To make the space of candidates small, they adopt the Bell’s principle of optimality (an optimal solution can be always constructed from optimal sub solutions) to only enumerate only plans consisting of sub plans that are somewhere optimal. They further prune those plans whose optimality fall out of the current optimality range. They also derive the theoretical worse case bounds for the number of enumerated pipelines.
 Optimality ranges can be used in the following cases: (1) execution plan caching: once the cardinality is out of its optimality range, the cached plan is not optimal anymore and evicted from the cache; (2) parametric queries: store a range in which a plan is optimal instead of a cost point (i.e., a configuration of parameters); (3) midquery reoptimization: deciding if the optimizer should be invoked again while the current approach uses simple heuristics or considers only a subset of the alternatives.
 The workloads are TPCH, Join Order Benchmark (JOB) and a generated one. However, the improvement for midquery reoptimization is not very impressive (around 20%).
 Comments
 Although the authors stated that they considered all the alternative plans, actually they made some assumptions on the possible search space.
 The optimality ranges are an interesting and important topic in query optimizer. I think probably we may design a better query optimizer with small computation overheads by using these optimality ranges.
 Summary

Adaptive Optimization of Very Large Join Queries
 Summary
 This paper presents an adaptive optimization framework that scales to queries with thousands of joins. It uses the search space linearization technique to find nearoptimal execution plans for large classes of queries.
 PostgreSQL uses dynamic programming to find the optimal join order for queries with less than 12 relations and switches to genetic algorithms for larger queries, while DB2 uses dynamic programming and switches to a greedy strategy when the queries become large.
 For small queries (less than 14 relations or the number of connected subgraphs is within the chosen budget of 10,000), the framework uses DPHyp. For medium queries (up too 100 relations), the framework firstly uses IKKBZ to linearize the search space into a linear ordering of relations and it restricts the DP algorithm to consider only connected subchains of this linear relation ordering. For large queries, it adopts the idea from Iterative DP: first constructing an execution plan using a greedy algorithm (Greedy Operator Ordering), and then improving that plan by running a more expensive optimization algorithm (the algorithm for medium queries) on subplans up to size k (and k can be changed to control the optimization time).
 This paper uses standard benchmarks: TPCH, TPCDS, LDBC BI, Join Order Benchmark, SQLite test suite. They also generate synthetic queries for scalability experiments.
 Comments
 This paper is well written and easy to read. It also offers a pretty useful related works section that presents almost all the methods for finding join order, and they essentially provide an hybrid approach that uses different methods for different cases.
 It provides a section of implementation details, which is pretty important for system papers.
 I think this paper not only is a comprehensive summary of previous studies on joins, but also provides a practical way to handle joins in different cases.
 Summary

Improving Join Reorderability with Compensation Operators
 Summary
 This paper presents a novel approach for join reordering problem for queries involving innerjoins, singlesided outerjoins, and/or antijoins, i.e., providing a more comprehensive enumeration of possible orders for join.
 There are two stateoftheart approaches for the join reorder problem: (1) TransformationBased Approach (TBA), which enumerates all valid join reorderings using the associativity and commutativity properties of the join operators; (2) CompensationBased Approach (CBA), which permits certain invalid join reorderings as long as they can be compensated to become valid. This paper proposes a algorithm based on the compensationbased approach, which proposes two new compensation operators to allow more rules for reorder joins.
 This paper presents the enumeration algorithm based on the reordering algorithm, and optimizes it to enable reuse of query subplans by finding the equivalence while considering the dependencies of query nodes.
 This paper uses TPCH as its workload and implements the algorithm in PostgreSQL.
 Comments
 This paper essentially provides a way to extend the search space of all possible joins and explain how to explore this search space.
 The idea of compensation is really a interesting topic in many areas, by introducing some compensation operations, we can apply some “invalid” transformations to provide better flexibility and performance improvement.
 Summary
Session 8: Spatial Data & Streams

DITA: Distributed InMemory Trajectory Analytics
 Summary
 This paper presents DITA, a distributed inmemory trajectory analytics system which support trajectory similarity search and join (both thresholdbased and KNNbased) with numerous similarity functions. The core idea is a filterverification framework which employs a triebased method to efficiently search similar trajectories. To distribute the computation, an partitioning method is proposed based on the trie indexing mechanism and a cost model is constructed to balance the workloads. The whole system is built on Spark.
 Comments
 This is my paper and I think it is good. It supports huge volume of trajectories and is significantly better than stateoftheart.
 Summary

Sketching Linear Classifiers over Data Streams
 Summary
 This paper presents the WeightMedian Sketch for learning linear classifiers over data streams that supports approximate retrieval of the most heavilyweighted features. In other words, it does dimension reduction in the streaming setting.
 It essentially adopts the idea of countsketch to save the gradients (for training the classifier) in a sketchlike structure. By “active set”, they simply introduce a heap to store heavy hitters (i.e., heavy weights).
 They show three applications: streaming explanation (outlier detection), network monitoring (identifying differences between concurrent traffic streams), streaming pointwise mutual information (finding correlation between events).
 Comments
 This paper is well written. It sort of “transfers” idea from one subarea to another subarea, I think this is interesting.
 It is important to show some real world applications in the paper.
 Summary
Written on June 15, 2018