Some Good Papers in SIGMOD 2018 (Industry Sessions)

In this post, I am going to write my reading report for industry papers in SIGMOD 2018. I will go through all the papers I feel interesting and write their basic ideas, methods and my personal evaluations. Please let me know if you find anything inappropriate.

Session 1: Adaptive Query Processing

Computation Reuse in Analytics Job Service at Microsoft
- Summary
  1. This paper presents a computation reuse framework, CLOUDVIEWS, addressing the computation overlap problem in Microsoft’s SCOPE job service.
  2. To materialize overlapping computations over recurring jobs (jobs that appear repeatedly, have template changes in each instance, and operate over new data each time), they use normalized signatures (which normalize the recurring changes) to identify subgraphs across recurring instances for materialization and precise signature to identify subgraphs within a recurring instance for reuse.
  3. To provide accurate estimation for materialization, they use a feedback loop which extracts runtime statistics from the previous runs (by enumerating all possible subgraphs of all jobs seen within a time window in the past).
  4. For the runtime, they build a metadata service for managing the information of materialized views, which provides looking-up (with inverted index) and saving (with exclusive lock). To prevent from multiple jobs with the same overlapping computation being scheduled concurrently, they reorder recurring jobs in the client job submission systems.
  5. Computation reuse actually finds hidden redundancies, promotes data sharing across teams, provides better reliability and better cost estimates.
- Comments
  1. This paper is well written and structured. As an industry paper, it depicts the challenge, architecture, interface in a clear way. The section for “Lessons Learned” is really interesting.

Session 2: Real-time Analytics

Pinot: Realtime OLAP for 530 Million Users
- Summary
  1. This paper presents Pinot, a single system at Linkedin serving tens of thousands of analytical queries per second, while offering near-realtime data ingestion from streaming data sources and handling the operational requirements of large web properties.
  2. Pinot is used to power customer facing applications such as “Who viewed my profile” (WVMP) and newsfeed customization which requires very low latency, as well as internal business analyst dashboards where users want to slice and dice data.
  3. Pinot follows the lambda architecture, and supports near-realtime data ingestion by reading from Kafka and offline data from Hadoop. Zookeeper is used as persistent metadata store and as the communication mechanism between nodes in the cluster.
  4. Pinot uses fixed schema for tables, and tables are composed of segments. Segments are replicated and data in segments is immutable. Data orientation in Pinot segments is columnar, and various encodings are supported.
  5. Pinot has been designed as a share-nothing architecture with stateless instances to be able to run on cloud infrastructure.
- Comments
  1. This paper is well written and structured. It provides a detailed and comprehensive solution to realtime OLAP analytics.
Robust, Scalable, Real-Time Event Time Series Aggregation at Twitter
- Summary
  1. This paper presents TSAR (TimeSeries AggregatorR), a robust, scalable, real-time event time series aggregation framework built primarily for engagement monitoring: aggregating interactions with Tweets, segmented along a multitude of dimensions such as device, engagement type, etc.
  2. TSAR is built on top of Summingbird, an open-source framework for integrating batch and online MapReduce computations.
  3. TSAR relies on Twitter’s Manhattan key-value store to provide access for high-load dashboard applications. The output of batch jobs is first written to HDFS, and then bulk imported into Manhattan.
- Comments
  1. This paper mentions the Lambda Architecture, and the Kappa Architecture - where everything is a stream and therefore there is no distinction between batch and stream processing. In the future, it seems that there will be a unified architecture incorporating these two architectures.
TcpRT: Instrument and Diagnostic Analysis System for Service Quality of Cloud Databases at Massive Scale in Real-time
- Summary
  1. This paper presents TcpRT, the instrument and diagnosis in Alibaba Cloud RDS for real-time anomaly detection.
  2. The overall workflow/architecture TcpRT is: the kernel module collects the metrics of each query, then sends them to a local process for aggregations, the results are written to Kafka, and ETL jobs are triggered to be running on JStorm to process the data in Kafaka and transform them into time series. These time series outputs are cached in Redist cluster for a while and then they are flushed to HybridDB. The automatic anomaly detection module scans time series data in Redis Cluster and HybridDB periodically.
  3. They implement the collector module on top of the TCP congestion control module in Linux kernel, and they use a customized debugfs (a high performance in-memory filesystem) to transfer the collected trace records to the user space for further aggregation and transmission.
  4. To support exact-one semantics, they have an independent offline repair job running to replay the data within the failure time with eventual consistency.
  5. As for anomaly detection, they use a self-adjustable Cauchy distribution statistical model from historical performance data for each DB instance. They also refer to the network topology and anomalous events (e.g., out-of-order, retransmissions) to detect network issues.
- Comments
  1. This paper is well written and it provides a detailed solution to real-time monitoring. The visualization/reporting component is as important as the infrastructure.

Session 3: DB systems in the Cloud and Open Source

Amazon Aurora: On Avoiding Distributed Consensus for I/Os, Commits, and Membership Changes
- Summary
  1. This paper presents Amazon Aurora, a high-throughput cloud-native relational database, which pushes redo processing to a multi-tenant scale-out storage service.
  2. Aurora consists of database instances which act as SQL endpoints and include most of the components of a database kernel (query processing, access methods, transactions, locking, buffer caching and undo management) and storage fleet which takes over redo logging, materialization of data blocks, garbage collection and backup/restore.
  3. Aurora uses quorum model for read/write, where a system employs \(V\) copies and must obey two rules: (1) \(V_r + V_w > V\); (2) \(V_w > V/2\). Storage is partitioned into segments which are the minimum unit of failure, and they are small with no more than 10GB of addressable data blocks.
  4. Each storage node in Aurora maintains a local Segment Complete LSN (SCL), and piggybacks it to the database instance which advances the Protection Group Complete LSN (PGCL) if four of six members of the protection group of storage nodes have advanced. They further have a Volume Complete LSN (VCL) on top of PGCL.
  5. For crash recovery, it simply computes VCL and annuls any log records beyond VCL, and it uses volume epoch for establishing write quorum.
  6. Aurora doesn’t do quorum read as the database instance knows which segments have the last durable version of a data block and can request it directly.
  7. Aurora uses epoch for membership change, and they make two transitions (add the new member and then discard the failed member) to make each transition is reversible.
  8. In Aurora, a protection group is composed of three full segments, which store both redo log records and materialized data blocks, and three tail segments, which contain redo log records alone. By doing this, it gives a smaller cost amplification and provides better flexibility.
- Comments
  1. This paper is well written. However, it doesn’t talk about how these database instances communicate.
  2. This paper presents a elegant and effective solution to scale the relation database on cloud, basically it isolates the complex logic and heavy computation/storage to better use the cloud environment.
Survivability of Cloud Databases - Factors and Prediction
- Summary
  1. This paper presents a solution to predicting how long public cloud databases survive before being dropped on Azure SQL DB.
  2. They use Kaplan-Meier (KM) estimator to estimate the survival curve empirically.
  3. They formulate the problem of predicting whether a database will be live more than 30 days given 2 days of telemetry data. This is a classification problem and they learn a random forest for prediction. Features include creation time, server and database names (patterns), database size, edition and performance level, subscription type and subscription history.
- Comments
  1. This paper is well written. Although the method is relatively simple, it researches around an interesting and useful problem in cloud databases, which is like predicting user churn rate in subscription-based websites.

Graph Databases & Query Processing on Modern Hardware

RAPID: In-Memory Analytical Query Processing Engine with Extreme Performance per Watt
- Summary
  1. This paper presents RAPID, a relational query processing engine designed to support modern analytical workloads with an emphasis on architecture-conscious performance at lower power consumption compared to existing database systems. RAPID provides a novel design from scratch with hardware aware data/storage model, query optimizations and data processing operators.
  2. For the hardware architecture, RAPID adopts a Data Processing Unit (DPU) consisting of 32 data processing cores (dpCore), an on-chip programmable data movement engine Data Movement System (DMS), and a hardware block called the Atomic Transaction Engine (ATE) to communicate among dpCores efficiently.
  3. The RAPID software is integrated into the host DBMS by offloading the query to RAPID for execution and sending results back, therefore durability and persistence are provided by the host database system.
  4. RAPID stores the entire data in main-memory in columnar format, and it extensively uses decimal scaled binary number (DSB) encoding, dictionary encoding and run length encoding.
  5. RAPID query execution framework adopts: (1) push-based execution model to avoid deep call stacks and save valuable resources (e.g., instruction caches and program stack memory); (2) an actor model for parallelism; (3) hardware-aware design for relational data access; (4) vectorized query processing.
  6. RAPID’s query compiler and optimizer is a cost-based physical query optimizer working on top of the logical query optimizations by the host database. It uses “task” as the materialization point to stage pipelines. It also proposes some optimizations on partitioning and has optimized implementations for data processing operators. RAPID implements hash join by using a hash join kernel optimized for the DPU and DMEM.
- Comments
  1. This paper is well written. It points out a fruitful direction for future research: co-design of hardware and software for specific applications (e.g., database and deep learning).

Written on February 16, 2019

Some Good Papers in SIGMOD 2018 (Industry Sessions)

Session 1: Adaptive Query Processing

Computation Reuse in Analytics Job Service at Microsoft

Session 2: Real-time Analytics

Pinot: Realtime OLAP for 530 Million Users

Robust, Scalable, Real-Time Event Time Series Aggregation at Twitter

TcpRT: Instrument and Diagnostic Analysis System for Service Quality of Cloud Databases at Massive Scale in Real-time

Session 3: DB systems in the Cloud and Open Source

Amazon Aurora: On Avoiding Distributed Consensus for I/Os, Commits, and Membership Changes

Survivability of Cloud Databases - Factors and Prediction

Graph Databases & Query Processing on Modern Hardware

RAPID: In-Memory Analytical Query Processing Engine with Extreme Performance per Watt