What's Really New with NewSQL?Reference: Pavlo, Andrew, and Matthew Aslett. "What's Really New with NewSQL?." ACM Sigmod Record 45.2 (2016): 45-55.
This is one of the several papers belong to suggested readings for Course Introduction and History of Database Systems of CMU 15-721: Database Systems.
A new class of database management systems (DBMSs) called NewSQL tout their ability to scale modern on-line transaction processing (OLTP) workloads in a way that is not possible with legacy systems. The term NewSQL was first used by one of the authors of this article in a 2011 business analysis report discussing the rise of new database systems as challengers to these established vendors (Oracle, IBM, Microsoft). The other author was working on what became one of the first examples of a NewSQL DBMS. Since then several companies and research projects have used this term (rightly and wrongly) to describe their systems.
Given that relational DBMSs have been around for over four decades, it is justifiable to ask whether the claim of NewSQL’s superiority is actually true or whether it is simply marketing. If they are indeed able to get better performance, then the next question is whether there is anything scientifically new about them that enables them to achieve these gains or is it just that hardware has advanced so much that now the bottlenecks from earlier years are no longer a problem.
To do this, we first discuss the history of databases to understand how NewSQL systems came about. We then provide a detailed explanation of what the term NewSQL means and the different categories of systems that fall under this definition.
1. A Brief History of DBMSs
- mid 1960s, first DBMS, IBM’s IMS
- early 1970s, first relational DBMS, IBM’s System R and the University of California’s INGRES
- early 1980s, first commercial DBMSs, Sybase and Informix
- late 1980s and early 1990s, object-oriented DBMSs
- 1990s, MySQL and PostgreSQL
- 2000s, custom middleware to shard single-node DBMS over a cluster of less expensive machines, eBay’s Oracle-based cluster and Google/Facebook’s MySQL-based cluster
- mid to late 2000s, NoSQL, Google’s BigTable, Amazon’s Dynamo, Facebook’s Cassandra, PowerSet’s HBase and MongoDB
2. The Rise of NewSQL
- Definition of NewSQL: a class of modern relation DBMSs that seek to provide the same scalable performance of NoSQL for OLTP read-write workloads while still maintaining ACID guarantees for transactions.
- Targeted applications are characterized as executing read-write transactions that (1) are short-lived; (2) touch a small subset of data using index lookups; (3) are repetitive.
- A NewSQL system’s implementation has to use (1) a lock-free concurrency control scheme; (2) a shared-nothing distributed architecture.
3.1 New architecture
- Based on distributed architectures that operate on shared-nothing resources and contain components to support multi-node concurrency control, fault tolerance through replication, flow control and distributed query processing.
- Manages their own primary storage, either in-memory or on disk, thus allowing “send the query to the data”.
- Foremost downside is that many organizations are wary of adopting technologies that are too new and unvetted.
- Examples: Clustrix, CockroachDB, Google Spanner, H-Store, HyPer, MemSQL, NuoDB, SAP HANA, VoltDB.
3.2 Transparent Sharding Middleware
- Centralized middleware component routes queries, coordinates transactions, as well as manages data placement, replication, and partitioning across the nodes.
- Key Advantage: they are often a drop-in replacement for an application that is already using an existing single-node DBMS(i.e., MySQL proxy and Fabric).
- Disadvantages: (1) such systems still have to use a traditional disk-oriented DBMS on each node(e.g., MySQL, Postgres, Oracle), while previous research has shown that the legacy components of disk-oriented architectures is significant encumbrance that prevents these traditional DBMSs from scaling up to take advantage of higher CPU core counts and large memory capacity; (2) can also incur redundant query planning and optimization on sharded nodes for complex queries(i.e., once at the middleware and once on the individual DBMS nodes), but this does allow each node to apply their own local optimization on each query.
- Examples: AgilData Scalable Cluster(a.k.a, dbShards), MariaDB MaxScale, ScaleArc, ScaleBase.
- DBaaS provider is responsible for maintaining the physical configuration of the database, including system tuning(e.g., buffer pool size), replication and backups, while customers subscribe to a pricing tier that specifies the maximum resource utilization threshold(e.g., storage size, computation power, memory allocation) that the provider will guarantee.
- Examples: Amazon Aurora, ClearDB.
4. The State of The Art
4.1 Main Memory Storage
- Capacities and prices are such that it is affordable to store all but the largest OLTP databases entirely in memory, thus many of the components that are no longer needed, like a buffer pool manager or heavy-weight concurrency control schemes.
- History of Main Memory Database: 1980s, 1990s
- The ability to evict a subset of the database out to persistent storage to reduce its memory footprint, allowing the DBMS to support database that are larger than the amount of memory available without having to switch back to a disk-oriented architecture, e.g., H-Store’s anti-caching, VoltDB’s using of OS virtual memory paging, Project Siberia’s Bloom filter, and MemSQL’s columnar storage and log-structured storage to reduce the overhead of updates.
- Earlier distributed DBMSs never caught on for two reasons: (1) computing hardware in the 20th century was so expensive; (2) the application demand for a high-performance distributed DBMS was simply not there.
- Tables are horizontally divided into multiple fragments whose boundaries are based on the values of one(or more) of the table’s columns(i.e., the partitioning attributes), while DBMS is able to distribute the execution of a query to multiple partitions and then combine their results together into a single result.
- The database for many OLTP applications have a key property that makes them amenable to partitioning, their database schemas can be transposed into a tree-like structure where descendants in the tree have a foreign key relationship to the root. The tables are then partitioned on the attributes involved in these relationships such that all of the data for a single entity are co-located together in the same partition, thus allowing most transactions to only need to access data a a single partition.
- Heterogeneous Architecture DBMSs: NuoDB and MemSQL.
- Some of NewSQL systems support live migration, allowing the DBMS to move data between physical resources to re-balance and alleviate hotspots, or to increase/decrease the DBMS’s capacity without any interruption to service. There are two approaches to achieve this: (1) organize the database in many coarse-grained “virtual”(i.e., logical) partitions that are spread amongst the physical nodes; (2) perform more fine-grained re-balancing by redistributing individual tuples or groups of tuples through range partitioning.
4.3 Concurrency Control
- Almost all of the NewSQL systems based on new architectures eschew two-phase locking(2PL) schemes because the complexity of dealing with deadlocks. Instead, the current trend is to use variants of timestamp ordering(TO) concurrency control where the DBMS assumes that transactions will not execute interleaved operations that will violate serializable ordering.
- The most widely used protocol in NewSQL is decentralized multi-version concurrency control(MVCC) where the DBMS creates a new version of a tuple in the database when it is updated by a transaction. Other systems use a combination of 2PL and MVCC together, for example, MySQL’s InnoDB and Google’s Spanner.
- The only commercial NewSQL DBMS that is not using some MVCC variant is VoltDB, instead it schedules transactions to execute one-at-a-time at each partition, using a hybrid architecture where single-partition transactions are scheduled in a decentralized manner but multi-partition transactions are scheduled with a centralized coordinator.
- In general, there is no significantly new about currency control in NewSQL other than making these algorithms work well in the context of modern hardware and distributed operating environments.
4.4 Secondary Indexes
- A secondary index contains a subset of attributes from a table that are different than its primary key, allowing DBMS to support fast queries beyond primary key or partitioning key look-ups. The challenge with secondary indexes in a distributed DBMS is that they cannot always be partitioned in the same manner as with the rest of database.
- All of the NewSQL systems based on new architectures are decentralized and use partitioned secondary indexes, meaning that each node stores a portion of the index.
- Clustrix’s two-tier secondary index: (1) a replicated, coarse-grained(i.e., range-based) index at each node that maps values to partitions, allowing DBMS to route queries to the appropriate node using an attribute that is not the table’s partitioning attribute; (2) these queries will then access a second partitioned index at that node that maps exact values to tuples.
- Memcached can be used as external secondary index, but it requires the application to maintain the cache since the DBMSs will not automatically invalidate the external cache.
- To support strong consistence, the DBMS must use an atomic commitment(e.g., two-phase commit) to ensure that all replicas agree with the outcome of transactions, which has additional overhead and can lead to stalls if a node fails or if there is a network partition/delay. This is why NoSQL systems opt for a weakly consistent model(also called eventual consistency).
- All of the NewSQL systems support strongly consistent replication. There are two different execution models for how the DBMS propagates updates to replicas: (1) active-active replication, is where each replica node processes the same request simultaneously; (2) active-passive replication where a request is first processed at a single node and then the DBMS transfers the resultant state to the other replicas, which is implemented by most NewSQL systems since they use a non-deterministic concurrency control scheme.
- Spanner and CockroachDB provide a replication scheme that is optimized for strongly consistent replicas over the wide-area network(WAN) through a combination of atomic and GPS hardware clocks or hybrid clocks.
4.6 Crash Recovery
- Unlike traditional DBMSs where the main concern of fault tolerance is to ensure that no updates are lost, newer DBMSs must also minimize downtime.
- In a distributed DBMS with replicas, the traditional single-node approach that loads in the last checkpoint and then replays its write-ahead log(WAL) is not directly applicable, because the DBMS has continued t process transactions by promoting one of the slave nodes to the be new master. There are two potential ways to recover the previous master: (1) load in its last checkpoint and WAL from its local storage and then pull missed log entries; (2) discard its checkpoint and have system take a new one that the node will recover from to reduce the time for recovering.
- The NewSQL systems based on new architectures use a combination of off-the-shelf components(e.g., ZooKeeper, Raft) and their own custom implementations of existing algorithms(e.g., Paxos), which are standard procedures and technologies available in commercial distributed systems since the 1990s.
5. Future Trends
- The next trend for database applications in the near future is the ability to execute analytical queries and machine learning algorithms on freshly obtained data. Such workloads, colloquially known as “real-time analytics” or hybrid transaction-analytical processing(HTAP), seek to extrapolate insights and knowledge by analyzing a combination of historical data sets with new data.
- There are three approaches to supporting HTAP pipelines in a database application: (1) the most common is to deploy separate DBMSs: one for transactions and another for analytical queries; (2) another prevailing system design, known as the lambda architecture, is to use a separate batch processing system(e.g., Hadoop, Spark) to compute a comprehensive view on historical data, while simultaneously using a stream processing system(e.g., Storm, Spark Streaming) to provide views of incoming data. There are two major problems with the first two approaches: (1) the time it takes to propagate changes between the separate systems is often measured in minutes or even hours; (2) the administrative overhead of deploying and maintaining two different DBMSs is not trivial.
- The third approach(in the author’s opinion better) is to use a single HTAP DBMS that supports the high throughput and low latency demands of OLTP workloads, while also allowing for complex, longer running OLAP queries to operate on both hot(transactional) and cold(historical) data. What makes these newer HTAP systems different from legacy general-purpose DBMSs is that they incorporate the advancements from the last decade in the specialize OLTP(e.g., in-memory storage, lock-free execution) and OLAP(e.g., columnar storage, vectorized execution) systems, but within a single DBMS.
- Examples: SAP HANA and MemSQL.