Architecture of a Database System

Reference: Hellerstein, Joseph M., Michael Stonebraker, and James Hamilton. "Architecture of a database system." Foundations and Trends® in Databases 1.2 (2007): 141-259.

This is one of the several papers belong to suggested readings for Background of Readings in Database Systems, 5th Edition.

0. Abstract

Database Management Systems (DBMSs) are a ubiquitous and critical component of modern computing, and the result of decades of research and development in both academia and industry. Historically, DBMSs were among the earliest multi-user server systems to be developed, and thus pioneered many systems design techniques for scalability and reliability now in use in many other contexts. While many of the algorithms and abstractions used by a DBMS are textbook material, there has been relatively sparse coverage in the literature of the systems design issues that make a DBMS work. This paper presents an architectural discussion of DBMS design principles, including process models, parallel architecture, storage system design, transaction system implementation, query processor and optimizer architectures, and typical shared components and utilities. Successful commercial and open-source systems are used as points of reference, particularly when multiple alternative designs have been adopted by different groups.

1. Introduction

1.1 Relational Systems: The Life of a Query

Life of a Query

  • Connecting to Client Communications Manager
  • Assign a “thread of computation” by *Process Manager**
  • Invoke the code in the *Relational Query Processor**
  • The operators make calls to fetch data from Transactional Storage Manager
  • Send back results through Client Communications Manager

2. Process Models

2.1 Uniprocessors and Lightweight Threads

  • Process per DBMS Worker: scaling issue
  • Thread per DBMS Worker: without protection and isolation of OS, portability issue, scaling well
  • Process Pool: has all of the advantages of process per DBMS worker, with much less overhead

Shared Data Structures

  • Database I/O Requests: The Buffer Pool, implemented by shared memory (shared in thread per worker by design)
  • Log I/O Requests: The Log Tail, managed by a separate process or implemented by shared memory (share in thread per worker by design)
  • Client Communication Buffers: implemented by socket or client-side cursor caching.
  • Lock Table: similar with buffer pool

2.2 DBMS Threads

  • As the OS threads are not available at first, many DBMSs implemented their own proprietary, lightweight thread package.

2.3 Standard Practice

  • Summarized the process models supported by IBM DB2, MySQL, Oracle, PostgreSQL, and Microsoft SQL Server.

2.4 Admission Control

  • Thrashing: the DBMS cannot keep the “working set” of database pages in the buffer pool, and spends all its time replacing pages (as the result of too much pressure on memory or locks).
  • Admission control, which does not accept new work unless sufficient DBMS resources are available. With a good admission controller, a system will display graceful degradation under overload: transaction latencies will increase proportionally to the arrival rate, but throughput will remain at peak.
  • This can be implemented in two tiers: in the dispatcher to ensure that the number of client connections is kept below a threshold, or in the core DBMS relational query processor based on the estimated resource usage from the query optimizer. Many DBMSs use memory footprint and the number of active DBMS workers as the main criterion for admission control since memory pressure is typically the main cause of thrashing.

2.5 Discussion and Additional Material

  • New and fine-grained process models

3. Parallel Architecture: Processes and Memory Coordination

3.1 Shared Memory

  • All processors can access the same RAM and disk with roughly the same performance.
  • The dominant cost for DBMS customers is typically paying qualified people to administer high-end systems, i.e., Database Administrators (DBAs) and System Administrators.

3.2 Shared-Nothing

  • There is no way for a given system to directly access the memory or disk of another system.
  • Most common technique for shared-nothing systems: use horizontal data partitioning to allow each processor to execute on a portion of data independently of the others.
  • Typical data partitioning schemes: hash-based, range-based, round-robin, and hybrid (range-based + hash-based).
  • However, this simple partitioning scheme does not does not handle all issues in the DBMS, e.g., transaction completion, load balancing, certain maintenance tasks all require explicit cross-processor coordination.
  • To address partial failure, we may employ redundancy schemes ranging from full database failover (requiring double the number of machines and software licenses) to fine-grain redundancy like chained declustering.

3.3 Shared Disk

  • Shared-disk has become more common in recent years with the increasing popularity of Storage Area Networks (SAN). A SAN allows one or more logical disks to be mounted by one or more host systems making it easy to create shared disk configurations.
  • One potential advantage of shared-disk over shared-nothing systems is their lower cost of administration, since you don’t need to consider partitioning tables, and the failure of a single DBMS processing node does not affect the other nodes’ ability to access the entire database.
  • Shared-disk systems depend upon a distributed lock manager facility, and a cache-coherency protocol for managing the distributed buffer pools. These are complex software components, and can be bottlenecks for workloads with significant contention.

3.4 NUMA

  • Non-Uniform Memory Access (NUMA) systems provide a shared-memory programming model over a cluster of systems with independent memories. Each system in the cluster can access its own local memory quickly, whereas remote memory access across the high-speed cluster interconnect is somewhat delayed.
  • They are much easier to program than shared-nothing clusters, and also scale to more processors than shared-memory systems by avoiding shared points of contention such as shared-memory buses.
  • To avoid serious memory access bottlenecks caused by non-uniform memory access, we may employ optimizations: (1) when allocating memory for use by a processor, use memory local to that processor (avoid use of far memory); (2) ensure that a given DBMS worker is always scheduled if possible on the same hardware processor it was on previously.

3.5 DBMS Threads and Multi-processors

  • When running DBMS threads within multiple processes, to avoid unbalanced workloads (some processors are busy while others are idle), DBMSs must implement thread migration between processes.
  • A good rule of thumb is to have one process per physical processor.

3.6 Standard Practice

  • All major commercial DBMS providers support shared memory parallelism.

3.7 Discussion and Additional Material

  • The new generation of “many-core” architectures that are coming from the processor vendors may require DBMS architectures to be re-examined to meet the performance potential of the hardware.
  • For large data-centers with tens of thousands of computers, administrations and failure handling should be made automated.

4. Relational Query Processor

  • A relational query processor takes a declarative SQL statement, validates it, optimizes it into a procedural dataflow execution plan, and (subject to admission control) executes that dataflow program on behalf of a client program. The client program then fetches (“pulls”) the result tuples, typically one at a time or in small batches.
  • In general, relational query processing can be viewed as a single-user, single-threaded task, while concurrency control is managed transparently by lower layers of the system. The only exception to this rule is when the DBMS must explicitly “pin” and “unpin” buffer pool pages while operating on them so that they remain resident in memory during brief, critical operations.
  • Data Definition Language (DDL) are typically not processed by the query optimizer. They are usually implemented procedurally in static DBMS logic through explicit calls to the storage engine and catalog manager.

4.1 Query Parsing and Authorization

Given an SQL statement, the main tasks for the SQL Parser are to

  • Check that the query is correctly specified
  • Resolve names and references
  • Convert the query into the internal format used by the optimizer
  • Verify that the user is authorized to execute the query

Standard Operation Process

  1. Given an SQL query, the parser first considers each of the table references in the FROM clause. It canonicalizes table names into a fully qualified name of the form server.database.schema.table. This is also called a four part name.
  2. After canonicalizing the table names, the query processor then invokes the catalog manager to check that the table is registered in the system catalog. It may also cache metadata about the table in internal query data structures during this step. Based on information about the table, it then uses the catalog to ensure that attribute references are correct. Additional standard SQL syntax checks are also applied, including the consistent usage of tuple variables, the compatibility of tables combined via set operators, the usage of attributes in the SELECT list of aggregation queries, the nesting of sub-queries, and so on.
  3. If the query parses successfully, the next phase is authorization checking to ensure that the user has appropriate permissions (SELECT/DELETE/INSERT/UPDATE) on the tables, user defined functions, or other objects referenced in the query.
  4. If a query parses and passes validation, then the internal format of the query is passed on to the query rewrite module for further processing.

4.2 Query Rewrite

  • The query rewrite module, or rewriter, is responsible for simplifying and normalizing the query without changing its semantics. It can rely only on the query and on metadata in the catalog, and cannot access data in the tables.
  • The query rewrite module usually outputs an internal representation of the query in the same internal format that it accepted at its input.
  • The rewriter in many commercial systems is a logical component whose actual implementation is in either the later phases of query parsing or the early phases of query optimization.

The rewriter’s main responsibilities are:

  • View expansion: handling views is the rewriter’s main traditional role.
  • Constant arithmetic evaluation
  • Logical rewriting of predicates
  • Semantic optimization: e.g., redundant join elimination
  • Subquery flattening and other heuristic rewrites: e.g., query normalization

4.3 Query Optimizer

  • The query optimizer’s job is to transform an internal query representation into an efficient query plan for executing the query. A query plan can be thought of as a dataflow diagram that pipes table data through a graph of query operators.
  • In many systems, queries are first broken into SELECT-FROM-WHERE query blocks, the optimization of each individual query block is then done, and finally these various blocks are then stitched together.
  • To enable cross-platform portability, every major DBMS now compiles queries into some kind of interpretable data structure. The only difference between them is the intermediate form’s level of abstraction, e.g., relational algebra expressions, or “op-codes” closer in spirit to Java byte codes.

Among the main extensions of Selinger’s paper:

  • Plan space: the System R optimizer constrained its plan space somewhat by focusing only on “left-deep” query plans (where the right-hand input to a join must be a base table), and by “postponing Cartesian products” (ensuring that Cartesian products appear only after all joins in a dataflow). In commercial systems today, it is well known that “bushy” trees (with nested right-hand inputs) and early use of Cartesian products can be useful in some cases.
  • Selectivity estimation: Selinger paper used simple table and index cardinality, while most systems today use histograms and other summary statistics, and some also use sampling techniques.
  • Search Algorithms: dynamic programming optimization and “top-down” search scheme based on the techniques used in Cascades.
  • Parallelism: two-phase optimization, in which first a traditional single-system optimizer is invoked to pick the best single-system plan, and then this plan is scheduled across multiple processors or machines.
  • Auto-Tuning

A Note on Query Compilation and Recompilation

  • Query preparation is especially useful for form-driven, canned queries over fairly predictable data, which bypasses the overhead of parsing, rewriting, and optimizing.

4.4 Query Executor

  • Most modern query executors employ the iterator model.
  • PostgreSQL utilizes moderately sophisticated implementations of the iterators for most standard query execution algorithms.

Iterator Discussion

  • An important property of iterators is that they couple dataflow with control flow.
  • In most database applications, the performance metric of merit is time to query completion.
  • Parallelism and network communications can be encapsulated within special exchange iterators.

Where’s the Data?

  • In practice, each iterator is pre-allocated a fixed number of tuple descriptors, one for each of its inputs, and one for its output. A tuple descriptor is typically an array of column references, where each column reference is composed of a reference to a tuple somewhere else in memory, and a column offset in that tuple.
  • According to where the actual tuples being referenced are stored in memory, we have (1) buffer pool, BP-tuples; (2) heap, M-tuples.

Data Modification Statements

  • Halloween problem

4.5 Access Methods

  • Search argument (SARG in System R terminology)
  • Row References: physical row IDs, pointer

4.6 Data Warehouses

  • Workflow systems were constructed that would “scrape” data from operational OLTP systems and load it into a data warehouse. Such systems were branded “extract, transform, and load” (ETL) systems.

Bitmap Indexes

  • Efficient in both speed and storage
  • Expensive to update

Fast Load

  • Bulk loading may not be suitable for real-time applications now
  • MVCC is useful for historical queries

Materialized Views

  • Actual table corresponding to a logical view expression
  • Three aspects to Materialized View use: (a) selecting the views to materialize (automatic database tuning), (b) maintaining the freshness of the views, and (c) considering the use of materialized views in ad-hoc queries.

OLAP and Ad-hoc Query Support

  • Data cubes provide high performance for a predictable, limited class of queries. However, they are generally not helpful for supporting ad-hoc queries.

Optimization of Snowflake Schema Queries

  • Terminology: star schema, fact, dimension
  • Snowflake: a multi-level star schema

Data Warehousing: Conclusions

  • Greenplum (a parallelization of PostgreSQL)
  • Column stores have a huge advantage in the data warehouse space versus traditional storage engines in which the unit of storage is a table row.

4.7 Database Extensibility

  • Abstract Data Types
  • Structured Types and XML
  • Full-Text Search

4.8 Standard Practice

4.9 Discussion and Additional Material

  • A good starting point for query optimization research is Chaudhuri’s short survey
  • For query processing research, Graefe offers a very comprehensive survey
  • Rich statistical methods over large data sets
  • Data mining techniques

5. Storage Management

5.1 Spatial Control

  • Where to place the data
  • We may create a very large file as the storage layer. DBMS vendors typically no longer recommend raw storage, and few customers run in this configuration, because large files provide comparable performance.

5.2 Temporal Control

  • When to place the data
  • Performance issues: (1) speculative read/write in OS and database, (2) double buffering in OS and database.
  • In practice, throughput in a well-tuned transaction processing DBMS is typically not I/O-bound.

5.3 Buffer Management

  • The buffer pool is organized as an array of frames, where each frame is a region of memory the size of a database disk block. Blocks are copied into the buffer pool from disk without format change, manipulated in memory in this native format, and later written back.
  • Hash table, dirty bit, page replacement, pin count
  • Page replacement scheme: LRU-2

5.4 Standard Practice

5.5 Discussion and Additional Material

  • New storage hardware
  • Compression (which may be meaning for saving processor caches)
  • Sparse data storage
  • Column-oriented data storage

6. Transactions: Concurrency Control and Recovery

6.1 A Note on ACID

  • Atomicity: either all of a transaction’s actions commit or none do
  • Consistency: Given a definition of consistency provided by a set of constraints, a transaction can only commit if it leaves the database in a consistent state.
  • Isolation: two concurrent transactions will not see each other’s in-flight (not yet-committed) updates.
  • Durability: the updates of a committed transaction will be visible in the database to subsequent transactions independent of subsequent hardware or software errors, until such time as they are overwritten by another committed transaction.


  • Durability is typically implemented via logging and recovery.
  • Isolation and Atomicity are guaranteed by a combination of locking (to prevent visibility of transient database states), and logging (to ensure correctness of data that is visible).
  • Consistency is managed by runtime checks in the query executor: if a transaction’s actions will violate a SQL integrity constraint, the transaction is aborted and an error code returned.

6.2 A Brief Review of Serializability

  • Serializability: the sequence of interleaved actions for multiple committing transactions must correspond to some serial execution of the transactions. Isolation is the same idea from the point of view of a single transaction.

Concurrency Control Techniques

  • Strict two-phase locking (2PL): transactions acquire a shared lock on every data record before reading it, and an exclusive lock on every data item before writing it. All locks are held until the end of the transaction, at which time they are all released atomically. A transaction blocks on a wait-queue while waiting to acquire a lock.
  • Multi-Version Concurrency Control (MVCC): transactions do not hold locks, but instead are guaranteed a consistent view of the database state at some time in the past, even if rows have changed since that fixed point in time.
  • Optimistic Concurrency Control (OCC): multiple transactions are allowed to read and update on item without blocking. Instead, transactions maintain histories of their reads and writes, and before committing a transaction checks history for isolation conflicts they may have occurred; if any are found, one of the conflicting transactions is rolled back.

  • Most commercial relational DBMSs implement full serializability via 2PL.
  • In order to reduce locking and lock conflicts some DBMSs support MVCC or OCC, typically as an add-on to 2PL.

6.3 Locking and Latching

  • Each lock is associated with a transaction and each transaction has a unique transaction ID.
  • Lock table, transaction table
  • Deadlock detector checks wait-for cycles
  • As an auxiliary to database locks, lighter-weight latches are also provided for mutual exclusion for internal DBMS data structures.

Differences between Latches and Locks

  • Locks are kept in the lock table and located via hash tables; latches reside in memory near the resources they protect, and are accessed via direct addressing.
  • In a strict 2PL implementation, locks are subject to the strict 2PL protocol. Latches may be acquired or dropped during a transaction based on special-case internal logic.
  • Lock acquisition is entirely driven by data access, and hence the order and lifetime of lock acquisitions is largely in the hands of applications and the query optimizer. Latches are acquired by specialized code inside the DBMS, and the DBMS internal code issues latch requests and releases strategically.
  • Locks are allowed to produce deadlock, and lock deadlocks are detected and resolved via transactional restart. Latch deadlock must be avoided; the occurrence of a latch deadlock represents a bug in the DBMS code.
  • Latches are implemented using an atomic hardware instruction or, in rare cases, where this is not available, via mutual exclusion in the OS kernel.
  • Latch calls take at most a few dozen CPU cycles whereas lock requests take hundreds of CPU cycles.
  • The lock manager tracks all the locks held by a transaction and automatically releases the locks in case the transaction throws an exception, but internal DBMS routines that manipulate latches must carefully track them and include manual cleanup as part of their exception handling.
  • Latches are not tracked and so cannot be automatically released if the task faults.

Transaction Isolation Levels

  • READ UNCOMMITTED: a transaction may read any version of data, committed or not. This is achieved in a locking implementation by read requests proceeding without acquiring any locks.
  • READ COMMITTED: a transaction may read any committed version of data. Repeated reads of an object may result in different (committed) versions. This is achieved by read requests acquiring a read lock before accessing an object, and unlocking it immediately after access.
  • REPEATABLE READ: a transaction will read only one version of committed data; once the transaction reads an object, it will always read the same version of that object. This is achieved by read requests acquiring a read lock before accessing an object, and holding the lock until end-of-transaction.
  • SERIALIZABLE: full serializable access is guaranteed.

  • Phantom problem

6.4 Log Manager

  • Canonical reference: ARIES

Write-Ahead Logging

  • Each modification to a database page should generate a log record, and the log record must be flushed to the log device before the database page is flushed.
  • Database log records must be flushed in order; log record r cannot be flushed until all log records preceding r are flushed.
  • Upon a transaction commit request, a commit log record must be flushed to the log device before the commit request returns successfully.


  • Data objects are updated in place
  • Unpinned buffer pool frames can be “stolen” (and the modified data pages written back to disk) even if they contain uncommitted data
  • Buffer pool pages need not be “forced” (flushed) to the database before a commit request returns to the user.

  • Combing DIRECT, STEAL/NOT-FORCE with DIRECT NOT-STEAL/NOT-FORCE, in which pages are not stolen unless there are no clean pages remaining in the buffer pool, in which case system degrades back to a STEAL policy.
  • Logical operations v.s. physical operations: in practice, a mixture of physical and logical logging (so called “physiological” logging) is used. In ARIES, physical logging is generally used to support REDO, and logical logging is used to support UNDO.
  • Checkpoint: computing recovery log sequence number (LSN) periodically
  • Fuzzy checkpoint of ARIES: containing just enough information to initiate the log analysis process and to enable the recreation of main-memory data structures lost at crash time.

6.5 Locking and Logging in Indexes

Latching in B+-Trees

  • The key insight in these schemes is that modifications to the tree’s physical structure (e.g., splitting pages) can be made in a non-transactional manner as long as all concurrent transactions continue to find the correct data at the leaves.
  • Approaches: (1) conservative schemes; (2) latch-coupling schemes: the tree traversal logic latches each node before it is visited, only unlatching a node when the next node to be visited has been successfully latched.; (3) right-link schemes: adding a link from each node to its right-hand neighbor to minimize the requirement for latches and re-traversals.

Logging for Physical Structures

  • The main idea is that structural index changes need not be undone when the associated transaction is aborted.
  • Nested top actions in ARIES: allows the recovery process to “jump over” log records (redo-only ones) for physical structure modifications during recovery without any special-case code.

Next-Key Locking: Physical Surrogates for Logical Properties

  • An insertion of a tuple with index key k must allocate an exclusive lock on the next-key tuple that exists in the index, where the next-key tuple has the lowest key greater than k; a read transactions should get a shared lock on the next-key tuple in the index.
  • Next-key locking is an example of using a physical object (a currently-stored tuple) as a surrogate for a logical concept (a predicate).

6.6 Interdependencies of Transactional Storage

  • Write-ahead logging requires strict two-phase locking.
  • It is a significant intellectual and engineering challenge to take a textbook access method algorithm and implement a correct, high-concurrency, recoverable version in a transactional system. For this reason, most leading DBMSs still only implement heap files and B+-trees as transactionally protected access methods.
  • Buffer management is relatively well-isolated, but may incur much of the complexity of in concurrency and recovery.

6.7 Standard Practice

  • Most DBMSs use write-ahead logging for durability, and two-phase locking for concurrency control, while PostgreSQL uses MVCC.
  • MySQL’s default storage engine, MyISAM, only supports table-level locking, but is considered the high-performance choice for read-mostly workloads.
  • For read/write workloads, the InnoDB storage engine is recommended; it offers row-level locking.
  • Neither of the MySQL storage engines provide the well-known hierarchical locking scheme developed for System R, despite its universal use in the other database systems.

6.8 Discussion and Additional Material

  • Transaction mechanisms are by now an extremely mature topic, and most of the possible tricks have been tried in one form or another over the years.

7. Shared Components

7.1 Catalog Manager

  • The catalog records the names of basic entities in the system (users, schemas, tables, columns, indexes, etc.) and their relationships, and is itself stored as a set of tables in the database, which is an important lesson learned.
  • The basic catalog data is treated somewhat differently (e.g., de-normalizing, main-memory, special-case transactional tricks) from normal tables for efficiency reasons.

7.2 Memory Allocator

  • Context-based memory allocator
  • Memory contexts serves as a lower-level, programmer-controllable alternative to garbage collection.
  • Memory contexts can use semantic knowledge (via the context type) of how memory will be allocated and deallocated, and may call malloc() and free() accordingly to minimize OS overheads.

A Note on Memory Allocation for Query Operators

  • Different systems may or may not expect DBAs to control memory allocation

7.3 Disk Management System

  • Large DBMS installations, for example, commonly use Storage Area Network (SAN) SANs.
  • The use (and misuse) of RAID devices is a fact that commercial DBMSs must take into account.
  • Most enterprise DBMS storage is SAN-hosted today.

7.4 Replication Services

  • Physical Replication
  • Trigger-Based Replication
  • Log-Based Replication: most difficult but most useful

7.5 Administration, Monitoring, and Utilities

Online Utilities

  • Optimizer Statistics Gathering
  • Physical Reorganization and Index Construction
  • Backup/Export
  • Bulk Load
  • Monitoring, Tuning and Resource Governers
Written on December 22, 2017