A Survey of B-Tree Locking Techniques

Reference: Graefe, Goetz. "A survey of B-tree locking techniques." ACM Transactions on Database Systems (TODS) 35.3 (2010): 16.
This is one of the papers in the suggested readings on Index Locking & Latching for CMU 15-721: Database Systems.
B-trees have been ubiquitous in database management systems for several decades, and they are used in other storage systems as well. Their basic structure and basic operations are well and widely understood including search, insertion, and deletion. Concurrency control of operations in B-trees, however, is perceived as a difficult subject with many subtleties and special cases. The purpose of this survey is to clarify, simplify, and structure the topic of concurrency control in B-trees by dividing it into two subtopics and exploring each of them in depth.
1.1 Historical Background
- B+-Tree: all data records are in leaf nodes and keys in non-leaf “interior” nodes act merely as separators enabling search and other operations but not carrying logical database contents.
3. Two Forms of B-Tree Locking
- Concurrency control among concurrent database transactions querying or modifying database contents and its representation in B-tree indexes.
- Concurrency control among concurrent threads modifying the B-tree data structure in memory, including in particular images of disk-based B-tree nodes in the buffer pool.
3.1 Locks and Latches
- Locks separate transactions using read and write locks on pages, on B-tree keys (key value locking), or even on gaps (open intervals) between keys (key range locking).
- Latches separate threads accessing B-tree pages cached in the buffer pool, the buffer pool’s management tables, and all other in-memory data structures shared among multiple threads.
- Lock acquisition and release are expensive, often thousands of CPU cycles, partly due to cache faults in the lock manager’s hash table and its linked lists. Latch acquisition and release, by contrast, may require only tens of instructions, usually with no additional cache faults, since a latch can be embedded in the data structure it protects.
3.2 Recovery and B-Tree Locking
- While waiting, no latches are required, but retaining locks is essential to guarantee a local transaction’s ability to abide by the global coordinator’s final decision.
- During recovery without concurrent execution of new transactions, locks are not required, because concurrency control during forward processing prior to the system crash already ensured that active transactions do not conflict. However, latches are as important during recovery as during normal forward processing if recovery employs multiple threads and shared data structures such as the buffer pool.
- Forward recovery may be faster and require less log volume than rollback.
3.3 Lock-free B-Trees
- An in-memory format for which pessimistic concurrency control in the form of latches is not required: multiple in-memory copies and atomic updates of in-memory pointers might enable such an implementation.
- Avoidance of locks for coordination of user transactions: a typical implementation mechanism would require multiple versions of records based on multi-version concurrency control.
- Latching and locking serve different functions for B-tree indexes in database systems, not only during normal transaction processing, but also during off-line activities such as crash recovery.
4. Protecting a B-Tree’s Physical Structure
- The simplest form of a latch is a mutex (mutual exclusion lock), in which any access precludes concurrent access. Latches with shared and exclusive (read and write) modes are common in database systems.
- Since latches protect in-memory data structures, the data structure representing the latch can be embedded in the data structure being protected.
- Latch duration is usually very short and amenable to hardware support and encapsulation in transactional memory.
- Access to the lock table and its components is protected by latches in the data structures.
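As a concrete illustration of the points above, here is a minimal shared/exclusive latch sketched in Python and embedded directly in the node it protects. All names are illustrative; a real database system would implement this with atomic instructions rather than a condition variable, and would provide no deadlock detection.

```python
import threading

class Latch:
    """Minimal shared/exclusive latch (illustrative sketch only).
    No deadlock detection: callers must honor a fixed acquisition
    order, as the survey notes for real latches."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0      # number of shared holders
        self._writer = False   # exclusive holder present?

    def acquire_shared(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_shared(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_exclusive(self):
        with self._cond:
            while self._writer or self._readers > 0:
                self._cond.wait()
            self._writer = True

    def release_exclusive(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()

class Node:
    """A B-tree node image in the buffer pool; the latch is embedded
    in the very structure it protects, as the survey describes."""
    def __init__(self):
        self.latch = Latch()
        self.keys = []
```

Because the latch lives inside the node, acquiring it touches no extra cache lines beyond the node itself, which is the source of the cost gap versus lock-manager calls noted above.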
Latches ensure the consistency of data structures accessed by multiple concurrent threads. All of the following issues concern the consistency of in-memory data structures, including images of disk pages in the buffer pool.
- A page image in the buffer pool must not be modified (written) by one thread while it is interpreted (read) by another thread.
- While following a pointer (page identifier) from one page to another, for example from a parent node to a child node in a B-tree index, the pointer must not be invalidated by another thread. This issue requires a more refined solution than the first, because it is not advisable to retain a latch while performing I/O.
- “Pointer chasing” applies not only to parent-child pointers but also to neighbor pointers, for example in a chain of leaf pages during a range scan or while searching for the key to lock in key range locking. Recall that latches usually rely on developer discipline for deadlock avoidance, not on automatic deadlock detection and resolution.
- During a B-tree insertion, a child node may overflow and require an insertion into its parent node, which may thereupon also overflow and require an insertion into the child’s grandparent node.
4.2 Lock Coupling
- The second issue in Section 4.1 requires retaining the latch on the parent node until the child node is latched, which is traditionally called “lock coupling” (but a better term is “latch coupling”).
- The third issue: asynchronous read-ahead may reduce the frequency of buffer faults, while deadlock avoidance among scans in opposite directions requires that the latch acquisition code provide an immediate failure mode.
- Five approaches: (1) latching all nodes in exclusive mode; (2) performing the root-to-leaf search with shared latches and attempting an upgrade to an exclusive latch when necessary; (3) using “update” or “upgrade” latches in addition to traditional shared and exclusive latches; (4) splitting nodes pro-actively during a root-to-leaf traversal for an insertion; (5) protecting the initial root-to-leaf search with shared latches, aborting this search when a node requires splitting, restarting a new one, and upon reaching the node requiring a split, acquiring an exclusive latch and performing the split.
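The coupling protocol itself can be sketched as follows. This is an illustrative Python sketch with a hypothetical node layout; a plain mutex stands in for the latch (the simplest latch form, per Section 4), and buffer-pool I/O is ignored.

```python
import bisect
import threading

class Node:
    """Hypothetical in-memory B-tree node. The latch -- here a plain
    mutex -- is embedded in the node it protects."""
    def __init__(self, keys, children=None):
        self.latch = threading.Lock()
        self.keys = keys                 # separators (interior) or entries (leaf)
        self.children = children or []   # empty list means leaf

def search_with_latch_coupling(root, key):
    """Root-to-leaf descent with latch coupling: the parent's latch is
    held until the child's latch has been acquired, so the parent-to-
    child pointer cannot be invalidated in between (the second issue
    of Section 4.1)."""
    node = root
    node.latch.acquire()
    while node.children:
        i = bisect.bisect_right(node.keys, key)
        child = node.children[i]
        child.latch.acquire()    # latch the child first ...
        node.latch.release()     # ... only then release the parent
        node = child
    found = key in node.keys
    node.latch.release()
    return found
```

At no point is more than a parent-child pair latched, which keeps latch footprints small but, as noted above, still demands a fixed acquisition order for deadlock avoidance.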
4.3 Blink-Trees
- A Blink-tree divides a node split into two independent steps: the first step creates the high fence key and a new right neighbor, and the second step posts the high fence key in the parent.
- Advantage: allocation of a new node and its initial introduction into the B-tree is a local step, affecting only one preexisting node.
- Disadvantage: search may be a bit less efficient, a solution is needed to prevent long linked lists among neighbor nodes during periods of high insertion rates, and verification of a B-tree’s structural consistency is more complex and perhaps less efficient.
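The two-step split can be sketched as follows. This is an illustrative Python sketch: field names such as `right` and `high_fence` are assumptions, and all latching and logging are omitted.

```python
import bisect

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []
        self.right = None        # neighbor pointer to the right sibling
        self.high_fence = None   # upper bound of this node's key range

def split_step_one(node):
    """Step 1 (local to one preexisting node): move the upper half of
    the keys into a new right neighbor and record the high fence key.
    The parent is untouched; a search reaching `node` with a key above
    the fence simply follows the `right` pointer."""
    mid = len(node.keys) // 2
    fence = node.keys[mid]
    new_right = Node(node.keys[mid:])
    new_right.right, new_right.high_fence = node.right, node.high_fence
    node.keys = node.keys[:mid]
    node.right, node.high_fence = new_right, fence
    return fence, new_right

def split_step_two(parent, fence, new_right):
    """Step 2, possibly performed later and by a different thread:
    post the fence key and the new child into the parent."""
    i = bisect.bisect_left(parent.keys, fence)
    parent.keys.insert(i, fence)
    parent.children.insert(i + 1, new_right)
```

Until step 2 runs, searches pay one extra hop through the neighbor pointer, which is exactly the efficiency trade-off listed as a disadvantage above.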
4.4 Load Balancing and Reorganization
- In many implementations, removal of nodes, load balancing among nodes, etc. are omitted in user transactions and left to asynchronous facilities.
- There are corresponding optimizations in logging and recovery: (1) avoiding logging the page contents by careful write ordering; (2) employing “forward recovery” to complete a reorganization action after system restart instead of rolling it back.
- Latches coordinate multiple execution threads while they access and modify database contents and its representation.
- In practice, modes beyond shared and exclusive are avoided, and deadlock avoidance is favored over deadlock detection.
- Latches should be held only for very short periods. None should be held during disk I/O. Thus, all required data structures should be loaded and pinned in the buffer pool before a latched operation begins.
5. Protecting a B-Tree’s Logical Contents
- For serializability, read locks are retained until end-of-transaction. Write locks are always retained until end-of-transaction in order to ensure the ability to roll back all changes if the transaction aborts.
- Weaker retention schemes exist for both read locks and write locks. Shorter read locks lead to weak transaction isolation levels; shorter write locks may lead to cascading aborts among transactions that read and modify the same database item.
5.1 Key Range Locking
- Key range locking is a special form of predicate locking, but neither pure predicate locking nor the more practical precision locking has been adopted in major products.
- In the simplest form of key range locking, a key and the gap to the neighbor are locked as a unit.
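A sketch of this simplest form follows. The conventions here are assumptions for illustration: the gap between two keys is covered by the lock on the next higher existing key (a next-key convention), an artificial highest key avoids neighbor navigation (cf. Section 5.2), and conflict handling is omitted entirely.

```python
import bisect

class KeyRangeLockTable:
    """Sketch of the simplest form of key range locking: each lock
    covers a key together with the gap below it as one unit. Names
    and layout are illustrative, not from the survey."""

    INF = float("inf")   # artificial highest possible key

    def __init__(self, keys):
        self.keys = sorted(keys) + [self.INF]   # existing B-tree keys
        self.locks = {}                         # key -> set of txn ids

    def lock_for_insert(self, txn, new_key):
        """Next-key locking: an insertion into a gap locks the unit of
        the next higher existing key, which covers that gap."""
        i = bisect.bisect_right(self.keys, new_key)
        target = self.keys[i]
        self.locks.setdefault(target, set()).add(txn)
        return target
```

Because the key and the gap form one lockable unit, a single lock protects a phantom-prone gap, at the cost of lower concurrency than designs that lock key and gap separately (Section 5.3).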
5.2 Key Range Locking and Ghost Records
- In many B-tree implementations, the user transaction requesting a deletion merely marks the record as invalid (a ghost record). Space reclamation happens in an asynchronous clean-up transaction.
- Advantages: (1) avoiding space allocation during rollback; (2) not needing to log the deleted record’s contents in the user transaction, leaving physical removal to the clean-up transaction; (3) requiring a lock until end-of-transaction only on the key but not on the neighboring gaps.
- Each B-tree may include the lowest and highest possible keys, such that key range locking never requires navigation to neighbor nodes.
- In addition to deletion, ghost records can also improve the performance and concurrency of insertions.
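Ghost-based deletion can be sketched as follows. The record layout and function names are hypothetical, and locking and logging are omitted.

```python
class Record:
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.ghost = False   # ghost bit: logically deleted, physically present

def logical_delete(leaf, key):
    """User-transaction deletion via ghost record: only the ghost bit
    flips, so rollback is a bit flip too, and only the key (not the
    neighboring gaps) needs an end-of-transaction lock."""
    for rec in leaf:
        if rec.key == key and not rec.ghost:
            rec.ghost = True
            return True
    return False

def cleanup(leaf):
    """Asynchronous clean-up: physically reclaim ghost records. It
    changes no logical database contents."""
    leaf[:] = [rec for rec in leaf if not rec.ghost]
```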
5.3 Key Range Locking as Hierarchical Locking
- Hierarchical or multi-granularity locking makes it possible to cover a key value and a neighboring gap with a single lock.
- Before any S or IS lock is taken on any item, an IS lock is required on the larger item containing it. X and IX locks work similarly, and an IX lock also permits acquiring an S lock.
- The large granularity of locking is the half-open interval; the small granularities are the key value and the open interval. A drawback of this design is that locking only a key (or only an open interval) requires two invocations of the lock manager (one for the half-open interval, one for the key); additional, automatically derived lock modes can reduce this to a single invocation.
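The two lock-manager invocations can be sketched with the standard multi-granularity compatibility matrix (SIX omitted for brevity; all names and the flat lock-table layout are illustrative, and waiting/queuing is omitted).

```python
# Standard multi-granularity lock compatibility: key is
# (held mode, requested mode).
COMPAT = {
    ("IS", "IS"): True,  ("IS", "IX"): True,  ("IS", "S"): True,  ("IS", "X"): False,
    ("IX", "IS"): True,  ("IX", "IX"): True,  ("IX", "S"): False, ("IX", "X"): False,
    ("S",  "IS"): True,  ("S",  "IX"): False, ("S",  "S"): True,  ("S",  "X"): False,
    ("X",  "IS"): False, ("X",  "IX"): False, ("X",  "S"): False, ("X",  "X"): False,
}

def acquire(lock_table, txn, item, mode):
    """Grant `mode` on `item` to `txn`, or raise on a conflict with a
    lock held by another transaction (no queuing in this sketch)."""
    for other_txn, other_mode in lock_table.get(item, []):
        if other_txn != txn and not COMPAT[(other_mode, mode)]:
            raise RuntimeError(f"conflict on {item}: {other_mode} vs {mode}")
    lock_table.setdefault(item, []).append((txn, mode))

def lock_key_shared(lock_table, txn, interval, key):
    """Reading just the key value takes two lock-manager invocations:
    an intention lock (IS) on the half-open interval (the large
    granule), then S on the key itself (the small granule)."""
    acquire(lock_table, txn, interval, "IS")
    acquire(lock_table, txn, key, "S")
```

Two readers of the same key coexist (IS and S are self-compatible), while an X request on the interval fails against their IS locks, which is exactly the conflict the hierarchy is meant to detect cheaply.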
5.4 Locking in Non-unique Indexes
- In non-unique indexes, key value locking may lock each value (including its entire cluster of row identifiers) or it may lock each unique pair of value and row identifier. The former saves lock requests in search queries, while the latter may permit higher concurrency during updates.
5.5 Increment Lock Modes
- The combination of read lock and increment lock is equivalent to a write lock, because it excludes all other transactions from holding any kind of lock at the same time.
- Key range locking is the technique of choice for concurrency control among transactions.
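The increment lock mode and its interaction with S and X can be sketched as a compatibility check. The matrix is illustrative but follows the description above: increments commute with each other, so E is compatible with E, but neither reads nor general writes may interleave with increments.

```python
# Compatibility among shared (S), exclusive (X), and increment (E)
# lock modes; key is (held mode, requested mode).
COMPAT = {
    ("S", "S"): True,  ("S", "X"): False, ("S", "E"): False,
    ("X", "S"): False, ("X", "X"): False, ("X", "E"): False,
    ("E", "S"): False, ("E", "X"): False, ("E", "E"): True,
}

def can_grant(holders, txn, mode):
    """holders: list of (txn, mode) already granted on the item.
    A transaction never conflicts with its own locks."""
    return all(t == txn or COMPAT[(m, mode)] for t, m in holders)
```

A transaction holding both S and E on an item blocks every request by other transactions, which demonstrates the claim above that the combination of a read lock and an increment lock is equivalent to a write lock.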
6. Future Directions
- Some implementation techniques need to be revised for optimal use of modern many-core processors.
7. Summary and Conclusions
- Two levels: (1) concurrent threads accessing in-memory data structures, coordinated with latches; (2) concurrent transactions accessing database contents, coordinated with locks.
- Latches support a limited set of modes such as shared and exclusive, they do not provide advanced services such as deadlock detection or escalation, and they can often be embedded in the data structures they protect. Therefore, their acquisition and release can be very fast, which is important as they implement short critical sections in the database system code.
- Locks support many more modes than latches and provide multiple advanced services. Management of locks is separate from the protected information, for example, keys and gaps between keys in the leaf page of a B-tree index. The hash table in the lock manager is in fact protected itself by latches such that many threads can inspect or modify the lock table as appropriate.
- The principal technique for concurrency control among transactions accessing B-tree contents is key range locking. Various forms of key range locking have been designed. The most recent design permits separate locks on individual key values and on the gaps between key values, applies strict multi-granularity locking to each pair of a key and a neighboring gap, reduces lock manager invocations by using additional lock modes that can be derived automatically, enables increment locks in grouped summary views, and exploits ghost records not only for deletions but also for insertions.