Main Memory Database Systems: An Overview

Reference: Garcia-Molina, Hector, and Kenneth Salem. "Main memory database systems: An overview." IEEE Transactions on knowledge and data engineering 4.6 (1992): 509-516.

This is one of several papers belonging to the suggested readings on In-Memory Databases for CMU 15-721: Database Systems.

0. Abstract

Memory resident database systems (MMDBs) store their data in main physical memory and provide very high-speed access. Conventional database systems are optimized for the particular characteristics of disk storage mechanisms. Memory resident systems, on the other hand, use different optimizations to structure and organize data, as well as to make it reliable. This paper surveys the major memory residence optimizations and briefly discusses some of the memory resident systems that have been designed or implemented.

1. Introduction

  • MMDB (Main Memory Database System): the primary copy lives permanently in memory.
  • Differences between an MMDB and a DRDB (Disk-Resident Database System) with a very large cache: (1) index structures; (2) the buffer manager.
  • Memory backups will have to be taken relatively frequently, and the performance of the backup mechanism will be of central importance.

2. Impact of Memory Resident Data

2.1 Concurrency Control

  • Focus on Locking Concurrency Control
  • If contention is already low because data are memory resident, very large lock granules (e.g., relations) are most appropriate. In the extreme, the lock granule could be the entire database, which amounts to serial execution of transactions and greatly reduces the number of CPU cache flushes.
  • However, serial transactions are probably not practical when long transactions (e.g., conversational transactions) are present. Further, multiprocessor systems may require some form of concurrency control even if all transactions are short.
  • Since objects reside in memory, a small number of bits within each object can be set aside to represent its lock status, so acquiring and releasing a lock can be done with a minimal number of machine instructions.
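The lock-bit idea can be sketched as follows. This is an illustrative Python sketch, not code from the paper: the `Record` class, `try_lock`/`unlock` names, and the bit layout are all assumptions, and a real system would use an atomic test-and-set instruction rather than plain reads and writes.

```python
# Illustrative sketch (assumed design, not the paper's code): lock state
# is embedded as a bit in each object's header word, so lock and release
# reduce to a few bit operations on the in-memory object itself.
X_LOCK = 0b1            # exclusive-lock bit in the header word

class Record:
    def __init__(self, value):
        self.header = 0     # lock bits live alongside the data
        self.value = value

def try_lock(rec):
    """Acquire the exclusive lock if it is free. A real MMDB would use
    an atomic test-and-set instruction here instead of two steps."""
    if rec.header & X_LOCK:
        return False        # already held by another transaction
    rec.header |= X_LOCK
    return True

def unlock(rec):
    rec.header &= ~X_LOCK   # clear the lock bit

r = Record("tuple data")
assert try_lock(r)          # first acquire succeeds
assert not try_lock(r)      # second attempt fails: already locked
unlock(r)
assert try_lock(r)          # free again after release
```

The point is that no separate lock table lookup is needed: because the object is memory resident, its lock state is one pointer dereference away.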

2.2 Commit Processing

Problem

Before a transaction can commit, its activity records must be written to the log, which resides in stable storage (e.g., redundant disks). However, logging can impact response time and throughput.

Solutions

  1. A small amount of stable main memory can be used to hold a portion of the log; a special process or processor is then responsible for copying data from the stable memory to the log disks. This eliminates the response-time problem but does not relieve the log bottleneck.
  2. Transactions can be pre-committed: a transaction's locks are released as soon as its log record is placed in the log, without waiting for the record to be propagated to disk. This reduces the blocking delays of other concurrent transactions.
  3. Group commits can be used to relieve the log bottleneck: the log records of several transactions are accumulated in memory, and when enough have accumulated (e.g., when a page is full), all are flushed to the log disk in a single disk operation.
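The group-commit idea can be sketched roughly as follows. This is an illustrative Python sketch under assumed names (`GroupCommitLog`, `PAGE_SIZE`); the physical log write is simulated by a counter rather than real disk I/O.

```python
# Illustrative group-commit sketch (assumed design, not the paper's
# code): log records from several transactions accumulate in a memory
# page and are flushed to the log disk in a single write once the page
# fills, amortizing one disk operation over many commits.
PAGE_SIZE = 4096                # assumed log-page size in bytes

class GroupCommitLog:
    def __init__(self):
        self.page = []          # pending log records
        self.page_bytes = 0
        self.flushes = 0        # number of physical log writes

    def append(self, record: bytes):
        self.page.append(record)
        self.page_bytes += len(record)
        if self.page_bytes >= PAGE_SIZE:
            self.flush()        # page is full: one write for the group

    def flush(self):
        # A single disk write covers every accumulated record.
        self.flushes += 1
        self.page.clear()
        self.page_bytes = 0

log = GroupCommitLog()
for _ in range(100):            # 100 commit records of 100 bytes each
    log.append(b"x" * 100)
log.flush()                     # flush the partially filled last page
print(log.flushes)              # → 3 writes instead of 100
```

With one write per transaction, 100 commits would cost 100 disk operations; here they cost three, which is exactly the bottleneck relief the technique targets.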

2.3 Access Methods

  • Index structures like B-Trees, which are designed for block-oriented storage, lose much of their appeal.
  • Trees such as the T-Tree have been designed explicitly for memory-resident databases. Main memory trees need not have the short, bushy structure of a B-Tree, since traversing deeper trees is much faster in main memory than on a disk.
  • The data values on which the index is built need not be stored in the index itself, as they are in a B-Tree; the index can instead hold pointers to the indexed data. The use of pointers suggests simply inverting the relation on the indexed field, which is very space efficient and reasonably fast for range and exact-match queries, although updates are slow.
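A pointer-based index can be sketched as follows. This is an illustrative Python sketch (Python references stand in for memory pointers, and the relation and helper names are assumptions), not a T-Tree implementation.

```python
# Illustrative sketch: a memory-resident index that stores references to
# tuples rather than copies of key values, i.e., an inversion of the
# relation on the indexed field. Names and data are assumed.
from bisect import bisect_left, bisect_right

employees = [
    {"id": 3, "name": "carol"},
    {"id": 1, "name": "alice"},
    {"id": 2, "name": "bob"},
]

# The index is just an ordered array of references; each key is read
# *through* the reference, so no value is duplicated in the index.
index = sorted(employees, key=lambda t: t["id"])

def exact_match(key):
    keys = [t["id"] for t in index]     # keys read via the references
    i = bisect_left(keys, key)
    return index[i] if i < len(keys) and keys[i] == key else None

def range_query(lo, hi):
    keys = [t["id"] for t in index]
    return index[bisect_left(keys, lo):bisect_right(keys, hi)]

assert exact_match(2)["name"] == "bob"
assert [t["name"] for t in range_query(1, 2)] == ["alice", "bob"]
```

Both query types are fast because binary search over the ordered reference array suffices; an insertion, by contrast, must shift references to keep the array ordered, which is why updates are slow.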

2.4 Data Representation

  • Relational tuples can be represented as a set of pointers to data values, which is space efficient (a value appearing in several tuples can be stored once) and simplifies the handling of variable-length fields.
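This representation can be sketched in a few lines. The sketch is illustrative (the `value_pool` and `intern` names are assumptions); Python references play the role of memory pointers.

```python
# Illustrative sketch: tuples stored as collections of pointers into a
# pool of data values. Each shared or variable-length value is stored
# once; every tuple field is a fixed-size reference to it.
value_pool = {}                 # maps a value to its single stored copy

def intern(v):
    """Return the pooled copy of v, storing it on first use."""
    return value_pool.setdefault(v, v)

# Two tuples sharing one long department string via pointers:
t1 = (intern("alice"), intern("research and development"))
t2 = (intern("bob"),   intern("research and development"))

# Both department fields reference the same object in memory, so the
# long string exists exactly once and the tuples stay fixed-width.
assert t1[1] is t2[1]
```

Variable-length fields stop being a layout problem: the tuple itself is always an array of fixed-size pointers, and only the pooled values vary in length.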

2.5 Query Processing

  • Query processing techniques that exploit fast sequential access, such as sort-merge join processing, lose their advantage, since in main memory sequential access is not significantly faster than random access. On the other hand, some relational operations, such as joining two relations over a common attribute, can be performed very efficiently by following pointers.
  • Query processors for memory resident data must focus on processing costs, whereas most conventional systems attempt to minimize data access. However, processing costs can be difficult to measure.

2.6 Recovery

  • Recovery: (1) the procedure used during normal database operation to keep the backup up-to-date; (2) the procedure used to recover from failure.
  • Checkpointing and failure recovery are the only reasons to access the disk-resident copy of the database; one consequence is that this disk I/O should be performed using a very large block size.
  • Solutions for restoring a large database after a failure: (1) load blocks of the database "on demand" until all of the data has been loaded; (2) use disk striping or disk arrays so that the database is spread across multiple disks and can be read back in parallel.
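The parallel-reload idea can be sketched as follows. This is an illustrative Python sketch under assumed names; the per-disk read is simulated by a function returning synthetic bytes, and threads stand in for independent disk arms.

```python
# Illustrative sketch (assumed design): a database striped across
# several disks is reloaded in parallel after a failure, one worker per
# stripe, so total reload time approaches that of a single stripe.
from concurrent.futures import ThreadPoolExecutor

NUM_STRIPES = 4                 # assumed number of disks in the array

def read_stripe(stripe_id):
    # Stand-in for one disk's large-block sequential read.
    return b"data-from-stripe-%d" % stripe_id

# Issue all stripe reads concurrently and reassemble the database.
with ThreadPoolExecutor(max_workers=NUM_STRIPES) as pool:
    database = list(pool.map(read_stripe, range(NUM_STRIPES)))

assert len(database) == NUM_STRIPES
```

The large-block-I/O observation above applies here too: each worker should issue a few very large sequential reads per stripe rather than many small ones.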

2.7 Performance

  • The performance of a main memory database manager depends primarily on processing time, which contrasts with models of disk-based systems which count I/O operations to determine the performance of an algorithm.
  • The performance of backup and checkpointing algorithms is much more critical for an MMDB.

2.8 Application Programming Interface and Protection

  • In conventional DRDBs, applications exchange data with the database management system via private buffers.
  • In an MMDB, access to objects can be more efficient: (1) applications may be given the actual memory address of an object; (2) the private buffer can be eliminated, giving transactions direct access to objects.

2.9 Data Clustering and Migration

  • In a DRDB, data objects (e.g., tuples, fields) that are accessed together are frequently stored together, or clustered. In an MMDB, memory-resident objects need not be clustered; however, when data migrate out of memory (e.g., to an archive), the system must decide how to cluster them as they are written out. Thus migration and dynamic clustering are components of an MMDB that have no counterpart in conventional database systems.

3. Systems

3.1 OBE

  • Implemented in conjunction with IBM’s Office-By-Example (OBE) database project
  • Heavy use of pointers
  • Nested-loop joins
  • Index created “on-the-fly” to compute a join
  • Focus on reducing processing costs and processor intensive activities

3.2 MM-DBMS

  • Designed at the University of Wisconsin
  • Heavy use of pointers
  • Linear Hashing and T-Trees (a type of balanced binary tree with multiple values stored at each node)
  • Memory is divided into large self-contained blocks for recovery purposes, which are also the units of transfer to and from the backup disk resident database copy. After a failure, blocks are brought back into memory on demand
  • Two-phase locking for concurrency control with large lock granules (entire relations)

3.3 IMS/VS Fast Path

  • A commercial database product from IBM which supports memory and disk resident data
  • Group commits
  • Record-granularity locking
  • Designed to handle very frequently accessed data, supporting VERIFY/CHANGE operations for frequently updated objects

3.4 MARS

  • Designed at Southern Methodist University
  • A database processor and a recovery processor
  • Two-phase locking for concurrency control with large lock granules (entire relations)

3.5 HALO

  • HArdware LOgging (HALO) is a proposed special-purpose device that transparently off-loads transaction logging from the processors that execute transactions
  • HALO intercepts communications between a processor and memory controllers to produce a word-level log of all memory updates.

3.6 TPK

  • A prototype multiprocessor main-memory transaction processing system implemented at Princeton University
  • Consisted of a set of concurrent threads of four types: input, execution, output, and checkpoint
  • The execution thread executes transactions serially, thereby eliminating the need for transaction concurrency control
  • Two copies of the database(primary and secondary) are retained in-memory. The purpose of the secondary database is to eliminate data contention between the checkpoint and execution threads during the checkpoint operation
  • Group commits

3.7 System M

  • A transaction processing testbed system developed at Princeton
  • Implemented as a collection of cooperating servers (threads) on the Mach operating system; it processes transactions concurrently while attempting to keep the number of active transactions small
  • Pre-commit and group commits
  • As in MM-DBMS, the primary database copy is divided into self-contained fixed-size segments, which are the units of transfer to and from the backup disk
  • Fuzzy and consistent checkpoints
  • Physical and logical logging

4. Conclusion

  • Five-Minute Rule: data that are referenced at least once every five minutes should be memory resident (assuming 1-KB disk blocks). The rule is derived by analyzing the dollar cost of accessing data in memory versus on disk. As the price of a byte of main memory drops relative to the cost of a disk access per second, the break-even interval grows.
  • The five-minute rule implies that memory resident database systems will become more common in the future, and hence these mechanisms and optimizations will become commonplace.
  • This paper was written in 1992, and most of its predictions have come true.
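The break-even arithmetic behind the rule can be worked through in a few lines. The prices below are rough late-1980s figures assumed purely for illustration; they are not taken from this paper.

```python
# Worked example of the dollar-cost break-even behind the five-minute
# rule. Both prices are assumed illustrative values, not figures from
# the surveyed paper.
cost_per_access_per_sec = 2000.0   # $ of disk hardware per (access/second) of throughput
cost_per_kb_of_memory = 5.0        # $ per kilobyte of main memory
page_size_kb = 1                   # the rule assumes 1-KB disk blocks

# A page referenced more often than once per `break_even_interval`
# seconds is cheaper to keep in memory than to fetch from disk.
break_even_interval = cost_per_access_per_sec / (cost_per_kb_of_memory * page_size_kb)
print(break_even_interval)         # → 400.0 seconds, i.e., about five minutes
```

The trend noted above falls out of the formula: as memory gets cheaper (the denominator shrinks) relative to the cost of disk throughput, the break-even interval grows, so ever-colder data belongs in memory.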
Written on December 4, 2016