Readings in Database Systems (Fifth Edition)
1. Background
- XML is outdated.
- JSON is a reasonable choice for sparse data, but it is a disaster in the making as a general hierarchical data format.
- Map-Reduce is not an architecture with any broad scale applicability, but HDFS seems useful infrastructure.
- It is now hard to find an application area where legacy row stores are competitive. As such, they deserve to be sent to the “home for retired software”.
- The basic architecture of DBMSs remains intact.
- The impending arrival of NVRAM may provide an opportunity for new architectural concepts, or a reemergence of old ones.
2. Traditional RDBMS Systems
- System R: (1) transaction manager; (2) its dynamic-programming, cost-based approach is still the gold standard for optimizer technology; (3) the SQL syntax was never well cleaned up; (4) it used a subroutine call interface (now ODBC) to couple a client application to the DBMS, which is too complicated.
- Postgres: (1) abstract data type (ADT) system; (2) the open source community has been very helpful for development; (3) it helped educate a generation of highly trained DBMS implementers.
- Gamma: (1) popularized the shared-nothing partitioned table approach to multi-node data management; (2) hash joins; (3) essentially all data warehouse systems use a Gamma-style architecture.
3. Techniques Everyone Should Know
- Selinger et al.’s foundation paper on System R enables practical query optimization by decomposing the problem into three distinct subproblems: cost estimation, relational equivalences that define a search space, and cost-based search.
- The optimizer provides an estimate for the cost of executing each component of the query, measured in terms of I/O and CPU costs, through both pre-computed statistics about the contents of each relation (stored in the system catalog) as well as a set of heuristics for determining the cardinality (size) of the query output (e.g., based on estimated predicate selectivity).
- Using these cost estimates, the optimizer uses a dynamic programming algorithm to construct a plan for the query. The optimizer defines a set of physical operators that implement a given logical operator, and iteratively constructs a “left-deep” tree of operators that in turn uses the cost heuristics to minimize the total amount of estimated work required to run the operators. This avoids having to consider all possible orderings of operators but is still exponential in the plan size (e.g., multi-way joins).
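The dynamic programming over left-deep plans can be sketched as follows. The table names, cardinalities, join selectivities, and the cost heuristic (sum of estimated intermediate-result sizes) below are all made up for illustration; a real optimizer also considers access paths, physical operators, and interesting orders.

```python
# A minimal sketch of Selinger-style dynamic programming for join
# ordering. Tables, cardinalities, and selectivities are hypothetical;
# the cost model (estimated intermediate-result size) is simplified.
from itertools import combinations

# Hypothetical base-table cardinalities (rows).
CARD = {"emp": 10_000, "dept": 100, "proj": 1_000}

# Hypothetical join selectivities between table pairs.
SEL = {frozenset({"emp", "dept"}): 0.01,
       frozenset({"emp", "proj"}): 0.001,
       frozenset({"dept", "proj"}): 0.05}

def join_card(left_tables, left_card, right):
    """Estimated cardinality of joining a left-deep prefix with one table."""
    sel = 1.0
    for t in left_tables:
        sel *= SEL.get(frozenset({t, right}), 1.0)  # cross product if no predicate
    return left_card * CARD[right] * sel

def best_left_deep_plan(tables):
    # best[S] = (cost, cardinality, plan) for each subset S of tables,
    # where plan is the left-deep join order as a tuple.
    best = {frozenset({t}): (0.0, CARD[t], (t,)) for t in tables}
    for size in range(2, len(tables) + 1):
        for subset in map(frozenset, combinations(tables, size)):
            candidates = []
            for right in subset:  # last table joined in a left-deep tree
                rest = subset - {right}
                cost, card, plan = best[rest]
                out = join_card(rest, card, right)
                # Cost heuristic: pay for the intermediate result produced.
                candidates.append((cost + out, out, plan + (right,)))
            best[subset] = min(candidates)
    return best[frozenset(tables)]

cost, card, plan = best_left_deep_plan(["emp", "dept", "proj"])
print(plan, round(cost))
```

Note the exponential blow-up: there are 2^n subsets, which is far fewer than the n! orderings a naive search would consider, but still too many for very large multi-way joins.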
- Multi-granularity locking.
- Multiple degrees of isolation
- The optimal strategy for concurrency control is workload-dependent.
- The ability to perform so-called “back of the envelope” calculations is a valuable skill.
- For analysis of complex systems such as concurrency control, simulation can be a valuable intermediate step between back of the envelope and full-blown systems benchmarking.
- In ARIES, the database need not write dirty pages to disk at commit time (“No Force”), and the database can flush dirty pages to disk at any time (“Steal”), which allows for high performance.
- Three stages of recovery for ARIES: (1) an analysis phase that replays the log forwards to determine which transactions were in progress at the time of the crash; (2) a redo stage that (again) replays the log and (this time) repeats history, reapplying the logged updates to restore the database to its state at the time of the crash; (3) an undo stage that plays the log backwards and undoes the effects of uncommitted transactions.
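The three phases can be illustrated with a toy in-memory log. The record format and data below are hypothetical; real ARIES tracks LSNs and a dirty-page table, writes compensation log records during undo, and works at the page level.

```python
# A toy illustration of the three ARIES recovery phases over an
# in-memory log. Each record: (txid, op, key, old_value, new_value),
# with op in {"begin", "update", "commit"}. Data is made up.
LOG = [
    ("T1", "begin",  None, None, None),
    ("T1", "update", "x",  0,    1),
    ("T2", "begin",  None, None, None),
    ("T2", "update", "y",  0,    5),
    ("T1", "commit", None, None, None),
    ("T2", "update", "x",  1,    9),
    # crash: T2 never committed
]

def recover(log):
    # 1. Analysis: scan forward to find transactions still in progress
    #    ("losers") at the time of the crash.
    losers = set()
    for txid, op, *_ in log:
        if op == "begin":
            losers.add(txid)
        elif op == "commit":
            losers.discard(txid)

    # 2. Redo: repeat history by reapplying *every* logged update,
    #    including those of the losers.
    db = {}
    for txid, op, key, old, new in log:
        if op == "update":
            db[key] = new

    # 3. Undo: scan backward and roll back the losers' updates.
    for txid, op, key, old, new in reversed(log):
        if op == "update" and txid in losers:
            db[key] = old

    return db, losers

db, losers = recover(LOG)
print(db, losers)
```

After recovery, committed T1's write to `x` survives while uncommitted T2's writes are rolled back.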
- Atomic commitment (AC), Two-Phase Commit (2PC)
- Consensus: Paxos, Viewstamped Replication, Raft, ZAB, and Multi-Paxos
4. New DBMS Architectures
- Column stores are dramatically superior to row stores in the data warehouse marketplace, because they offer: (1) less data movement from disk to main memory; (2) better compression; (3) a more efficient inner loop in a column-based executor.
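These advantages can be sketched in a few lines. The tables and values below are made up; real column stores use vectorized execution and far more sophisticated encodings (dictionary, bit-packing, delta).

```python
# A minimal sketch of why a column layout helps analytic scans: an
# aggregate touches only the scanned column, and a low-cardinality or
# sorted column compresses well with run-length encoding.

rows = [  # row store: one tuple per record; ids column included for realism
    (1, "US", 10.0), (2, "US", 20.0), (3, "EU", 30.0), (4, "EU", 40.0),
]

# Column store: one contiguous array per attribute.
ids, regions, amounts = map(list, zip(*rows))

# An aggregate over one column reads only that column's array; the
# inner loop is a tight scan over contiguous values of one type.
total = sum(amounts)

def rle(column):
    """Run-length encode a column: [(value, run_length), ...]."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(r) for r in runs]

compressed = rle(regions)  # 4 values collapse to 2 runs
print(total, compressed)
```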
- In effect, the OLTP marketplace is now becoming a main memory DBMS marketplace. Again, traditional disk-based row stores are just not competitive.
- “No-SQL” movement: (1) an “out of the box” experience, in which it is easy for a programmer to get going and do something productive; (2) support for semi-structured data.
5. Large-Scale Dataflow Engines
History and Successors
- MapReduce provides a very low-level interface (two-stage dataflow) that is closely tied to a fault-tolerant execution strategy (intermediate materialization between the two stages). Equally importantly, MapReduce was designed as a library for parallel programming rather than an end-to-end data warehousing solution.
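The two-stage dataflow can be sketched with the canonical word-count example. This single-process sketch elides what makes real MapReduce interesting: the shuffle below is where production systems partition and materialize intermediate data to disk for fault tolerance.

```python
# A minimal sketch of MapReduce's two-stage dataflow: a user-supplied
# map function, a shuffle that groups intermediate pairs by key, and a
# user-supplied reduce function.
from collections import defaultdict

def map_fn(line):
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    yield word, sum(counts)

def mapreduce(inputs, map_fn, reduce_fn):
    # Map stage, with grouping by key (the "shuffle").
    intermediate = defaultdict(list)
    for record in inputs:
        for key, value in map_fn(record):
            intermediate[key].append(value)
    # Reduce stage.
    out = {}
    for key, values in intermediate.items():
        for k, v in reduce_fn(key, values):
            out[k] = v
    return out

counts = mapreduce(["a b a", "b c"], map_fn, reduce_fn)
print(counts)  # {'a': 2, 'b': 2, 'c': 1}
```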
- DryadLINQ is most interesting for its interface: a set of embedded language bindings for data processing that integrates seamlessly with Microsoft’s .NET LINQ to provide a parallelized collections library.
Impact and Legacy
- Schema flexibility. As a result, extract-transform-load (ETL) tasks are a major workload for post-MapReduce engines.
- Interface flexibility.
- Architectural flexibility. A common critique of RDBMSs is that their architecture is too tightly coupled, with a lack of clear interfaces between components in practice.
- A dominant theme in today’s distributed data management infrastructure is flexibility and heterogeneity: of storage formats, of computation paradigms, and of systems implementations.
- Specialized systems vs. one universe analytics engine
- Spark is making fast progress and is widely expected to be the successor to MapReduce.
Commentary: Michael Stonebraker
- Map-Reduce suffers from the following two problems: (1) it is inappropriate as a platform on which to build data warehouse products; (2) it is inappropriate as a platform on which to build distributed applications, since its interface is not flexible enough and using a file system as a message-passing mechanism is too slow to be interesting.
- In summary, Map-Reduce has failed as a distributed systems platform, and vendors are using HDFS as a file system under data warehouse products.
- In effect, Spark is being used as a SQL engine, not as a distributed applications platform (according to Matei Zaharia, more than 70% of the Spark accesses are through SparkSQL).
6. Weak Isolation and Distribution
Overview and Prevalence
- Transaction processing expert Phil Bernstein suggests that serializability typically incurs a three-fold performance penalty on a single-node database compared to one of the most common weak isolation levels, Read Committed. The penalty is even worse for distributed databases.
- As a result, instead of implementing serializability, database system designers often implemented weaker models.
The Key Challenge: Reasoning about Anomalies
- Most people are not aware of the weak isolation levels.
- The specifications for these weak isolation levels are ambiguous and under-specified.
- The most compelling argument for why weak isolation seems to be “okay” in practice is that few applications today experience high degrees of concurrency.
Weak Isolation, Distribution, and “NoSQL”
- Non-serializable isolation is prevalent in practice (in both classical RDBMSs and recent NoSQL upstarts) due to its concurrency-related benefits.
- Despite this prevalence, many existing formulations of non-serializable isolation are poorly specified and difficult to use.
- Research into new forms of weak isolation shows how to preserve meaningful semantics and improve programmability without the expense of serializability.
7. Query Optimization
- System R optimizer
- Exodus, Volcano, Cascades (notable for extensive design and top-down/goal-oriented search)
Adaptive Query Processing
- Eddies: the optimizer is encapsulated as a dataflow operator that is itself interposed along other dataflow edges; it monitors the rates of the dataflows along those edges and thereby dynamically controls the remaining aspects of query planning via dataflow routing.
- Progressive Optimization: focuses on inter-operator reoptimization. Some operators are blocking, consuming their entire input before producing any output; this presents an opportunity, once the input is consumed, to compare observed statistics to optimizer predictions and reoptimize the “remainder” of the query plan using traditional query optimization techniques.
- These two architectures for adaptivity could in practice coexist.
- Query optimization was mostly ignored by the new generation of dataflow systems, but it is now starting to be deployed in most of them.
8. Interactive Analytics
- For decades, most database workloads have been partitioned into two categories: (1) many small “transaction processing” queries that do lookups and updates on a small number of items in a large database; (2) fewer big “analytic” queries that summarize large volumes of data for analysis.
Methods: pre-computation or sampling.
- The first paper reduces the computation needed to answer analytical queries: it chooses a judicious subset of queries in the cube that are worthy of pre-computation, then uses the results of those queries to compute the results of any other query in the cube.
- The second paper shows that even for relational tables, it is worthwhile to convert tables to arrays in order to run this algorithm, rather than run a (far less efficient) traditional relational algorithm. Specialized systems can inspire improvements to general-purpose systems.
- The third paper attempts to handle ad-hoc queries quickly without pre-computation by producing incrementally refining approximate answers. Online aggregation typically makes use of sampling to achieve incrementally refining results.
- The fourth paper (BlinkDB) makes use of materialized sample views: precomputed samples over base tables, stored to speed up approximate query answering.
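The online aggregation idea can be sketched as a running estimate over a random sample, with a confidence interval that narrows as more tuples are consumed. The data and parameters below are made up; real systems must also guarantee the tuples arrive in random order.

```python
# A minimal sketch of online aggregation: estimate AVG(x) from a
# growing sample, reporting a running ~95% confidence interval that
# shrinks as more rows are consumed.
import math
import random

random.seed(42)
table = [random.uniform(0, 100) for _ in range(100_000)]  # true mean ~50

def online_avg(rows, report_every=10_000):
    """Consume rows one at a time, yielding (n, estimate, half-CI)."""
    n = 0
    mean = m2 = 0.0
    reports = []
    for x in rows:
        n += 1
        delta = x - mean            # Welford's running mean/variance
        mean += delta / n
        m2 += delta * (x - mean)
        if n % report_every == 0:
            stderr = math.sqrt((m2 / (n - 1)) / n)
            reports.append((n, mean, 1.96 * stderr))  # ~95% CI half-width
    return reports

sample = table[:]
random.shuffle(sample)              # tuples must be read in random order
reports = online_avg(sample)
for n, est, ci in reports:
    print(f"after {n:6d} rows: AVG ~ {est:.2f} +/- {ci:.2f}")
```

The key user-facing property is that early reports are already usable: the analyst can stop as soon as the interval is tight enough.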
- Approximate queries via sketches are in fact very widely used by engineers and data scientists in the field today as building blocks for analysis.
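One such building block is the Count-Min sketch, shown below as a minimal sketch (the width/depth parameters and the hashing scheme are simplified choices for illustration, not a production design).

```python
# A minimal Count-Min sketch for approximate frequency counting over a
# stream: small, fixed memory; estimates never underestimate, since
# hash collisions can only inflate a counter.
import hashlib

class CountMinSketch:
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # One hash per row, derived from a keyed digest of the item.
        for row in range(self.depth):
            h = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8)
            yield row, int.from_bytes(h.digest(), "big") % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Take the minimum across rows to dampen collision noise.
        return min(self.table[row][col] for row, col in self._buckets(item))

cms = CountMinSketch()
for word in ["a"] * 100 + ["b"] * 10 + ["c"]:
    cms.add(word)
print(cms.estimate("a"), cms.estimate("b"))
```

The memory cost is fixed (width × depth counters) regardless of stream length, which is what makes sketches attractive at scale.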
9. Languages
- In practice, the majority of database users are software engineers, who build database-backed applications that are used further up the stack.
- The transaction model and declarative query languages are two important abstractions offered by a DBMS.
Database Language Embeddings: Pascal/R
- Database “connectivity” APIs: ODBC, JDBC
- Object Relation Mapping (ORM)
Stream Queries: CQL
- CQL evolves SQL just enough to isolate the key distinctions between querying “resting” tables and “moving” streams; many other stream query languages seem similar to SQL and yet are strangely different from it.
- Event programming and stream programming are quite similar, with events viewed as data.
Programming Correct Applications without Transactions: Bloom
- CALM theorem: you can find a consistent, distributed, coordination-free implementation for your program if and only if its specification is monotonic.
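The monotonic case can be illustrated concretely. Accumulating a set union only ever grows the state, so every replica converges to the same answer regardless of message delivery order, with no coordination. The update sets below are made up for illustration.

```python
# A minimal illustration of the CALM intuition: a monotonic program
# (accumulating set union) reaches the same final state under every
# possible message delivery order, so no coordination is needed.
import itertools

updates = [{"a"}, {"b"}, {"a", "c"}]

def replica_state(delivery_order):
    state = set()
    for u in delivery_order:   # union is monotonic: state only grows
        state |= u
    return state

# Every permutation of deliveries converges to the same set (union is
# also idempotent, so redelivering an update would change nothing).
states = {frozenset(replica_state(order))
          for order in itertools.permutations(updates)}
print(states)
```

A non-monotonic operation (e.g., set difference, or "is x absent?") would break this: the answer could change as more updates arrive, which is exactly when coordination becomes necessary.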
10. Web Data
- Search Engine
- Web Table
11. A Biased Take on a Moving Target: Complex Analytics
- One would expect performance to improve as one moves from the lower left to the upper right in Table 1.
- Most complex analytics reduce to a small collection of “inner loop” operations, such as matrix multiply, singular-value decomposition and QR decomposition.
- Codes that provide approximate answers are far faster than those that produce exact answers.
- High Performance Computing (HPC) hardware is generally configured to support large batch jobs.
- Scalable data science codes invariably run on multiple nodes in a computer network and are often network-bound.
- Most analytics codes that we have tested fail to scale to large data set sizes, either because they run out of main memory or because they generate temporaries that are too large.
- The structure of your analytics pipeline is crucial.
Commentary: Joe Hellerstein
- New approaches to scalability
- Distributed infrastructure for analytic services
- Analytic lifecycle and metadata management
12. A Biased Take on a Moving Target: Data Integration
- ETL (Extract, Transform, Load)
- Master Data Management (MDM)
- Schema integration
- Consolidate (deduplication)