Establishing Common Ground with Data Context
Reference: Hellerstein, Joseph M., et al. "Ground: A Data Context Service." CIDR. 2017.
This is one of several papers on the suggested reading list for In-Memory Databases in CMU 15-721: Database Systems.
Ground is an open-source data context service, a system to manage all the information that informs the use of data. Data usage has changed both philosophically and practically in the last decade, creating an opportunity for new data context services to foster further innovation. In this paper we frame the challenges of managing data context with basic ABCs: Applications, Behavior, and Change. We provide motivation and design guidelines, present our initial design of a common metamodel and API, and explore the current state of the storage solutions that could serve the needs of a data context service. Along the way we highlight opportunities for new research and engineering solutions.
1. From Crisis to Opportunity
- Two conservative design patterns: (1) tight control of schemas and data ingest; (2) all components of a DBMS are designed to work closely together.
- Big Data Movement: (1) focus on open-ended schema-on-use data; (2) components are now independent and interchangeable.
1.1 Crisis: Big Metadata
- Crisis: the lack of a standard mechanism for assembling a collective understanding of the origin, scope, and usage of managed data (i.e., its metadata).
- Two significant classes of end-user problems: (1) poor productivity; (2) governance risk.
- Motivation: the need for a common service layer to support the capture, publishing and sharing of metadata information in a flexible way.
1.2 Opportunity: Data Context
Three Key Sources of Information
- Applications: the core information that describes how raw bits get interpreted for use.
- Behavior: the information about how data was created and used over time.
- Change: the information about the version history of data and associated code, including changes over time to both structure and content.
2. Ground: Scenarios and Design
2.1 Scenario: Context-Enabled Analytics
2.2 Design Requirements
- Model-Agnostic: the context service cannot prescribe how metadata is modeled.
- Immutable: the context service should keep history.
- Scalable: in many Big Data settings, it is reasonable to envision the data context being far larger than the data itself.
- Politically Neutral
3. Architecture of Ground
3.1 Key Services
- Ingest (Insertion, Crawlers, and Queues): metadata may be pushed into Ground or may require crawling, and it is passed in through queues.
- Versioned Metadata Storage: flexible version management of code and data, general-purpose model graphs and lineage storage.
- Search and Analyze
- Identity and Authorization
- Scheduling, Workflow, Reproducibility: integrate with a variety of schedulers and execution frameworks including on-premises and cloud-hosted approaches.
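To make the Ingest bullet above concrete, here is a minimal sketch of the two arrival paths (push and crawl) feeding one queue that is drained into an append-only store. Every name here (`crawl`, `ingest`, the record fields, the in-memory list used as a store) is hypothetical and chosen for illustration; none of it is Ground's actual API.

```python
import queue


def crawl(source):
    """Hypothetical crawler: yield one metadata record per item in a source."""
    for name, meta in source.items():
        yield {"item": name, "metadata": meta, "via": "crawl"}


def ingest(q, store):
    """Drain the queue into an append-only store (here just a list)."""
    while True:
        try:
            record = q.get_nowait()
        except queue.Empty:
            break
        store.append(record)  # records are only appended, never overwritten
    return store


q = queue.Queue()
# Path 1: a client pushes metadata directly.
q.put({"item": "orders", "metadata": {"owner": "sales"}, "via": "push"})
# Path 2: a crawler discovers metadata in an external source.
for rec in crawl({"clicks": {"format": "json"}}):
    q.put(rec)

store = ingest(q, [])
```

Both paths end up in the same queue, so the storage layer does not need to know how a record arrived.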
3.2 The Common Ground Metamodel
Version Graphs: Representing Change
- This layer bootstraps the representation of all information in Ground by providing the classes on which all other layers are based. The main atom of the metamodel is the Version, a globally unique identifier that represents an immutable version of some object. Versions are depicted by the small circles in the bottom layer of Figure 2 and are organized as a DAG.
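A version DAG of this kind can be sketched in a few lines. This is only an illustration under stated assumptions: the class `VersionGraph` and its methods are hypothetical names, not Ground's API, and the ids are random UUIDs standing in for "globally unique identifiers".

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    """An immutable version: nothing but a globally unique identifier."""
    id: str


class VersionGraph:
    """Append-only DAG of versions (hypothetical name, not Ground's API)."""

    def __init__(self):
        self._parents = {}  # version id -> frozenset of parent version ids

    def add_version(self, parents=()):
        parents = tuple(parents)
        # A new version may only point at versions that already exist,
        # so the graph stays acyclic by construction.
        for p in parents:
            if p.id not in self._parents:
                raise KeyError(f"unknown parent version: {p.id}")
        v = Version(uuid.uuid4().hex)
        self._parents[v.id] = frozenset(p.id for p in parents)
        return v

    def ancestors(self, version):
        """Return the ids of every version reachable through parent links."""
        seen, stack = set(), list(self._parents[version.id])
        while stack:
            cur = stack.pop()
            if cur not in seen:
                seen.add(cur)
                stack.extend(self._parents[cur])
        return seen


g = VersionGraph()
root = g.add_version()
left = g.add_version([root])            # one branch
right = g.add_version([root])           # a parallel branch
merged = g.add_version([left, right])   # a merge has two parents
```

Because a version has multiple possible parents, the history is a DAG rather than a linear chain, which is what allows branching and merging.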
Model Graphs: Application Context
- Ground uses a graph model for flexibility: graphs can represent metadata entities and relationships from semi-structured (e.g., JSON, XML) and structured (e.g., relational, OO, matrix) data models.
- Graph Element: Node, Edge, Graph, Structure
- External Items and Schrödinger Versioning
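As a rough illustration of the four graph elements listed above, the sketch below models a relational table and its columns as nodes and edges, and uses a Structure to check that a node carries a required set of tags. The field names, the tag dictionary, and the `conforms` method are assumptions made for this example, not Ground's API; versioning and external items (including Schrödinger versioning) are deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List


@dataclass
class Node:
    name: str
    tags: Dict[str, str] = field(default_factory=dict)


@dataclass
class Edge:
    name: str
    source: Node
    target: Node


@dataclass
class Graph:
    name: str
    edges: List[Edge] = field(default_factory=list)

    def nodes(self):
        """Collect the distinct nodes touched by this graph's edges."""
        seen = {}
        for e in self.edges:
            seen.setdefault(e.source.name, e.source)
            seen.setdefault(e.target.name, e.target)
        return list(seen.values())


@dataclass(frozen=True)
class Structure:
    """A named set of tags that a conforming element must carry."""
    name: str
    required_tags: FrozenSet[str]

    def conforms(self, node):
        return self.required_tags.issubset(node.tags)


# A relational table described as a small model graph.
table = Node("orders", {"kind": "table"})
col_id = Node("orders.id", {"kind": "column", "type": "int"})
col_total = Node("orders.total", {"kind": "column"})
schema = Graph("orders_schema", [
    Edge("has_column", table, col_id),
    Edge("has_column", table, col_total),
])
column_structure = Structure("column", frozenset({"kind", "type"}))
```

The same node and edge types could describe a JSON document or a matrix; only the tags and the edges change, which is the flexibility the graph model is meant to provide.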
Lineage Graphs: Behavior
- The goal of the lineage graph layer is to capture usage information composed from the nodes and edges in the model graph.
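One way to picture the lineage layer is as a set of directed edges between versions, each optionally labeled with the process that produced the target. The sketch below is a minimal illustration under that assumption; `LineageGraph`, `record`, and `provenance` are hypothetical names, and the version ids are plain strings rather than real Ground versions.

```python
from collections import defaultdict


class LineageGraph:
    """Records which versions were derived from which (hypothetical sketch)."""

    def __init__(self):
        # target version id -> set of (source version id, process label)
        self._upstream = defaultdict(set)

    def record(self, source, target, process=None):
        """Note that `target` was produced from `source` by `process`."""
        self._upstream[target].add((source, process))

    def provenance(self, target):
        """Return every version id that `target` transitively depends on."""
        seen, stack = set(), [target]
        while stack:
            cur = stack.pop()
            # .get avoids inserting empty entries while traversing
            for src, _process in self._upstream.get(cur, ()):
                if src not in seen:
                    seen.add(src)
                    stack.append(src)
        return seen


lineage = LineageGraph()
lineage.record("raw_logs@v1", "sessions@v1", process="sessionize.py@v3")
lineage.record("sessions@v1", "report@v1", process="weekly_report.sql@v7")
lineage.record("users@v2", "report@v1", process="weekly_report.sql@v7")
```

Asking for the provenance of `report@v1` walks the edges backwards and returns all three upstream versions, which is the kind of usage question the lineage layer is meant to answer.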