The Case for Interactive Data Exploration Accelerators (IDEAs)
Reference: Crotty, Andrew, et al. "The case for interactive data exploration accelerators (IDEAs)." Proceedings of the Workshop on Human-In-the-Loop Data Analytics. ACM, 2016.
Enabling interactive visualization over new datasets at “human speed” is key to democratizing data science and maximizing human productivity. In this work, we first argue why existing analytics infrastructures do not support interactive data exploration and then outline the challenges and opportunities of building a system specifically designed for interactive data exploration. Finally, we present an Interactive Data Exploration Accelerator (IDEA), a new type of system for interactive data exploration that is specifically designed to integrate with existing data management landscapes and allow users to explore their data instantly without expensive data preparation costs.
- Interactive Data Exploration Accelerator (IDEA): a system that connects to existing data management infrastructures in order to speed up query processing for visual data exploration tools like Vizdom.
- We propose a new breed of systems that support: (1) immediate exploration of new datasets without the need for expensive data preparation; (2) visual input and output; (3) complex analytics tasks like machine learning (ML); (4) “conversational” user interactions with early results that progressively refine over time.
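Point (4), early results that refine progressively, can be illustrated with a minimal Python sketch of a running aggregate. This is not the paper's implementation: the function name `progressive_mean`, the batch size, and the normal-approximation error margin are assumptions made for illustration, and the sketch assumes rows arrive in random order so that each prefix behaves like a random sample.

```python
import math
import random


def _margin(n, m2, z):
    """Half-width of an approximate confidence interval for the mean."""
    if n < 2:
        return float("inf")
    sample_variance = m2 / (n - 1)
    return z * math.sqrt(sample_variance / n)


def progressive_mean(values, batch_size=1000, z=1.96):
    """Yield (rows_seen, estimate, margin) after every batch of rows.

    Uses Welford's online algorithm, so no rows need to be kept in memory.
    """
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n % batch_size == 0:
            yield n, mean, _margin(n, m2, z)
    # Emit a final result if the last batch was incomplete.
    if n and n % batch_size != 0:
        yield n, mean, _margin(n, m2, z)


if __name__ == "__main__":
    rng = random.Random(0)  # hypothetical data: 5000 noisy measurements
    data = [rng.gauss(50.0, 10.0) for _ in range(5000)]
    for seen, estimate, margin in progressive_mean(data):
        print(f"{seen:>5} rows: mean ~ {estimate:.2f} +/- {margin:.2f}")
```

A front end could redraw its chart on every yielded tuple, giving the user an approximate answer almost immediately and a tighter one as more data is read.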
2. Challenges and Opportunities
- Interactive Latencies: even small delays of more than 500ms can significantly impact the data exploration process and the number of insights a user makes.
- Rare Data Items
- Connect and Explore
- Interactive ML
- Quantifying Risk
- Human Perception
- Think Time
- Interaction Times
- Query Sessions and Reuse
- Data Source Capabilities
- Modern Hardware
3. The A-WARE System
- Vizdom connects to A-WARE using a standard REST interface, which in turn connects to the data sources using the appropriate protocols (e.g., ODBC).
- Vizdom connects to A-WARE, which acts as an intelligent cache and streaming engine that uses Tupleware as a runtime for more complex analytics tasks.
- Tupleware is a general-purpose distributed analytics framework specifically designed for small high-performance clusters, thereby allowing A-WARE to take full advantage of modern hardware.
- A-WARE roughly divides the memory into three parts: the Result Cache, the Sample Store, and space for Indexes.
- When triggered by an initial user interaction, A-WARE begins ingesting data from the various data sources, speculatively performing operations and caching the results in the Result Cache to support possible future interactions.
- At the same time, A-WARE also caches all incoming data in the Sample Store using a compressed row format.
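The interplay between ingestion, the Result Cache, and the Sample Store described above can be sketched in a few lines of Python. Everything here is hypothetical and chosen only for illustration: the class name `AcceleratorSketch`, its methods, the choice of count-by-value histograms as the speculative operation, and JSON plus zlib as a stand-in for a compressed row format. The notes do not specify how A-WARE implements any of these.

```python
import json
import zlib
from collections import Counter


class AcceleratorSketch:
    """Toy model of the three memory regions (all names are assumptions)."""

    def __init__(self):
        self.result_cache = {}  # query key -> partial result
        self.sample_store = []  # compressed batches of rows
        self.indexes = {}       # reserved space; not populated in this sketch

    def ingest(self, rows, speculative_columns):
        """Cache a batch of rows and speculatively update likely results."""
        # Keep the raw rows in a compressed row format.
        blob = zlib.compress(json.dumps(rows).encode("utf-8"))
        self.sample_store.append(blob)
        # Speculatively maintain histograms the user might ask for next.
        for col in speculative_columns:
            counts = self.result_cache.setdefault(("count_by", col), Counter())
            counts.update(row[col] for row in rows)

    def count_by(self, col):
        """Answer from the Result Cache, else scan the Sample Store."""
        key = ("count_by", col)
        if key not in self.result_cache:
            counts = Counter()
            for blob in self.sample_store:
                for row in json.loads(zlib.decompress(blob).decode("utf-8")):
                    counts[row[col]] += 1
            self.result_cache[key] = counts
        return dict(self.result_cache[key])


if __name__ == "__main__":
    acc = AcceleratorSketch()
    acc.ingest([{"city": "Boston", "year": 2015},
                {"city": "Providence", "year": 2015}], ["city"])
    acc.ingest([{"city": "Boston", "year": 2016}], ["city"])
    print(acc.count_by("city"))  # served from speculative results
    print(acc.count_by("year"))  # computed from the Sample Store, then cached
```

The design choice worth noting is the fallback path: a query that was not anticipated is still answerable from the cached rows without going back to the data source, and its result is then kept for later interactions.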
3.2 Research Findings and Contributions
- Neither a DBMS nor a Streaming Engine
- Visual Indexes
- Sample Management
- Probability Formulation