Weld: A Common Runtime for High Performance Data AnalyticsReference: Palkar, Shoumik, et al. "Weld: A common runtime for high performance data analytics." Conference on Innovative Data Systems Research (CIDR). 2017.
Modern analytics applications combine multiple functions from different libraries and frameworks to build increasingly complex workflows. Even though each function may achieve high performance in isolation, the performance of the combined workflow is often an order of magnitude below hardware limits due to extensive data movement across the functions. To address this problem, we propose Weld, a runtime for data-intensive applications that optimizes across disjoint libraries and functions. Weld uses a common intermediate representation to capture the structure of diverse data-parallel workloads, including SQL, machine learning and graph analytics. It then performs key data movement optimizations and generates efficient parallel code for the whole workflow. Weld can be integrated incrementally into existing frameworks like TensorFlow, Apache Spark, NumPy and Pandas without changing their user-facing APIs. We show that Weld can speed up these frameworks, as well as applications that combine them, by up to 30x.
- To avoid data movement between different libraries, functions, runtimes, we propose Weld.
- Two key ideas: (1) Weld makes libraries express their work using a functional-like intermediate representation (IR) that is highly amenable to cross-library optimizations such as loop fusion, vectorization, data layout changes, and loop tiling; (2) Weld offers a runtime API based on lazy evaluation that lets applications build up a Weld computation by calling different libraries, and then optimizes across them.
- We argue that as applications start to use more and more disjoint libraries, one size will have to fit all to achieve bare-metal performance across them, since data movement is costly and getting worse.
2. Design Philosophy
- It is difficult and inefficient to extend RDBMS with new data types or operators.
- Work with independently written libraries.
- Enable the most impactful cross-library optimizations: data movement and parallelism.
- Integrate incrementally into existing systems.
3. System Overview
- An intermediate representation (IR) that captures the structure of common data-parallel algorithms, and supports rich optimizations across them. The minimal IR is sufficiently general to support functional APIs such as Spark, relational operators, linear algebra and graph algorithms.
- A runtime API that lets libraries expose parts of their computation as Weld IR fragments.
- A compiler backend that maps the final, combined Weld IR program to efficient multithreaded code. We implemented our current backend using LLVM. Because the Weld IR is explicitly parallel, our backend automatically implements multithreading and vectorization using Intel AVX2.
4. WELD IR
Properties of IR
- Library Composition
- Explicit Parallelism
Ability to Express Optimizations
- Relation algebra expressions, contain explicitly parallel operators but do not support complex composition such as nesting.
- Compiler IRs such as LLVM are not explicitly parallel, and would require us to infer parallel structure from low-level operations such as loads and stores.
- Thus, we developed a simple parallel IR similar to monad comprehensions. Our IR is based on parallel loops and a construct for merging results called “builders”.
4.1 Data Model
- Weld’s IR contains basic operators for arithmetic, assigning names to values, sequential looping, and read operations on collections (e.g., lookups into a hash table). It also contains a foreign function interface for calling external C functions.
- Our IR uses Static Single Assignment form, meaning that variables are immutable once defined, simplifying its analysis.
- Two parallel operators: parallel loop and builder. A builder is a declarative data type that computes a result in parallel. Builders are write-only, build-once; expressions such as parallel loops can merge values into a builder, but a final result can only be materialized once, after these merges are done.
- merge(b, v) adds a new value v into the builder b and returns a new builder to represent the result.
- result(builder) destroys the builder and returns its final result.
- for(vector, builders, func) operator applies a function of type (builders, T) => builders to each element of a vector in parallel, updating one more builders for each one, and returns the final set of builders.
4.3 Generality of the IR
- IR could support expressions in MapReduce, Spark, MADlib
- Since IR is fully deterministic, it cannot express asynchronous algorithms where threads race to update a result, such as Hogwild!.
4.4 Why Loops and Builders?
5. Runtime API
To use Weld for this program, methods in the DataFrame class in Pandas must be extended to return a lazily evaluated Weld object. We must also provide a Weld implementation for the > operator on DataFrame columns and for the numpy.sum function.
6. Prototype Evaluation
6.1 Performance vs. State of the Art
- Workloads: a set of TPC-H for SQL, PageRank for graph analytics, and a simple neural network called Word2Vec for machine learning.
- For each, we compare Weld to a hand-optimized C++ implementation and an existing high-performance framework.
6.2 Accelerating Existing Frameworks
- Frameworks: Spark SQL for relation queries, TensforFlow for machine learning, NumPy for linear algebra, and Pandas for data science.
6.3 Cross-Library Optimization
- Workloads: a Spark SQL query that calls a User-Defined Function (UDF) written in Scala, a Python data science workload that combines Pandas and NumPy.
6.4 Integration Effort
- Each integration required about 500 lines of onetime “glue” code per framework, which mainly involves marshalling input data, calling Weld’s API to build expressions, and subsequent un-marshalling of the output produced by Weld.
7. Related Work
- Code generation: HyPer, TupleWare
- Functional or relation operator as a parallel IR: DryadLINQ, Spark
- Low-level interfaces to diverse parallel hardware: OpenCL
8. Future Work
- Data placement
- Data access methods
- Domain-specific extensions
- Multi-query execution and optimization
- Heterogeneous hardware