# MLbase: A Distributed Machine-learning System

Reference: Kraska, Tim, et al. "MLbase: A Distributed Machine-learning System." CIDR. Vol. 1. 2013.

## 0. Abstract

Machine learning (ML) and statistical techniques are key to transforming big data into actionable knowledge. In spite of the modern primacy of data, the complexity of existing ML algorithms is often overwhelming: many users do not understand the trade-offs and challenges of parameterizing and choosing between different learning techniques. Furthermore, existing scalable systems that support machine learning are typically not accessible to ML researchers without a strong background in distributed systems and low-level primitives. In this work, we present our vision for MLbase, a novel system harnessing the power of machine learning for both end-users and ML researchers. MLbase provides (1) a simple declarative way to specify ML tasks, (2) a novel optimizer to select and dynamically adapt the choice of learning algorithm, (3) a set of high-level operators to enable ML researchers to scalably implement a wide range of ML methods without deep systems knowledge, and (4) a new run-time optimized for the data-access patterns of these high-level operators.

## 1. Introduction

Existing systems provide little or no help for applying machine learning on Big Data.

Four Major Foci of MLbase’s Design

• MLbase encompasses a new Pig Latin-like declarative language to specify machine learning tasks.
• MLbase uses a novel optimizer to select machine learning algorithms, where we leverage best practices in ML and build a sophisticated cost-based model.
• We aim to provide answers early and improve them in the background, continuously refining the model and re-optimizing the plan.
• We design a distributed run-time optimized for the data-access patterns of machine learning.

## 2. Use Cases

### 2.1 ALS Prediction

• The ALS Prediction Prize challenges participants to develop a binary classifier to predict whether an ALS patient will display delayed disease progression.
• The language hides two key issues from the user: (i) which algorithms and parameters the system should use and (ii) how the system should test the model or distribute computation across machines.
• It is the responsibility of MLbase to find, train and test the model, returning a trained classifier function as well as a summary about its performance to the user.
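
A minimal Python sketch of the declarative idea: the caller hands over features and labels and gets back a trained function plus a summary. The name `do_classify`, the two candidate learners, the hold-out split, and the toy data are all assumptions made for illustration; this is not MLbase's actual language or optimizer.

```python
import random
from statistics import mean

def majority_learner(X, y):
    """Always predict the most common training label."""
    label = max(set(y), key=y.count)
    return lambda row: label

def centroid_learner(X, y):
    """Predict the label whose class centroid is nearest (squared Euclidean)."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    def predict(row):
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])),
        )
    return predict

def do_classify(X, y, holdout=0.25, seed=0):
    """Hypothetical declarative entry point: try every candidate, keep the best."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - holdout))
    train, test = idx[:cut], idx[cut:]
    candidates = {"majority": majority_learner, "nearest_centroid": centroid_learner}
    results = {}
    for name, learner in candidates.items():
        model = learner([X[i] for i in train], [y[i] for i in train])
        accuracy = mean(1.0 if model(X[i]) == y[i] else 0.0 for i in test)
        results[name] = (accuracy, model)
    best = max(results, key=lambda n: results[n][0])
    summary = {
        "chosen": best,
        "holdout_accuracy": {n: r[0] for n, r in results.items()},
    }
    return results[best][1], summary

# Toy data (assumed): the label is 1 when the first feature is large.
X = [[float(i), float(i % 3)] for i in range(40)]
y = [1 if row[0] >= 20 else 0 for row in X]
fn_model, summary = do_classify(X, y)
```

The caller never names an algorithm or a validation scheme; both choices live behind `do_classify`, which is the separation the bullets above describe.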

### 2.2 Music Recommendation

• This is a collaborative filtering problem: Specifically, we receive an incomplete observation of a ratings matrix, with columns corresponding to users and rows corresponding to songs, and we aim to infer the unobserved entries of this ratings matrix.
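
To make the matrix-completion framing concrete, the sketch below fits a low-rank factorization to the observed entries with stochastic gradient descent and then predicts a missing entry. These notes do not say which algorithm MLbase would pick here; the function names `factorize` and `predict`, the rank, the learning rate, and the toy ratings are all assumptions.

```python
import random

def factorize(observed, n_rows, n_cols, rank=2, lr=0.05, reg=0.01,
              epochs=2000, seed=0):
    """Fit R ~ P @ Q^T on the observed (row, col) -> rating entries with SGD."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_rows)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_cols)]
    entries = list(observed.items())
    for _ in range(epochs):
        rng.shuffle(entries)
        for (i, j), r in entries:
            err = r - sum(p * q for p, q in zip(P[i], Q[j]))
            for k in range(rank):
                p, q = P[i][k], Q[j][k]
                # Gradient step on squared error with L2 regularization.
                P[i][k] += lr * (err * q - reg * p)
                Q[j][k] += lr * (err * p - reg * q)
    return P, Q

def predict(P, Q, i, j):
    """Inferred rating for song i (row) and user j (column)."""
    return sum(p * q for p, q in zip(P[i], Q[j]))

# Toy ratings (assumed): rows are songs, columns are users.
full = [
    [5, 5, 1, 1],
    [5, 5, 1, 1],
    [1, 1, 5, 5],
    [1, 1, 5, 5],
]
hidden = {(0, 1), (3, 2)}  # entries treated as unobserved
observed = {
    (i, j): float(full[i][j])
    for i in range(4) for j in range(4) if (i, j) not in hidden
}

P, Q = factorize(observed, n_rows=4, n_cols=4)
guess = predict(P, Q, 0, 1)  # the true hidden value is 5
```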

### 2.3 Twitter Analysis

• MLbase provides facilities for graph-structured data and combines feature extraction with database joins.

### 2.4 ML Research Platform

• We envision ML experts using MLbase as a platform to experiment with new ML algorithms. MLbase has the advantage that it offers a set of high-level primitives to simplify building distributed machine learning algorithms without knowing the details about data partitioning, message passing, or load balancing.
• We allow the ML expert to inspect the execution plan using a database-like explain function and steer the optimizer using hints, making it an ideal platform to easily set up experiments.

## 3. Architecture

Figure 1 shows the general architecture of MLbase, which consists of a master and several worker nodes.

• The system parses the request into a logical learning plan (LLP), which describes the most general workflow to perform the ML task.
• The search space of the LLP (e.g., ML algorithms, featurization techniques, algorithm parameters, and data sub-sampling strategies) is too large to explore exhaustively, so an optimizer prunes it to find a strategy that is testable in a reasonable time frame.
• After constructing the optimized logical plan, MLbase transforms it into a physical learning plan (PLP) to be executed.
• In contrast to an LLP, a PLP specifies exactly the parameters to be tested as well as the data (sub)sets to be used. The MLbase master distributes these operations onto the worker nodes, which execute them through the MLbase runtime.
• MLbase will further improve the model in the background via additional exploration.
• MLbase will be extensible, so that novel ML algorithms can be added.
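
The flow from LLP to PLP to workers can be sketched roughly as follows. Everything here is hypothetical: the dictionary encoding of the LLP, the `PhysicalTask` class, the "keep the first few values" pruning rule, the example algorithms and parameters, and the round-robin master are stand-ins chosen for brevity, not MLbase's data structures.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class PhysicalTask:
    """One concrete unit of work in a PLP: fixed algorithm, parameters, data fraction."""
    algorithm: str
    params: tuple
    sample_fraction: float

# Hypothetical LLP: each algorithm maps to the parameter values it *could* use.
llp = {
    "svm": {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]},
    "adaboost": {"rounds": [10, 50, 100]},
}

def optimize(logical_plan, max_values_per_param=2):
    """Toy pruning rule (an assumption): keep only the first few values per parameter."""
    return {
        algo: {name: values[:max_values_per_param] for name, values in space.items()}
        for algo, space in logical_plan.items()
    }

def to_physical_plan(logical_plan, sample_fractions=(0.1, 1.0)):
    """Expand the pruned plan into fully specified tasks."""
    tasks = []
    for algo, space in logical_plan.items():
        names = sorted(space)
        for combo in product(*(space[n] for n in names)):
            for frac in sample_fractions:
                tasks.append(PhysicalTask(algo, tuple(zip(names, combo)), frac))
    return tasks

def distribute(tasks, n_workers):
    """Round-robin assignment of tasks to workers, standing in for the master."""
    return {w: tasks[w::n_workers] for w in range(n_workers)}

plp = to_physical_plan(optimize(llp))
assignment = distribute(plp, n_workers=3)
```

The LLP only lists what could be tried, whereas every `PhysicalTask` pins down exactly one configuration and one data subset.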

## 4. Query Optimization

### 4.1 Logical Learning Plan

• The LLP specifies the combinations of parameters, algorithms, and data sub-sampling the system must evaluate and cross-validate to test quality.

### 4.2 Optimization

• The optimizer transforms the LLP into an optimized plan, with concrete parameters and data sub-sampling strategies, that can be executed on our run-time.
• To meet time constraints, the optimizer estimates execution time and algorithm performance (i.e., quality) based on statistical models, also taking advantage of pruning heuristics and newly developed on-line model selection tools.
• MLbase allows user-specified hints that can influence the optimizer, much as users can steer the optimizer in database systems.
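
One way to picture cost-based selection under a time constraint is the greedy sketch below. The linear cost model, the quality estimates, the candidate names, and the `exclude` hint are all invented for illustration; the notes above say only that the optimizer relies on statistical models, pruning heuristics, and hints.

```python
def estimate_time(config, n_rows):
    """Assumed cost model: seconds grow linearly with the number of rows scanned."""
    return config["seconds_per_million_rows"] * (n_rows / 1_000_000) * config["sample_fraction"]

def choose_plan(candidates, n_rows, time_budget, hints=None):
    """Greedily pick configurations by estimated quality per second until the budget is spent."""
    hints = hints or {}
    excluded = set(hints.get("exclude", []))
    feasible = [c for c in candidates if c["algorithm"] not in excluded]
    ranked = sorted(
        feasible,
        key=lambda c: c["expected_quality"] / estimate_time(c, n_rows),
        reverse=True,
    )
    plan, used = [], 0.0
    for config in ranked:
        cost = estimate_time(config, n_rows)
        if used + cost <= time_budget:
            plan.append(config)
            used += cost
    return plan, used

# Hypothetical candidates with made-up cost and quality estimates.
candidates = [
    {"name": "svm-full", "algorithm": "svm", "seconds_per_million_rows": 2.0,
     "sample_fraction": 1.0, "expected_quality": 0.90},
    {"name": "svm-10%", "algorithm": "svm", "seconds_per_million_rows": 2.0,
     "sample_fraction": 0.1, "expected_quality": 0.80},
    {"name": "adaboost-full", "algorithm": "adaboost", "seconds_per_million_rows": 5.0,
     "sample_fraction": 1.0, "expected_quality": 0.92},
    {"name": "logreg-full", "algorithm": "logreg", "seconds_per_million_rows": 1.0,
     "sample_fraction": 1.0, "expected_quality": 0.85},
]

plan, used = choose_plan(candidates, n_rows=1_000_000, time_budget=3.5)
hinted_plan, _ = choose_plan(
    candidates, n_rows=1_000_000, time_budget=3.5, hints={"exclude": ["logreg"]}
)
```

In this toy setup the cheap sub-sampled run ranks first, and the most expensive configuration is dropped because it does not fit the budget. The hint removes one algorithm from consideration without the user having to choose the rest of the plan.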

## 5. Runtime

• MLbase’s run-time supports a simple set of data-centric primitives for machine learning tasks. The physical learning plan (PLP) composes these primitives together to build potentially complex workflows.
• The master’s responsibility is to distribute these primitives to the workers for their execution, to monitor progress, and to take appropriate actions in the case of a node failure.
• The run-time supports the main relational operators (predicate filters, projections, and joins) as well as simple transformations that apply a higher-order function (similar to a map in the map-reduce paradigm).
• Relaxing consistency can in some cases improve the convergence rate and result in significantly fewer iterations.
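
As a rough illustration of composing data-centric primitives into a workflow, the sketch below builds a gradient-descent loop from a map-like higher-order function, a predicate filter, and a reduce. The names `map_partitions`, `filter_rows`, and `gradient_step` are hypothetical, the partitions are plain Python lists rather than data on worker nodes, and the example uses fully synchronous updates (it does not demonstrate relaxed consistency).

```python
from functools import reduce

def map_partitions(fn, partitions):
    """Apply fn to every partition; on a cluster each call would run on a worker."""
    return [fn(part) for part in partitions]

def filter_rows(pred, partitions):
    """Predicate filter expressed through the same higher-order primitive."""
    return map_partitions(lambda part: [row for row in part if pred(row)], partitions)

def partial_gradient(w):
    """Build a function returning one partition's (gradient sum, row count) for y ~ w * x."""
    def fn(part):
        return sum(2 * (w * x - y) * x for x, y in part), len(part)
    return fn

def gradient_step(w, partitions, lr):
    """One synchronous step: map over partitions, then combine the partial results."""
    partials = map_partitions(partial_gradient(w), partitions)
    grad_sum, n = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), partials)
    return w - lr * grad_sum / n

# Two partitions of (x, y) pairs drawn from y = 3x, plus one row with a missing value.
partitions = [[(1.0, 3.0), (2.0, 6.0), (None, 1.0)], [(3.0, 9.0), (4.0, 12.0)]]
clean = filter_rows(lambda row: row[0] is not None, partitions)

w = 0.0
for _ in range(200):
    w = gradient_step(w, clean, lr=0.01)
```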

Written on May 18, 2017