## AutoML: Methods, Challenges and Opportunities

Recently, we got a paper about interactive AutoML accepted at SIGMOD 2019 (Shang et al., 2019). In that paper, we discussed how we tackled the problem of automated machine learning while providing interactive responses. We architected our system into several components and discussed implementations and techniques for each of them. Although, looking back, the overall design seems pretty straightforward, it took us almost a year and a half to get into the kingdom of AutoML and figure out a reasonable architecture for a practical system. But I am really glad that we eventually figured this out and got this paper accepted to SIGMOD.

Also, an awesome book about AutoML was recently released, written by several well-known researchers from the AutoML community (and their papers were really helpful for us!). Inspired by both the paper acceptance and the new book, I am going to write a blog post summarizing the methods we used or invented, the challenges we faced, and the opportunities for future research. Since this is just a blog post, some details of our methods will not be covered, so if you would like to know more about our system, please refer to our SIGMOD 2019 paper.

## 1. Introduction

First, what is AutoML? Based on my understanding, it removes humans from the tedious loop of cleaning data, preprocessing features, searching for a machine learning model, and tuning its hyper-parameters. For example, a doctor has a large volume of data (e.g., age, height, blood test results) collected from his patients, and he would like to know whether it is possible to build a machine learning pipeline that predicts whether a new patient has some disease. AutoML is a perfect solution here, as it doesn't require the user to have machine learning or computer science knowledge. In other words, it democratizes machine learning and makes it accessible to the general public. This probably explains why it is getting more and more popular.

Nevertheless, AutoML is not supposed to totally bypass humans in the building of machine learning pipelines; otherwise, even if a perfect model is found by AutoML, everything is still a black box for the user, and it is difficult to draw useful findings. This requires AutoML to leave room for human input in the decision process, especially domain knowledge. For example, a doctor can give the system hints on how to clean or process the data, and if the machine learning pipeline has good interpretability, that is even more helpful, since the doctor then has a chance to make some interesting discoveries.

Further, it doesn’t make sense for AutoML to be a standalone component. It should reside on a data exploration platform to better combine traditional data analytics and machine learning. We cannot expect our users to write code to run our AutoML system; we should have an interactive, easy-to-use GUI where users can trigger AutoML operations. In other words, a general-purpose data platform is required to make AutoML truly accessible. That’s why we have integrated our AutoML system Alpine Meadow with Northstar.

Last but not least, AutoML can help users find a reasonable ML pipeline, but there is still the so-called “last mile” to go: to achieve better performance, users should be able to hand this pipeline over to engineers or real data scientists for further improvement. For example, the pipeline found by AutoML could be exported as a Python script, and the engineering team could convert this script into an Apache Spark job to scale it to larger volumes of data.

To summarize, in my opinion, a good AutoML system should be Automated, Interpretable, Interactive, and Exportable.

## 2. Methods

For an end-to-end AutoML system, the input is essentially the task description (i.e., dataset and problem, e.g., predict the digits of the MNIST dataset), and the output is the so-called “best” pipeline. The metric used to evaluate a pipeline can be calculated from the pipeline’s predictions (e.g., accuracy or MSE), from its speed (i.e., how long it takes to train or test the pipeline), or from a combination of both.

Now that we know what the input and output are, the next problem is designing all the steps in the AutoML system. An intuitive design is to build the search space (i.e., the space of applicable pipelines), select promising pipelines from that space, evaluate them in some way, and return the best pipeline.
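This intuitive design can be sketched as a generic search loop. The sketch below is illustrative only: the `build_space`, `select`, and `evaluate` callbacks stand in for the real components discussed in the following subsections.

```python
import random

def automl_search(build_space, select, evaluate, budget):
    """Generic AutoML loop: build the search space, repeatedly select a
    candidate pipeline, evaluate it, and keep the best one found so far."""
    space = build_space()                  # all applicable pipelines for the task
    best_pipeline, best_score = None, float("-inf")
    for _ in range(budget):
        candidate = select(space)          # pick a promising pipeline
        score = evaluate(candidate)        # e.g., validation accuracy
        if score > best_score:
            best_pipeline, best_score = candidate, score
    return best_pipeline, best_score

# Toy usage: the "pipelines" are just numbers and the score is the value itself.
pipeline, score = automl_search(
    build_space=lambda: list(range(10)),
    select=random.choice,
    evaluate=lambda p: p,
    budget=50,
)
```

Everything interesting happens inside the three callbacks; the rest of this section discusses each of them in turn.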

### 2.1 Building Search Space

As far as I know, all current AutoML systems use template-like methods to some degree. Basically, we pre-define the search space for each problem type (e.g., classification or regression) and column type (e.g., different scaling methods for numerical features or different encoding methods for categorical features) in templates. At runtime, the system simply reads out the search space based on the input task.
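A minimal sketch of what such a template lookup might look like; the template contents and step names here are made up for illustration, not any particular system's actual templates.

```python
from itertools import product

# Hypothetical templates keyed by problem type; each template lists the
# candidate primitives for every pipeline step.
TEMPLATES = {
    "classification": {
        "scaler":  ["standard", "minmax", "none"],
        "encoder": ["one_hot", "ordinal"],
        "model":   ["random_forest", "logistic_regression", "svm"],
    },
}

def search_space(problem_type):
    """Enumerate every concrete pipeline the template allows."""
    template = TEMPLATES[problem_type]
    steps = list(template)
    return [dict(zip(steps, choice)) for choice in product(*template.values())]

space = search_space("classification")   # 3 * 2 * 3 = 18 candidate pipelines
```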

Our system Alpine Meadow improves on the template-based method by abstracting the construction of the space into the execution of rules (an idea we adopted from database systems). This provides more flexibility than templates, as rules can be programmed and added easily, so we are able to support multiple dataset/problem types (e.g., image classification, collaborative filtering). We further define the search space as a space of logical pipeline plans, where each logical pipeline plan is a pipeline DAG with domains of hyper-parameters (i.e., its hyper-parameters are ranges, not exact values).
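To make the rule idea concrete, here is a hedged sketch of rule-based space construction: each rule inspects the task and, when applicable, emits logical pipeline plans whose hyper-parameters are still ranges (domains), not exact values. All rule names, step names, and domains are illustrative, not Alpine Meadow's actual rules.

```python
def numeric_scaling_rule(task):
    # Fires only for classification tasks with numeric columns.
    if task["problem"] == "classification" and "numeric" in task["columns"]:
        yield {"steps": ["standard_scaler", "random_forest"],
               "domains": {"n_estimators": (10, 500), "max_depth": (2, 20)}}

def categorical_encoding_rule(task):
    # Fires for any task with categorical columns.
    if "categorical" in task["columns"]:
        yield {"steps": ["one_hot_encoder", "logistic_regression"],
               "domains": {"C": (1e-3, 1e3)}}

RULES = [numeric_scaling_rule, categorical_encoding_rule]

def build_space(task):
    # Fire every applicable rule and collect the resulting logical plans.
    return [plan for rule in RULES for plan in rule(task)]

task = {"problem": "classification", "columns": ["numeric", "categorical"]}
plans = build_space(task)   # two logical pipeline plans for this task
```

Supporting a new dataset or problem type then amounts to registering another rule function, rather than authoring a whole new template.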

### 2.2 Selecting Promising Pipelines

The search space is usually huge and heterogeneous, so simply drawing a pipeline from the search space at random is not efficient enough (though some randomness is necessary to avoid being trapped in sub-optimal regions). Some systems (e.g., Auto-sklearn (Feurer et al., 2015)) model the selection of pipelines as a hyper-parameter tuning problem; in other words, they convert the primitive structure (i.e., DAG) of a pipeline into hyper-parameters. They are then able to employ hyper-parameter tuning techniques for finding promising pipelines. For example, Bayesian Optimization has proved successful for hyper-parameter tuning and is now widely used. I am not going to cover Bayesian Optimization here, as there is already a very good review (Shahriari et al., 2016).

In our system, the selection of a pipeline consists of two parts: selecting the primitive structure (i.e., DAG) of a pipeline (which is defined as a logical pipeline plan), and fine-tuning the hyper-parameters. We model the selection of a logical pipeline plan as a Multi-Armed Bandit problem, and we adopt the idea of a cost model from the DB world to estimate a score for each logical pipeline, where the cost model considers both performance (e.g., accuracy) and speed (e.g., time to train or test) to trade off performance against interactiveness. The cost model also employs meta-learning techniques to improve the estimation by using historical data from similar datasets (e.g., the accuracy and execution time of a pipeline on a similar dataset). For the multi-armed bandit problem, we employ a combination of $$\epsilon$$-greedy and upper confidence bound to select promising logical pipeline plans. After selecting some promising logical pipelines, we use Bayesian Optimization to fine-tune their hyper-parameters.
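A minimal sketch of how $$\epsilon$$-greedy can be layered on top of upper-confidence-bound scores; the statistics, plan indices, and exploration constant here are illustrative, not Alpine Meadow's actual data structures or tuning.

```python
import math
import random

def select_plan(stats, epsilon=0.1):
    """Pick a logical pipeline plan via epsilon-greedy over UCB scores.
    `stats` maps each plan to (sum_of_scores, pulls)."""
    total_pulls = sum(pulls for _, pulls in stats.values()) or 1
    if random.random() < epsilon:            # explore: uniformly random plan
        return random.choice(list(stats))
    def ucb(plan):                           # exploit: highest UCB score
        total, pulls = stats[plan]
        if pulls == 0:
            return float("inf")              # untried plans are selected first
        return total / pulls + math.sqrt(2 * math.log(total_pulls) / pulls)
    return max(stats, key=ucb)

stats = {0: (4.0, 10), 1: (0.9, 1), 2: (0.0, 0)}   # plan 2 never tried
choice = select_plan(stats, epsilon=0.0)            # pure UCB: plan 2 wins
```

The UCB term rewards plans with high average scores but also grants a bonus to rarely tried plans, while the $$\epsilon$$-greedy layer injects the extra randomness mentioned above.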

### 2.3 Evaluation of Pipelines

When we evaluate a candidate pipeline, most of the time we would like to find a pipeline with good predictive power, e.g., high accuracy on the test dataset. However, since we don’t have access to the test dataset at runtime, to get a sense of how the pipeline will perform on the test dataset (i.e., the generalization error), we can test it on a validation dataset. There are usually two ways: one is a holdout validation dataset, usually an 80%-20% split (in other words, we take 80% of the input dataset as the train dataset and the remaining 20% as the validation dataset). The other is cross-validation, usually k-fold cross-validation with k set to 3 or 5.
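The two splitting strategies can be sketched in a few lines of plain Python (libraries like scikit-learn provide production versions of both):

```python
import random

def holdout_split(n, train_frac=0.8, seed=0):
    """80%-20% holdout: return (train_indices, validation_indices)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(n * train_frac)
    return idx[:cut], idx[cut:]

def kfold_splits(n, k=5):
    """k-fold cross-validation: yield (train_indices, validation_indices)
    once per fold; every example appears in exactly one validation fold."""
    for fold in range(k):
        val = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        yield train, val

train, val = holdout_split(100)          # 80 train, 20 validation examples
folds = list(kfold_splits(100, k=5))     # 5 folds of 20 validation examples each
```

Holdout trains each pipeline once; k-fold trains it k times, which is exactly where the accuracy/speed tension discussed next comes from.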

However, both have disadvantages. The holdout approach is fast, but its estimate of generalization power is not accurate, while cross-validation is more accurate but tends to be slow. Based on this observation, and also inspired by HyperBand (Li et al., 2016), we devised an Adaptive Pipeline Selection (APS) method which trades off the speed and accuracy of evaluation. Essentially, adaptive pipeline selection is a resource-efficient way to evaluate pipelines. The basic idea is that we train and test the pipelines on a small sampled dataset, prune those pipelines performing badly on it, then increase the size of the dataset and continue this process. One of the major differences between APS and HyperBand is our pruning condition: we use the train error as a lower bound on the validation error and compare it with the best-so-far validation error; if the train error is bigger, this pipeline cannot be better than the current best, so we can safely prune it.
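A hedged sketch of this pruning loop, with the evaluation callback and sample sizes standing in for the real training runs; function names and the toy errors are illustrative, not Alpine Meadow's actual implementation.

```python
def adaptive_pipeline_selection(pipelines, evaluate, sample_sizes):
    """`evaluate(pipeline, size)` returns (train_error, validation_error)
    measured on a sample of the given size."""
    best_val = float("inf")
    survivors = list(pipelines)
    for size in sample_sizes:                 # e.g., 1%, 10%, 100% of the data
        next_round = []
        for p in survivors:
            train_err, val_err = evaluate(p, size)
            if train_err > best_val:          # train error lower-bounds the
                continue                      # validation error: safe to prune
            best_val = min(best_val, val_err)
            next_round.append(p)
        survivors = next_round
    return survivors, best_val

# Toy usage: fixed (train, validation) errors per pipeline.
errors = {"a": (0.05, 0.10), "b": (0.30, 0.35)}
survivors, best = adaptive_pipeline_selection(
    ["a", "b"], lambda p, s: errors[p], sample_sizes=[0.01, 0.1, 1.0])
```

Here pipeline "b" is pruned in the very first round: its train error (0.30) already exceeds the best validation error seen so far (0.10), so it never consumes budget on larger samples.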

Furthermore, we also investigated optimizing the evaluation from a systems perspective. One observation is that any AutoML system tries out lots of pipelines at the same time, so it is likely that some pipelines share common primitives, which means some computations can be reused. This is the so-called inter-pipeline caching in our system. The other angle is that, with APS, since we train and test a pipeline on increasingly larger, overlapping datasets, there are opportunities to reuse the computation from the previous iteration, which is the so-called intra-pipeline caching in Alpine Meadow.
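A hedged sketch of the inter-pipeline caching idea: two pipelines that share a common prefix of primitives reuse the cached intermediate result instead of recomputing it. Cache keys here are (primitive name, parameters, input key); all names are illustrative, not Alpine Meadow's actual API.

```python
CACHE = {}
CALLS = []   # records which primitives actually ran, for illustration

def transform(name, params, data):
    # Stand-in for a real primitive, e.g., a scaler or an encoder.
    return [x * params.get("factor", 1) for x in data]

def run_step(name, params, data, data_key):
    key = (name, tuple(sorted(params.items())), data_key)
    if key not in CACHE:
        CALLS.append(name)                    # cache miss: compute the step
        CACHE[key] = (transform(name, params, data), key)
    return CACHE[key]

def run_pipeline(steps, data):
    out, key = data, "input"
    for name, params in steps:
        out, key = run_step(name, params, out, key)
    return out

data = [1, 2, 3]
run_pipeline([("scale", {"factor": 2}), ("model_a", {})], data)
run_pipeline([("scale", {"factor": 2}), ("model_b", {})], data)
# "scale" ran only once; the second pipeline reused the cached result.
```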

## 3. Challenges and Opportunities

Learning of Rules. For now, the rules or heuristics in most AutoML systems are mostly hand-crafted, which limits how far the rules can scale and how much they can cover. If we manage to have these rules learned, either over time or from external history (e.g., learning from models on Kaggle or OpenML), performance is expected to improve greatly.

Learned Cost Model. Similar to the learning of rules, cost models can be fine-tuned by machine learning. First, we can obtain better estimates of execution time with learning-based methods; second, the selection or ranking of pipelines can be optimized by learning as well.

Large-Scale Meta-Learning. Meta-learning is getting more and more popular, as it aims to find the common “knowledge” underlying different machine learning tasks. Given a new task, if we are able to reuse prior knowledge from similar tasks we have seen before, the whole search for the optimal pipeline can be warm-started. Therefore, if we can figure out a common language for meta-learning (e.g., a common description of pipelines and runs, including execution time and predictive performance) and scale up the meta-learning data, we have a good chance of finding a good pipeline for a new task in little time.

Better Ensemble. AutoML systems evaluate lots of pipelines by design; therefore, if we can ensemble them into one model, the expected performance (e.g., generalization error) can be further improved. Building better ensembles requires keeping this goal in mind in every aspect of the system; for example, we should encourage diversity when we select pipelines.

Efficient Execution. If our AutoML system is able to evaluate one or two orders of magnitude more pipelines than other systems, then we are very likely to win the game. Caching is one important technique, as mentioned above; other opportunities include using GPUs and more fine-grained pruning strategies in APS.

Interpretability. Considering AutoML is an end-to-end process, it would be awesome if we were able to show our users how the system makes the decision to select a pipeline. Rule-based methods are a sweet spot here, as they are explainable by nature. The cost model provides some intuition as well.

## 4. References

1. Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., & Hutter, F. (2015). Efficient and robust automated machine learning. Advances in Neural Information Processing Systems, 2962–2970.
2. Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., & De Freitas, N. (2016). Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE, 104(1), 148–175.
3. Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., & Talwalkar, A. (2016). Hyperband: A novel bandit-based approach to hyperparameter optimization. ArXiv Preprint ArXiv:1603.06560.
4. Shang, Z., Zgraggen, E., Buratti, B., Kossmann, F., Eichmann, P., Chung, Y., Binnig, C., Upfal, E., & Kraska, T. (2019). Democratizing Data Science through Interactive Curation of ML Pipelines. Proceedings of the 2019 International Conference on Management of Data.
Written on April 5, 2019