Alpine Meadow 2.0: New Horizons in AutoML
TL;DR: This post is about something I would really like to do but unfortunately haven’t had time for yet :).
Following up on my last post, I would like to talk about my ideas for future directions in AutoML research. They are mostly based on my personal interests and some “pain points” I encountered while working on AutoML-related things.
As far as I know, most AutoML studies focus on a relatively “closed” problem setting: given a dataset (with or without train/validation splits), a target column and a selection metric (e.g., accuracy), find a pipeline that is as good as possible in terms of the given metric.
However, in the real world, the data keeps changing over time: either the volume increases as more and more data comes in, or the distribution shifts. An “online” AutoML system should therefore be able to take advantage of the new data and adapt its decision making over time. Whether the decision-making process is a model or a set of heuristics, it should learn the trends of incoming data while not forgetting the knowledge learned from previous inputs.
Interpretable and Interactive AutoML
AutoML has always been a black box, especially its decision-making process, which usually involves meta-learning, Bayesian optimization, multi-armed bandits and even deep reinforcement learning. These techniques are mostly hard to interpret, which poses a huge challenge for understanding the behavior of an AutoML system.
However, besides the predictive performance of a pipeline, real-world users also want to understand how the pipelines were selected. What kinds of primitives or pipelines has the system tried? Based on what observations did the system decide to switch to another primitive/pipeline, or to put more resources into the current one?
We are currently using a cost model (with a set of heuristics), and it provides good explainability: you can easily see that a pipeline was selected because of its relative advantages in performance and speed. We are working on making this process more visually interpretable.
AutoML is a process of trial and error, which usually involves training and evaluating many pipelines. Ensemble learning therefore fits naturally with AutoML, since these pipelines can be combined into a more powerful one. Previous works like Auto-Sklearn apply ensembling at the end of the search and build an ensemble of pipelines using a greedy method.
However, simply ensembling all the evaluated pipelines is not efficient, since ensembling itself has overhead. Further, an ensemble exploits the diversity of its models, while during the AutoML search we only favor pipelines with better performance. The search could well end with lots of good pipelines that are similar to each other, and then ensembling would not help much. To better promote ensembling, we want to consider the possible benefit to the ensemble when we select pipelines: if a pipeline might greatly improve the diversity, we should try it, even though it may not have the best individual performance.
In other words, ensembling changes the goal of AutoML from finding the best pipeline to finding the best group of pipelines (to form the best ensemble). We are going to update the cost model in our system to favor exploration, thus encouraging more diversity in the pipeline traces.
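To make the diversity argument concrete, here is a minimal sketch of greedy ensemble selection with replacement, in the spirit of the post-hoc ensembling step mentioned above. It is not Auto-Sklearn's actual code: the function names, the averaging of class-1 probabilities and the toy predictions are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def ensemble_accuracy(members: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Accuracy of the averaged class-1 probabilities of `members` (binary labels)."""
    correct = 0
    for i, label in enumerate(labels):
        mean_prob = sum(m[i] for m in members) / len(members)
        correct += int((mean_prob >= 0.5) == bool(label))
    return correct / len(labels)


def greedy_ensemble(
    predictions: Sequence[Sequence[float]],
    labels: Sequence[int],
    max_size: int = 5,
) -> Tuple[List[int], float]:
    """Repeatedly add the pipeline that most improves validation accuracy.

    Pipelines may be picked more than once; the loop stops as soon as no
    candidate strictly improves the ensemble.
    """
    chosen: List[int] = []
    best_score = -1.0
    for _ in range(max_size):
        candidates = []
        for idx in range(len(predictions)):
            members = [predictions[j] for j in chosen + [idx]]
            candidates.append((ensemble_accuracy(members, labels), idx))
        # Highest score wins; ties go to the lowest pipeline index.
        score, idx = max(candidates, key=lambda c: (c[0], -c[1]))
        if score <= best_score:
            break
        chosen.append(idx)
        best_score = score
    return chosen, best_score


# Hypothetical validation predictions (probability of class 1) of three pipelines.
labels = [1, 1, 0, 0, 1, 0]
pipeline_a = [0.9, 0.9, 0.1, 0.1, 0.2, 0.8]        # wrong on the last two rows
pipeline_a_clone = [0.8, 0.8, 0.2, 0.2, 0.3, 0.7]  # same mistakes as pipeline_a
pipeline_b = [0.9, 0.2, 0.1, 0.8, 0.9, 0.1]        # wrong on two different rows

chosen, score = greedy_ensemble([pipeline_a, pipeline_a_clone, pipeline_b], labels)
print(chosen, score)
```

All three toy pipelines have the same individual accuracy, but only the one that makes different mistakes improves the ensemble. A search that only ranks pipelines by individual score has no reason to prefer it over the near-duplicate.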
Efficient evaluation of pipelines has been neglected by many previous works. Most systems simply do cross-validation at this step, which can be impractical or sub-optimal for big datasets. At the same time, classical learning methods (e.g., SVM) have relatively small capacity and don’t require many data points. This implies that we can estimate the predictive performance of a model after training it on only a small subset.
In our SIGMOD 2019 paper, we proposed the Adaptive Pipeline Selection algorithm, which trains models over increasingly larger samples and evaluates them on both the samples (on which they are trained) and the validation dataset. We further used the training error as a lower bound for the validation error. Therefore, if the training error already exceeds the current best validation error, the model is very unlikely to outperform the current best model, and we can simply prune it to save computational resources. Another advantage of this algorithm is training efficiency: since we train over increasingly larger samples, for learning algorithms that support incremental training we only need to train on the newly added data points.
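The pruning rule described above can be sketched roughly as follows. This is a simplified illustration, not the implementation from the paper: the two toy "pipelines", the `adaptive_selection` function and the synthetic data are hypothetical, each model is retrained from scratch on every sample (no incremental training), and the best validation error is only updated once a pipeline has seen the full sample schedule.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Row = Tuple[float, int]  # (feature value, binary label)


class MajorityModel:
    """Toy pipeline: always predicts the majority class of its training sample."""

    def fit(self, rows: Sequence[Row]) -> None:
        ones = sum(label for _, label in rows)
        self.prediction = int(2 * ones >= len(rows))

    def predict(self, x: float) -> int:
        return self.prediction


class ThresholdModel:
    """Toy pipeline: thresholds the feature halfway between the two class means."""

    def fit(self, rows: Sequence[Row]) -> None:
        zeros = [x for x, y in rows if y == 0]
        ones = [x for x, y in rows if y == 1]
        # Assumes both classes occur in the sample.
        self.threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2

    def predict(self, x: float) -> int:
        return int(x > self.threshold)


def error_rate(model, rows: Sequence[Row]) -> float:
    return sum(model.predict(x) != y for x, y in rows) / len(rows)


def adaptive_selection(
    pipelines: Sequence[Tuple[str, Callable[[], object]]],
    train: Sequence[Row],
    valid: Sequence[Row],
    sample_sizes: Sequence[int],
) -> Tuple[Optional[str], float, Dict[str, List[int]]]:
    best_name, best_val = None, float("inf")
    trained_sizes: Dict[str, List[int]] = {}
    for name, make_model in pipelines:
        trained_sizes[name] = []
        val_err, pruned = float("inf"), False
        for size in sample_sizes:
            sample = train[:size]
            model = make_model()
            model.fit(sample)
            trained_sizes[name].append(size)
            # Training error is treated as an optimistic bound on validation
            # error: if it is already worse than the best pipeline, stop early.
            if error_rate(model, sample) > best_val:
                pruned = True
                break
            val_err = error_rate(model, valid)
        if not pruned and val_err < best_val:
            best_name, best_val = name, val_err
    return best_name, best_val, trained_sizes


# Synthetic, pre-shuffled data: the label is 1 when the feature is >= 0.5.
train = [(((i * 37) % 200) / 200, int((i * 37) % 200 >= 100)) for i in range(200)]
valid = [((k + 0.5) / 100, int(k >= 50)) for k in range(100)]

best, best_err, sizes = adaptive_selection(
    [("threshold", ThresholdModel), ("majority", MajorityModel)],
    train, valid, sample_sizes=[25, 50, 100, 200],
)
print(best, best_err, sizes)
```

In this toy run the weak pipeline is dropped after the smallest sample, so it never pays for the larger ones. Note that the order of evaluation matters: nothing can be pruned before some pipeline has produced a validation error.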
There are a bunch of potential improvements here. Instead of using the training error to estimate the validation error, we can predict the learning curve and extrapolate the validation error from it. We can also adopt a “soft” pruning method: instead of killing a pipeline immediately, we can starve it by allocating fewer resources to it, which avoids killing a good pipeline at an early stage.
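Both ideas can be sketched in a few lines. The curve family err(n) = a + b / sqrt(n), the least-squares fit, the softmax-style budget split and the `temperature` parameter are all assumptions chosen for illustration; a real system would probably need a richer curve model and some uncertainty estimate.

```python
import math
from typing import Dict, Sequence, Tuple


def fit_learning_curve(sizes: Sequence[int], errors: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of err(n) = a + b / sqrt(n).

    Needs at least two distinct sample sizes.
    """
    xs = [n ** -0.5 for n in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(errors) / len(errors)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, errors))
    b = cov_xy / var_x
    a = mean_y - b * mean_x
    return a, b


def predict_error(a: float, b: float, n: int) -> float:
    """Extrapolate the fitted curve to a sample of size n."""
    return a + b * n ** -0.5


def budget_shares(predicted: Dict[str, float], temperature: float = 0.05) -> Dict[str, float]:
    """Soft pruning: every pipeline keeps a share of the budget, but the share
    shrinks exponentially with its predicted gap to the best pipeline."""
    best = min(predicted.values())
    weights = {name: math.exp(-(err - best) / temperature) for name, err in predicted.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Validation errors observed on small samples (made-up numbers).
a, b = fit_learning_curve([100, 400, 1600], [0.2, 0.15, 0.125])
full_data_error = predict_error(a, b, 10_000)

shares = budget_shares({"pipeline_1": full_data_error, "pipeline_2": 0.21})
print(round(full_data_error, 3), shares)
```

With soft pruning a pipeline that looks bad early never reaches a share of exactly zero, so it can still recover if its later measurements contradict the early extrapolation.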
Execution is another important topic that has rarely been discussed in previous works. At the end of the day, if we were able to execute pipelines 1,000 times faster than other systems, we probably wouldn’t need a fancy pipeline selection algorithm: we could just do grid search and it would be good enough. Therefore we want to build a specialized system for executing AutoML workloads.
One thing we noticed, which is probably specific to AutoML, is that lots of pipelines share similar structures (e.g., the same primitives), and this means caching the intermediate outputs of primitives is probably a good idea. Assume that you are working on an image classification problem and you want to use a pre-trained neural network (maybe trained on ImageNet) to extract high-level features, followed by a simple predictive model (e.g., SVM, logistic regression). If you cache the outputs of the neural network, you can save lots of time, since neural network inference is slow. In other words, with caching you can try many more models and hyper-parameter settings on this problem than you could without it.
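A minimal sketch of this kind of caching, assuming pipelines are linear chains of named primitives and that every pipeline runs on the same fixed input (so the cache key can be just the chain of primitive names). The `CachingExecutor` class and the stand-in primitives are hypothetical.

```python
from typing import Any, Callable, Dict, Tuple


class CachingExecutor:
    """Runs linear pipelines and reuses the output of any shared prefix."""

    def __init__(self, primitives: Dict[str, Callable[[Any], Any]]):
        self.primitives = primitives
        self.cache: Dict[Tuple[str, ...], Any] = {}
        self.executions = 0  # how many primitives were actually run

    def run(self, pipeline: Tuple[str, ...], data: Any) -> Any:
        # Start from the longest prefix whose output is already cached.
        start, value = 0, data
        for end in range(len(pipeline), 0, -1):
            if pipeline[:end] in self.cache:
                start, value = end, self.cache[pipeline[:end]]
                break
        for i in range(start, len(pipeline)):
            value = self.primitives[pipeline[i]](value)
            self.executions += 1
            self.cache[pipeline[: i + 1]] = value
        return value


# Stand-ins: "embed" plays the expensive feature extractor, and the other two
# play cheap downstream models.
executor = CachingExecutor({
    "embed": lambda xs: [2 * x for x in xs],
    "svm": sum,
    "logreg": max,
})
first = executor.run(("embed", "svm"), [1, 2, 3])
second = executor.run(("embed", "logreg"), [1, 2, 3])
print(first, second, executor.executions)
```

The second pipeline only executes its last primitive, because the output of the shared "embed" step is served from the cache. A real system would also need to include the input data and the hyper-parameters of each primitive in the cache key.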
You can even employ more fine-grained caching. For example, if you have already run the pre-trained neural network on the first 100 data points and now want the high-level features for the first 200 data points, you can execute the neural network over the difference only and use the cached output for the overlap.
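The prefix case from this example can be sketched as follows, assuming the dataset order never changes between requests. `PrefixFeatureCache` and the fake extractor are hypothetical names used only for illustration.

```python
from typing import Callable, List, Sequence


class PrefixFeatureCache:
    """Caches features for a growing prefix of one fixed, ordered dataset."""

    def __init__(self, extract: Callable[[Sequence[float]], List[float]]):
        self._extract = extract
        self._features: List[float] = []

    def features_for_prefix(self, data: Sequence[float], n: int) -> List[float]:
        if n > len(data):
            raise ValueError("requested prefix is longer than the dataset")
        # Only the rows that have not been seen yet go through the extractor.
        missing = data[len(self._features):n]
        if len(missing) > 0:
            self._features.extend(self._extract(missing))
        return self._features[:n]


rows_processed = 0


def fake_network(batch: Sequence[float]) -> List[float]:
    """Stand-in for expensive neural network inference; counts processed rows."""
    global rows_processed
    rows_processed += len(batch)
    return [x * x for x in batch]


data = [float(i) for i in range(1000)]
cache = PrefixFeatureCache(fake_network)

cache.features_for_prefix(data, 100)   # runs the extractor on rows 0..99
cache.features_for_prefix(data, 200)   # runs it on rows 100..199 only
features = cache.features_for_prefix(data, 150)  # served entirely from the cache
print(rows_processed, len(features))
```

This pairs naturally with training over increasingly larger samples, since each new sample is a superset of the previous one.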
We are currently building an awesome system designed for general data analytics workloads (including AutoML, OLAP and many others), and I will probably write a post about it later.
To sum up, there are new needs for AutoML, and we should do an end-to-end redesign of the AutoML system, from the interface, through the decision process, down to the execution. There are millions of opportunities in this emerging area, and I hope to see more publications on these interesting topics.