Reading Group. Cerebro: A Layered Data Platform for Scalable Deep Learning

In the 58th reading group session, we covered “Cerebro: A Layered Data Platform for Scalable Deep Learning.” This was a short meeting, as our original presenter unfortunately had other important commitments. There are actually two Cerebro papers: one appeared at CIDR and the other at VLDB’20. The main premise behind the system is that ML models are rarely trained in isolation. Instead, we often train multiple models concurrently as we search for the parameters that give the best results. This hyper-parameter tuning is costly, as it relies on an exhaustive search for the best parameter configuration. Also, as the authors claim, most systems do not take advantage of this form of parallelism, in which many models or configurations train on the same data.

A trivial way to deal with running multiple jobs on the same data is to simply run multiple models/configurations as if they are totally independent. Of course, this can lead to data amplification, as we may need to copy the same data to different workers or even systems. Or we can run these models one at a time: as one model/parameter configuration finishes, the next one can start. Cerebro proposes a different approach that allows models to run concurrently, avoids data amplification, and reduces communication while preserving good model convergence. The approach shards the data across workers and allows different models to start on these different shards at the same time. Once a model has processed a worker's entire shard, the whole model (all the weights, etc.) can migrate to a worker holding another shard. The paper claims that this reduces the data-transfer demand in the cluster compared to approaches that must continuously update model weights (for example, those based on parameter servers).
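To make the hopping idea a bit more concrete, here is a minimal sketch of how models could rotate across shards within one epoch. This is my own simplification, not the scheduler from the paper: it assumes a plain round-robin rotation and at most as many models as workers, and the function name `hop_schedule` is hypothetical.

```python
def hop_schedule(num_models: int, num_workers: int) -> list[list[tuple[int, int]]]:
    """Build one epoch as a list of steps; each step is a list of (model, worker) pairs.

    Assumption (not from the paper): a fixed round-robin rotation, where each
    worker holds exactly one data shard and trains one model at a time.
    """
    if num_models > num_workers:
        raise ValueError("this sketch assumes at most one model per worker at a time")
    schedule = []
    for step in range(num_workers):
        # At every step each model moves to the next worker (and so the next shard).
        schedule.append([(m, (m + step) % num_workers) for m in range(num_models)])
    return schedule


epoch = hop_schedule(num_models=3, num_workers=3)
for step, assignments in enumerate(epoch):
    print(step, assignments)
# 0 [(0, 0), (1, 1), (2, 2)]
# 1 [(0, 1), (1, 2), (2, 0)]
# 2 [(0, 2), (1, 0), (2, 1)]
```

After the last step, every model has seen every shard exactly once, and only the model state moved between workers while the data stayed put.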

I am not an ML expert, and my improvised presentation was rather bad, so instead, I will link to a YouTube presentation by the authors, which explains the whole thing much better than I can.

Discussion

1) Still a money problem. Cerebro aims to improve the efficiency of hyper-parameter tuning, but it still seems like a money problem, as we need to throw more resources at the exhaustive search for the best model. Are there better approaches than such an exhaustive search? Some heuristics? Or past experience with similar problems/data that we could take into account?

2) Pruning. The paper(s) seem to train all models/configurations all the way to the end. Again, this is probably wasteful: if we see that some model/configuration does not converge fast enough, it makes sense to kill it early. The system seems to support this and appears to be built with this idea in mind, but the paper is not very explicit about it. In Cerebro, scheduling is done at the per-epoch level, so there is a chance to evaluate each model/parameter configuration after each training epoch and adjust.
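As a rough illustration of what per-epoch pruning could look like, here is a minimal sketch that drops the worse-performing half of the configurations after every epoch. None of this is Cerebro's API: `train_with_pruning`, `keep_fraction`, and the toy loss function are all hypothetical, and the "keep the best fraction" rule is just one simple heuristic I picked for the example.

```python
from typing import Callable


def train_with_pruning(
    configs: list[str],
    train_one_epoch: Callable[[str, int], float],
    num_epochs: int,
    keep_fraction: float = 0.5,
) -> list[str]:
    """Train all configs epoch by epoch, keeping only the best fraction each time.

    train_one_epoch(config, epoch) is a hypothetical callback that trains the
    config for one epoch and returns its validation loss (lower is better).
    """
    alive = list(configs)
    for epoch in range(num_epochs):
        losses = {name: train_one_epoch(name, epoch) for name in alive}
        ranked = sorted(alive, key=lambda name: losses[name])
        keep = max(1, int(len(ranked) * keep_fraction))  # always keep at least one
        alive = ranked[:keep]
    return alive


# Toy stand-in for real training: each config has a fixed base loss that
# shrinks with every epoch (an assumption purely for demonstration).
BASE_LOSS = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0}


def toy_epoch(name: str, epoch: int) -> float:
    return BASE_LOSS[name] / (epoch + 1)


survivors = train_with_pruning(list(BASE_LOSS), toy_epoch, num_epochs=3)
print(survivors)  # ['a']
```

In this toy run, only 7 epoch-level training calls are made (4, then 2, then 1) instead of the 12 needed to train all four configurations for three epochs.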