Predicting a Better Future for Asynchronous Stochastic Gradient Descent with DANA

Ido Hakimi, Ph.D. Seminar Lecture
Tuesday, 30.10.2018, 14:30
Taub 601
Prof. Assaf Schuster

Distributed training can significantly reduce the training time of neural networks. Despite its potential, however, distributed training has not been widely adopted due to the difficulty of scaling the training process. Existing methods suffer from slow convergence and low final accuracy when scaling to large clusters, and often require substantial re-tuning of hyper-parameters.

We propose DANA, a novel approach that scales to large clusters while maintaining state-of-the-art accuracy and convergence speed, without re-tuning hyper-parameters that were optimized for training on a single worker. By adapting Nesterov Accelerated Gradient to a distributed setting, DANA predicts the future position of the model's parameters and thereby mitigates the effect of gradient staleness, one of the main difficulties in asynchronous SGD.
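
To illustrate the look-ahead idea mentioned above, here is a minimal sketch of standard single-worker Nesterov Accelerated Gradient on a toy scalar objective. It is not DANA's distributed algorithm, which this abstract does not spell out; the function names, the objective, and the default learning rate and momentum are assumptions chosen for illustration.

```python
def look_ahead(x, velocity, momentum):
    """Estimate where the parameter will be after the pending momentum step."""
    return x + momentum * velocity


def nag_minimize(grad, x0, lr=0.1, momentum=0.9, steps=200):
    """Run single-worker Nesterov Accelerated Gradient on a scalar parameter.

    Illustrative only: hypothetical helper, not the DANA update rule.
    """
    x, velocity = x0, 0.0
    for _ in range(steps):
        # Evaluate the gradient at the predicted future position, not at x.
        g = grad(look_ahead(x, velocity, momentum))
        velocity = momentum * velocity - lr * g
        x = x + velocity
    return x


if __name__ == "__main__":
    # Toy objective f(x) = 0.5 * x**2: its gradient is x and its minimum is at 0.
    print(nag_minimize(lambda x: x, x0=5.0))
```

The gradient is computed at an estimate of where the parameters will be, not where they currently are. In an asynchronous setting a worker's gradient is applied after other updates have already moved the parameters, which is why a prediction of the future position is relevant to staleness.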
