Synchronous multi-GPU training for deep learning with low-precision communications: An empirical study

Grubic, Demjan; Tam, Leo; Alistarh, Dan-Adrian; Zhang, Ce

Grubic D, Tam L, Alistarh D-A, Zhang C. 2018. Synchronous multi-GPU training for deep learning with low-precision communications: An empirical study. Proceedings of the 21st International Conference on Extending Database Technology. EDBT: Conference on Extending Database Technology, 145–156.

Download

2018_OpenProceedings_Grubic.pdf 1.60 MB [Published Version]

DOI

10.5441/002/EDBT.2018.14

Conference Paper | Published | English

Scopus indexed

Author

Grubic, Demjan; Tam, Leo; Alistarh, Dan-Adrian (ISTA); Zhang, Ce

Corresponding author has ISTA affiliation

Department

Alistarh Group

Abstract

Training deep learning models has received tremendous research interest recently. In particular, there has been intensive research on reducing the communication cost of training when using multiple computational devices, through reducing the precision of the underlying data representation. Naturally, such methods induce system trade-offs: lowering communication precision could decrease communication overheads and improve scalability; but, on the other hand, it can also reduce the accuracy of training. In this paper, we study this trade-off space, and ask: Can low-precision communication consistently improve the end-to-end performance of training modern neural networks, with no accuracy loss? From the performance point of view, the answer to this question may appear deceptively easy: compressing communication through low precision should help when the ratio between communication and computation is high. However, this answer is less straightforward when we try to generalize this principle across various neural network architectures (e.g., AlexNet vs. ResNet), number of GPUs (e.g., 2 vs. 8 GPUs), machine configurations (e.g., EC2 instances vs. NVIDIA DGX-1), communication primitives (e.g., MPI vs. NCCL), and even different GPU architectures (e.g., Kepler vs. Pascal). Currently, it is not clear how a realistic realization of all these factors maps to the speedup provided by low-precision communication. In this paper, we conduct an empirical study to answer this question and report the insights.
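The intuition that low precision helps most when the communication-to-computation ratio is high can be illustrated with a back-of-the-envelope cost model. The sketch below is not taken from the paper: it assumes that computation and communication do not overlap, that communication time scales linearly with the number of bits per value, and that quantization itself costs nothing. The function names and the sample timings are hypothetical.

```python
def step_time(compute_s, comm_s_fp32, bits=32):
    """Per-iteration time under a simplified no-overlap model.

    compute_s    -- seconds of computation per iteration (assumed fixed)
    comm_s_fp32  -- seconds of communication at full 32-bit precision
    bits         -- bits per communicated value (assumption: time scales linearly)
    """
    if not 0 < bits <= 32:
        raise ValueError("bits must be in the range 1..32")
    return compute_s + comm_s_fp32 * bits / 32.0


def speedup(compute_s, comm_s_fp32, bits):
    """Speedup of low-precision communication over the 32-bit baseline."""
    baseline = step_time(compute_s, comm_s_fp32, 32)
    return baseline / step_time(compute_s, comm_s_fp32, bits)


if __name__ == "__main__":
    # Hypothetical timings: one compute-bound and one communication-bound setup.
    for label, compute_s, comm_s in [("compute-bound", 1.0, 0.1),
                                     ("communication-bound", 0.1, 1.0)]:
        print(label, round(speedup(compute_s, comm_s, bits=8), 3))
```

Under these assumptions the speedup is bounded above by 32 / bits and approaches that bound only when communication dominates. Real systems can overlap communication with computation and pay a cost for quantization, which is part of why the abstract argues that the question needs an empirical answer.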

Publishing Year

2018

Date Published

2018-03-26

Proceedings Title

Proceedings of the 21st International Conference on Extending Database Technology

Publisher

OpenProceedings

Page

145-156

Conference

EDBT: Conference on Extending Database Technology

Conference Location

Vienna, Austria

Conference Date

2018-03-26 – 2018-03-29

ISBN

9783893180783

ISSN

2367-2005

IST-REx-ID

7116

Cite this

Grubic D, Tam L, Alistarh D-A, Zhang C. Synchronous multi-GPU training for deep learning with low-precision communications: An empirical study. In: Proceedings of the 21st International Conference on Extending Database Technology. OpenProceedings; 2018:145-156. doi:10.5441/002/EDBT.2018.14

Grubic, D., Tam, L., Alistarh, D.-A., & Zhang, C. (2018). Synchronous multi-GPU training for deep learning with low-precision communications: An empirical study. In Proceedings of the 21st International Conference on Extending Database Technology (pp. 145–156). Vienna, Austria: OpenProceedings. https://doi.org/10.5441/002/EDBT.2018.14

Grubic, Demjan, Leo Tam, Dan-Adrian Alistarh, and Ce Zhang. “Synchronous Multi-GPU Training for Deep Learning with Low-Precision Communications: An Empirical Study.” In Proceedings of the 21st International Conference on Extending Database Technology, 145–56. OpenProceedings, 2018. https://doi.org/10.5441/002/EDBT.2018.14.

D. Grubic, L. Tam, D.-A. Alistarh, and C. Zhang, “Synchronous multi-GPU training for deep learning with low-precision communications: An empirical study,” in Proceedings of the 21st International Conference on Extending Database Technology, Vienna, Austria, 2018, pp. 145–156.

Grubic, Demjan, et al. “Synchronous Multi-GPU Training for Deep Learning with Low-Precision Communications: An Empirical Study.” Proceedings of the 21st International Conference on Extending Database Technology, OpenProceedings, 2018, pp. 145–56, doi:10.5441/002/EDBT.2018.14.

All files available under the following license(s):

Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)