conference paper
Quantized distributed training of large models with convergence guarantees
PMLR
published
yes
Ilia
Markov
author D0CF4148-C985-11E9-8066-0BDEE5697425
Adrian
Vladu
author
Qi
Guo
author
Dan-Adrian
Alistarh
author 4A899BFC-F248-11E8-B48F-1D18A9856A870000-0003-3650-940X
author 4A899BFC-F248-11E8-B48F-1D18A9856A87 0000-0003-3650-940X
DaAl
department
ICML: International Conference on Machine Learning
Elastic Coordination for Scalable Machine Learning
project
Communication-reduction techniques are a popular way to improve scalability in data-parallel training of deep neural networks (DNNs). The recent emergence of large language models such as GPT has created the need for new approaches to exploit data-parallelism. Among these, fully-sharded data parallel (FSDP) training is highly popular, yet it still encounters scalability bottlenecks. One reason is that applying compression techniques to FSDP is challenging: as the vast majority of the communication involves the model’s weights, direct compression alters convergence and leads to accuracy loss. We present QSDP, a variant of FSDP which supports both gradient and weight quantization with theoretical guarantees, is simple to implement and has essentially no overheads. To derive QSDP we prove that a natural modification of SGD achieves convergence even when we only maintain quantized weights, and thus the domain over which we train consists of quantized points and is, therefore, highly non-convex. We validate this approach by training GPT-family models with up to 1.3 billion parameters on a multi-node cluster. Experiments show that QSDP preserves model accuracy, while completely removing the communication bottlenecks of FSDP, providing end-to-end speedups of up to 2.2x.
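The abstract's central claim is that a natural modification of SGD converges even when the iterates are kept quantized, so that training proceeds over a non-convex grid of quantized points. A minimal toy sketch of this idea (a hypothetical illustration, not the paper's QSDP implementation) uses unbiased stochastic rounding to re-quantize the weights after every gradient step:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_stochastic(w, levels=256):
    """Unbiased stochastic rounding onto a uniform grid spanning w's range."""
    lo, hi = w.min(), w.max()
    if hi == lo:
        return w.copy()
    scale = (hi - lo) / (levels - 1)
    x = (w - lo) / scale
    frac = x - np.floor(x)
    # Round up with probability equal to the fractional part: E[Q(w)] = w.
    rounded = np.floor(x) + (rng.random(w.shape) < frac)
    return lo + rounded * scale

# Toy objective f(w) = 0.5 * ||w - target||^2; the weights are re-quantized
# after every SGD step, so every iterate lies on a quantized grid.
target = rng.normal(size=8)
w = quantize_stochastic(rng.normal(size=8))
for _ in range(500):
    grad = w - target                        # exact gradient of the toy loss
    w = quantize_stochastic(w - 0.1 * grad)  # weights never leave the grid
mse = float(np.mean((w - target) ** 2))
```

Because the rounding is unbiased, the quantization noise averages out and the iterates settle within quantization resolution of the optimum; this is only a one-dimensional-per-coordinate caricature of the convergence argument sketched in the abstract.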
ML Research Press
2023
Honolulu, HI, United States
eng
Proceedings of the 40th International Conference on Machine Learning
2640-3498
2302.02390
202
24020-24044
https://research-explorer.ista.ac.at/record/17490
Markov, I., Vladu, A., Guo, Q., & Alistarh, D.-A. (2023). Quantized distributed training of large models with convergence guarantees. In <i>Proceedings of the 40th International Conference on Machine Learning</i> (Vol. 202, pp. 24020–24044). Honolulu, HI, United States: ML Research Press.
Markov, Ilia, et al. “Quantized Distributed Training of Large Models with Convergence Guarantees.” <i>Proceedings of the 40th International Conference on Machine Learning</i>, vol. 202, ML Research Press, 2023, pp. 24020–44.
Markov I, Vladu A, Guo Q, Alistarh D-A. 2023. Quantized distributed training of large models with convergence guarantees. Proceedings of the 40th International Conference on Machine Learning. ICML: International Conference on Machine Learning, PMLR, vol. 202, 24020–24044.
Markov, Ilia, Adrian Vladu, Qi Guo, and Dan-Adrian Alistarh. “Quantized Distributed Training of Large Models with Convergence Guarantees.” In <i>Proceedings of the 40th International Conference on Machine Learning</i>, 202:24020–44. ML Research Press, 2023.
Markov I, Vladu A, Guo Q, Alistarh D-A. Quantized distributed training of large models with convergence guarantees. In: <i>Proceedings of the 40th International Conference on Machine Learning</i>. Vol 202. ML Research Press; 2023:24020-24044.
I. Markov, A. Vladu, Q. Guo, and D.-A. Alistarh, “Quantized distributed training of large models with convergence guarantees,” in <i>Proceedings of the 40th International Conference on Machine Learning</i>, Honolulu, HI, United States, 2023, vol. 202, pp. 24020–24044.
I. Markov, A. Vladu, Q. Guo, D.-A. Alistarh, in: Proceedings of the 40th International Conference on Machine Learning, ML Research Press, 2023, pp. 24020–24044.
14461
2023-10-29T23:01:17Z
2024-10-09T21:07:10Z