Communication-efficient distributed training of deep neural networks: An algorithms and systems perspective

Markov I. 2024. Communication-efficient distributed training of deep neural networks: An algorithms and systems perspective. Institute of Science and Technology Austria.

Download
OA Thesis_final_version_pdfa2.pdf 2.76 MB [Published Version]

Thesis | PhD | Published | English
Series Title
ISTA Thesis
Abstract
Deep learning is essential in numerous applications nowadays, with many recent advancements made possible by training very large models. Despite their broad applicability, training neural networks is often time-intensive, and it is usually impractical to manage large models and datasets on a single machine. To address these issues, distributed deep learning training has become increasingly important. However, distributed training requires synchronization among nodes, and the mini-batch stochastic gradient descent algorithm places a significant load on network connections. A possible solution to this synchronization bottleneck is to reduce message sizes via lossy compression. In this thesis, we investigate systems and algorithmic approaches to communication compression during training. From the systems perspective, we demonstrate that the common approach of expensive hardware overprovisioning can be replaced by careful system design. We introduce a framework that provides efficient software support for compressed communication in machine learning applications, applicable to both multi-GPU single-node training and larger-scale multi-node training. Our framework integrates with popular ML frameworks, providing up to 3x speedups for multi-GPU nodes based on commodity hardware and order-of-magnitude improvements in the multi-node setting, with negligible impact on accuracy. We also apply our framework to other communication schemes, such as Fully Sharded Data Parallel (FSDP), and provide strong convergence guarantees for compression in this setting. Empirical validation shows that our method preserves model accuracy for GPT-family models with up to 1.3 billion parameters while completely removing the communication bottlenecks of non-compressed alternatives, providing up to 2.2x end-to-end speedups. From the algorithmic side, we propose a general framework, LGreCo, that dynamically adjusts the degree of compression across a model's layers during training. This approach improves overall compression and yields significant speedups without compromising accuracy. LGreCo uses an adaptive algorithm that automatically selects compression parameters for each model layer, achieving the best compression ratio while adhering to an error constraint. Our method is effective across all existing families of compression methods, achieving up to 2.5x faster training and up to a 5x improvement in compression compared to efficient implementations of current approaches. Additionally, LGreCo can complement existing adaptive algorithms.
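
To make the compression idea concrete, the sketch below is an illustrative example in plain PyTorch, not the thesis's actual framework, API, or compression scheme: a simple stochastic uniform quantizer for gradients, together with a naive per-layer bit-width search under a relative-error budget, loosely in the spirit of the adaptive layer-wise compression described above. The names quantize, pick_bits, and max_rel_error are hypothetical.

import torch

def quantize(g, bits):
    # Stochastically round g to 2**bits uniform levels; return integer codes and a scale.
    levels = 2 ** bits - 1
    scale = g.abs().max().clamp(min=1e-12)
    x = (g / scale + 1) / 2 * levels          # map [-scale, scale] -> [0, levels]
    lower = x.floor()
    codes = lower + (torch.rand_like(x) < (x - lower)).float()
    return codes.to(torch.uint8), scale

def dequantize(codes, scale, bits):
    levels = 2 ** bits - 1
    return (codes.float() / levels * 2 - 1) * scale

def pick_bits(g, max_rel_error=0.05):
    # Choose the smallest bit-width whose reconstruction error stays within the budget.
    for bits in (2, 3, 4, 6, 8):
        codes, scale = quantize(g, bits)
        err = (g - dequantize(codes, scale, bits)).norm() / g.norm().clamp(min=1e-12)
        if err <= max_rel_error:
            return bits
    return 16  # signal: skip compression for this layer

# Usage sketch: before the all-reduce step, compress each layer's gradient with
# its chosen bit-width, communicate the codes and scale, then decompress.
grads = {"embedding": torch.randn(4096), "mlp": torch.randn(8192)}
plan = {name: pick_bits(g) for name, g in grads.items()}
print(plan)

The point of the per-layer search is that different layers tolerate different amounts of compression error, so a uniform bit-width either wastes bandwidth or hurts accuracy; selecting the level per layer under an error constraint is the basic trade-off the abstract refers to.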
Publishing Year
2024
Date Published
2024-09-04
Page
102
IST-REx-ID
17490

Cite this

Markov I. Communication-efficient distributed training of deep neural networks: An algorithms and systems perspective. 2024. doi:10.15479/at:ista:17490
Markov, I. (2024). Communication-efficient distributed training of deep neural networks: An algorithms and systems perspective. Institute of Science and Technology Austria. https://doi.org/10.15479/at:ista:17490
Markov, Ilia. “Communication-Efficient Distributed Training of Deep Neural Networks: An Algorithms and Systems Perspective.” Institute of Science and Technology Austria, 2024. https://doi.org/10.15479/at:ista:17490.
I. Markov, “Communication-efficient distributed training of deep neural networks: An algorithms and systems perspective,” Institute of Science and Technology Austria, 2024.
Markov, Ilia. Communication-Efficient Distributed Training of Deep Neural Networks: An Algorithms and Systems Perspective. Institute of Science and Technology Austria, 2024, doi:10.15479/at:ista:17490.
All files available under the following license(s):
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
Main File(s)
Access Level
OA Open Access
Date Uploaded
2024-09-04
MD5 Checksum
9e68f7217570f756ceb8f70b980938cd

Source File
File Name
Thesis.zip 43.33 MB
Access Level
Restricted Closed Access
Date Uploaded
2024-09-04
MD5 Checksum
77609f4835d2730e46fa0d42d9134ed9
