CGX: Adaptive system support for communication-efficient deep learning

Markov, Ilia; Ramezanikebrya, Hamidreza; Alistarh, Dan-Adrian

Markov I, Ramezanikebrya H, Alistarh D-A. 2022. CGX: Adaptive system support for communication-efficient deep learning. Proceedings of the 23rd ACM/IFIP International Middleware Conference. Middleware: International Middleware Conference, 241–254.

Download

2022_ACMMiddleware_Markov.pdf 1.51 MB [Published Version]

DOI

10.1145/3528535.3565248

Conference Paper | Published | English

Scopus indexed

Author

Markov, Ilia^ISTA; Ramezanikebrya, Hamidreza; Alistarh, Dan-Adrian^ISTA

Corresponding author has ISTA affiliation

Department

Alistarh Group

Abstract

The ability to scale out training workloads has been one of the key performance enablers of deep learning. The main scaling approach is data-parallel GPU-based training, which has been boosted by hardware and software support for highly efficient point-to-point communication, and in particular by hardware bandwidth overprovisioning. Overprovisioning comes at a cost: there is an order-of-magnitude price difference between "cloud-grade" servers with such support and their popular "consumer-grade" counterparts, although individual server-grade and consumer-grade GPUs can have similar computational envelopes. In this paper, we show that the costly hardware overprovisioning approach can be supplanted via algorithmic and system design, and propose a framework called CGX, which provides efficient software support for compressed communication in ML applications, for both multi-GPU single-node training and larger-scale multi-node training. CGX is based on two technical advances. At the system level, it relies on a re-developed communication stack for ML frameworks, which provides flexible, highly efficient support for compressed communication. At the application level, it provides seamless, parameter-free integration with popular frameworks, so that end users do not have to modify training recipes or make significant changes to training code. This is complemented by a layer-wise adaptive compression technique which dynamically balances compression gains with accuracy preservation. CGX integrates with popular ML frameworks, providing up to 3X speedups for multi-GPU nodes based on commodity hardware, and order-of-magnitude improvements in the multi-node setting, with negligible impact on accuracy.
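
The abstract does not spell out the compression scheme, so the following is only a minimal sketch of the general idea behind compressed gradient communication with layer-wise bit-widths. It is not the CGX API or its algorithm. The function names (`quantize`, `dequantize`, `choose_bits`), the stochastic uniform quantizer, and the size-based bit-width heuristic are all assumptions introduced for illustration.

```python
import random


def quantize(values, bits, rng):
    """Stochastic uniform quantization of a gradient buffer (illustrative only).

    Maps each float to an integer code in [0, 2**bits - 1]. Rounding up happens
    with probability equal to the fractional part, which keeps the quantizer
    unbiased in expectation.
    """
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    if hi == lo:
        # Constant buffer: every value is represented exactly by `lo`.
        return [0] * len(values), lo, 0.0
    scale = (hi - lo) / levels
    codes = []
    for v in values:
        x = (v - lo) / scale
        floor = int(x)
        code = floor + (1 if rng.random() < x - floor else 0)
        codes.append(min(code, levels))  # guard against float round-off at `hi`
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Reconstruct approximate floats from integer codes."""
    return [lo + c * scale for c in codes]


def choose_bits(layer_sizes, small_layer_threshold=1024, low_bits=4, high_bits=8):
    """Hypothetical layer-wise policy: large layers dominate traffic, so they get
    fewer bits; small layers are cheap to send and keep higher precision."""
    return {
        name: (low_bits if size >= small_layer_threshold else high_bits)
        for name, size in layer_sizes.items()
    }
```

In a data-parallel setting, each worker would send the integer codes together with `lo` and `scale` instead of full-precision gradients, and the receiver would call `dequantize` before averaging. An adaptive scheme of the kind the abstract describes would pick the per-layer bit-width dynamically rather than from a fixed size threshold as done here.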

Publishing Year

2022

Date Published

2022-11-01

Proceedings Title

Proceedings of the 23rd ACM/IFIP International Middleware Conference

Publisher

Association for Computing Machinery

Acknowledgement

The authors sincerely thank Nikoli Dryden, Tal Ben-Nun, Torsten Hoefler and Bapi Chatterjee for useful discussions throughout the development of this project.

Page

241-254

Conference

Middleware: International Middleware Conference

Conference Location

Quebec, QC, Canada

Conference Date

2022-11-07 – 2022-11-11

ISBN

9781450393409

IST-REx-ID

12780

Cite this

Markov I, Ramezanikebrya H, Alistarh D-A. CGX: Adaptive system support for communication-efficient deep learning. In: Proceedings of the 23rd ACM/IFIP International Middleware Conference. Association for Computing Machinery; 2022:241-254. doi:10.1145/3528535.3565248

Markov, I., Ramezanikebrya, H., & Alistarh, D.-A. (2022). CGX: Adaptive system support for communication-efficient deep learning. In Proceedings of the 23rd ACM/IFIP International Middleware Conference (pp. 241–254). Quebec, QC, Canada: Association for Computing Machinery. https://doi.org/10.1145/3528535.3565248

Markov, Ilia, Hamidreza Ramezanikebrya, and Dan-Adrian Alistarh. “CGX: Adaptive System Support for Communication-Efficient Deep Learning.” In Proceedings of the 23rd ACM/IFIP International Middleware Conference, 241–54. Association for Computing Machinery, 2022. https://doi.org/10.1145/3528535.3565248.

I. Markov, H. Ramezanikebrya, and D.-A. Alistarh, “CGX: Adaptive system support for communication-efficient deep learning,” in Proceedings of the 23rd ACM/IFIP International Middleware Conference, Quebec, QC, Canada, 2022, pp. 241–254.

Markov, Ilia, et al. “CGX: Adaptive System Support for Communication-Efficient Deep Learning.” Proceedings of the 23rd ACM/IFIP International Middleware Conference, Association for Computing Machinery, 2022, pp. 241–54, doi:10.1145/3528535.3565248.

All files available under the following license(s):

Creative Commons Attribution 4.0 International Public License (CC-BY 4.0):

https://creativecommons.org/licenses/by/4.0/
https://creativecommons.org/licenses/by/4.0/legalcode

Main File(s)

File Name

2022_ACMMiddleware_Markov.pdf 1.51 MB

Access Level

Open Access

Date Uploaded

2023-04-03

MD5 Checksum

1a397746235f245da5468819247ff663

Material in ISTA:

Dissertation containing ISTA record

Communication-efficient distributed training of deep neural networks : An algorithms and systems perspective
