---
res:
  bibo_abstract:
  - Deep learning at scale is dominated by communication time. Distributing samples across nodes usually yields the best performance, but poses scaling challenges due to global information dissemination and load imbalance across uneven sample lengths. State-of-the-art decentralized optimizers mitigate the problem, but require more iterations to achieve the same accuracy as their globally-communicating counterparts. We present Wait-Avoiding Group Model Averaging (WAGMA) SGD, a wait-avoiding stochastic optimizer that reduces global communication via subgroup weight exchange. The key insight is a combination of algorithmic changes to the averaging scheme and the use of a group allreduce operation. We prove the convergence of WAGMA-SGD, and empirically show that it retains convergence rates similar to Allreduce-SGD. For evaluation, we train ResNet-50 on ImageNet; Transformer for machine translation; and deep reinforcement learning for navigation at scale. Compared with state-of-the-art decentralized SGD variants, WAGMA-SGD significantly improves training throughput (e.g., 2.1× on 1,024 GPUs for reinforcement learning), and achieves the fastest time-to-solution (e.g., the highest score using the shortest training time for Transformer).@eng
  bibo_authorlist:
  - foaf_Person:
      foaf_givenName: Shigang
      foaf_name: Li, Shigang
      foaf_surname: Li
  - foaf_Person:
      foaf_givenName: Tal Ben-Nun
      foaf_name: Tal Ben-Nun, Tal Ben-Nun
      foaf_surname: Tal Ben-Nun
  - foaf_Person:
      foaf_givenName: Giorgi
      foaf_name: Nadiradze, Giorgi
      foaf_surname: Nadiradze
      foaf_workInfoHomepage: http://www.librecat.org/personId=3279A00C-F248-11E8-B48F-1D18A9856A87
  - foaf_Person:
      foaf_givenName: Salvatore Di
      foaf_name: Girolamo, Salvatore Di
      foaf_surname: Girolamo
  - foaf_Person:
      foaf_givenName: Nikoli
      foaf_name: Dryden, Nikoli
      foaf_surname: Dryden
  - foaf_Person:
      foaf_givenName: Dan-Adrian
      foaf_name: Alistarh, Dan-Adrian
      foaf_surname: Alistarh
      foaf_workInfoHomepage: http://www.librecat.org/personId=4A899BFC-F248-11E8-B48F-1D18A9856A87
      orcid: 0000-0003-3650-940X
  - foaf_Person:
      foaf_givenName: Torsten
      foaf_name: Hoefler, Torsten
      foaf_surname: Hoefler
  bibo_doi: 10.1109/TPDS.2020.3040606
  bibo_issue: '7'
  bibo_volume: 32
  dct_date: 2021^xs_gYear
  dct_identifier:
  - UT:000621405200019
  dct_isPartOf:
  - http://id.crossref.org/issn/10459219
  dct_language: eng
  dct_publisher: IEEE@
  dct_title: Breaking (global) barriers in parallel stochastic optimization with wait-avoiding group averaging@
...