{"author":[{"full_name":"Markov, Ilia","last_name":"Markov","first_name":"Ilia","id":"D0CF4148-C985-11E9-8066-0BDEE5697425"}],"file_date_updated":"2024-09-04T08:36:06Z","publication_identifier":{"issn":["2663-337X"]},"_id":"17490","date_created":"2024-09-04T08:51:11Z","oa":1,"supervisor":[{"first_name":"Dan-Adrian","full_name":"Alistarh, Dan-Adrian","last_name":"Alistarh","orcid":"0000-0003-3650-940X","id":"4A899BFC-F248-11E8-B48F-1D18A9856A87"}],"title":"Communication-efficient distributed training of deep neural networks: An algorithms and systems perspective","month":"09","user_id":"8b945eb4-e2f2-11eb-945a-df72226e66a9","publication_status":"published","alternative_title":["ISTA Thesis"],"department":[{"_id":"GradSch"},{"_id":"DaAl"}],"doi":"10.15479/at:ista:17490","page":"102","article_processing_charge":"No","ddc":["000"],"degree_awarded":"PhD","year":"2024","project":[{"call_identifier":"H2020","_id":"268A44D6-B435-11E9-9278-68D0E5697425","grant_number":"805223","name":"Elastic Coordination for Scalable Machine Learning"}],"oa_version":"Published Version","date_published":"2024-09-04T00:00:00Z","tmp":{"name":"Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)","legal_code_url":"https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode","short":"CC BY-NC-SA (4.0)","image":"/images/cc_by_nc_sa.png"},"file":[{"creator":"imarkov","access_level":"closed","date_updated":"2024-09-04T08:35:35Z","relation":"source_file","file_size":43327753,"date_created":"2024-09-04T08:35:35Z","file_id":"17491","checksum":"77609f4835d2730e46fa0d42d9134ed9","file_name":"Thesis.zip","content_type":"application/x-zip-compressed"},{"date_updated":"2024-09-04T08:36:06Z","creator":"imarkov","success":1,"access_level":"open_access","checksum":"9e68f7217570f756ceb8f70b980938cd","file_name":"Thesis_final_version_pdfa2.pdf","content_type":"application/pdf","relation":"main_file","file_size":2756082,"date_created":"2024-09-04T08:36:06Z","file_id":"17492"}],"abstract":[{"lang":"eng","text":"Deep learning is essential in numerous applications nowadays, with many recent advancements made possible by training very large models. Despite their broad applicability, training neural networks is often time-intensive, and it is usually impractical to manage large models and datasets on a single machine. To address these issues, distributed deep learning training has become increasingly important. However, distributed training requires synchronization among nodes, and the mini-batch stochastic gradient descent algorithm places a significant load on network connections. A possible solution to tackle the synchronization bottleneck is to reduce a message size by lossy compression.\r\n\r\nIn this thesis, we investigate systems and algorithmic approaches to communication compression during training. From the systems perspective, we demonstrate that a common approach of expensive hardware overprovisioning can be replaced through a thorough system design. We introduce a framework that introduces efficient software support for compressed communication in machine learning applications, applicable to both multi-GPU single-node training and larger-scale multi-node training. 
Our framework integrates with popular ML frameworks, providing up to 3x speedups for multi-GPU nodes based on commodity hardware and order-of-magnitude improvements in the multi-node setting, with negligible impact on accuracy.\r\n\r\nWe also apply our framework to other communication schemes, such as Fully Sharded Data Parallel (FSDP). We provide strong convergence guarantees for compression in this setting. Empirical validation shows that our method preserves model accuracy for GPT-family models with up to 1.3 billion parameters, while completely removing the communication bottlenecks of uncompressed alternatives and providing up to 2.2x end-to-end speedups.\r\n\r\nFrom the algorithmic side, we propose a general framework, LGreCo, that dynamically adjusts the degree of compression across a model's layers during training. This approach enhances overall compression and results in significant speedups without compromising accuracy. LGreCo uses an adaptive algorithm that automatically selects the optimal compression parameters for each model layer, ensuring the best compression ratio while adhering to an error constraint. Our method is effective across all existing families of compression methods. It achieves up to 2.5x faster training and up to a 5x improvement in compression compared to efficient implementations of current approaches. Additionally, LGreCo can complement existing adaptive algorithms.\r\n"}],"citation":{"apa":"Markov, I. (2024). Communication-efficient distributed training of deep neural networks: An algorithms and systems perspective. Institute of Science and Technology Austria. https://doi.org/10.15479/at:ista:17490","chicago":"Markov, Ilia. “Communication-Efficient Distributed Training of Deep Neural Networks: An Algorithms and Systems Perspective.” Institute of Science and Technology Austria, 2024. https://doi.org/10.15479/at:ista:17490.","ieee":"I. Markov, “Communication-efficient distributed training of deep neural networks: An algorithms and systems perspective,” Institute of Science and Technology Austria, 2024.","ista":"Markov I. 2024. Communication-efficient distributed training of deep neural networks: An algorithms and systems perspective. Institute of Science and Technology Austria.","ama":"Markov I. Communication-efficient distributed training of deep neural networks: An algorithms and systems perspective. 2024. doi:10.15479/at:ista:17490","mla":"Markov, Ilia. Communication-Efficient Distributed Training of Deep Neural Networks: An Algorithms and Systems Perspective. Institute of Science and Technology Austria, 2024, doi:10.15479/at:ista:17490.","short":"I. Markov, Communication-Efficient Distributed Training of Deep Neural Networks: An Algorithms and Systems Perspective, Institute of Science and Technology Austria, 2024."},"license":"https://creativecommons.org/licenses/by-nc-sa/4.0/","language":[{"iso":"eng"}],"type":"dissertation","acknowledged_ssus":[{"_id":"ScienComp"}],"ec_funded":1,"has_accepted_license":"1","publisher":"Institute of Science and Technology Austria","corr_author":"1","date_updated":"2024-10-21T06:01:53Z","related_material":{"record":[{"relation":"part_of_dissertation","status":"public","id":"17456"},{"status":"public","relation":"part_of_dissertation","id":"14461"},{"relation":"part_of_dissertation","status":"public","id":"12780"}]},"day":"04","status":"public"}