<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<ListRecords>
<oai_dc:dc xmlns="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
   	<dc:title>Quantized distributed training of large models with convergence guarantees</dc:title>
   	<dc:title>PMLR</dc:title>
   	<dc:creator>Markov, Ilia</dc:creator>
   	<dc:creator>Vladu, Adrian</dc:creator>
   	<dc:creator>Guo, Qi</dc:creator>
   	<dc:creator>Alistarh, Dan-Adrian ; https://orcid.org/0000-0003-3650-940X</dc:creator>
   	<dc:description>Communication-reduction techniques are a popular way to improve scalability in data-parallel training of deep neural networks (DNNs). The recent emergence of large language models such as GPT has created the need for new approaches to exploit data-parallelism. Among these, fully-sharded data parallel (FSDP) training is highly popular, yet it still encounters scalability bottlenecks. One reason is that applying compression techniques to FSDP is challenging: as the vast majority of the communication involves the model’s weights, direct compression alters convergence and leads to accuracy loss. We present QSDP, a variant of FSDP which supports both gradient and weight quantization with theoretical guarantees, is simple to implement and has essentially no overheads. To derive QSDP we prove that a natural modification of SGD achieves convergence even when we only maintain quantized weights, and thus the domain over which we train consists of quantized points and is, therefore, highly non-convex. We validate this approach by training GPT-family models with up to 1.3 billion parameters on a multi-node cluster. Experiments show that QSDP preserves model accuracy, while completely removing the communication bottlenecks of FSDP, providing end-to-end speedups of up to 2.2x.</dc:description>
   	<dc:publisher>ML Research Press</dc:publisher>
   	<dc:date>2023</dc:date>
   	<dc:type>info:eu-repo/semantics/conferenceObject</dc:type>
   	<dc:type>doc-type:conferenceObject</dc:type>
   	<dc:type>text</dc:type>
   	<dc:type>http://purl.org/coar/resource_type/c_5794</dc:type>
   	<dc:identifier>https://research-explorer.ista.ac.at/record/14461</dc:identifier>
   	<dc:source>Markov I, Vladu A, Guo Q, Alistarh D-A. Quantized distributed training of large models with convergence guarantees. In: &lt;i&gt;Proceedings of the 40th International Conference on Machine Learning&lt;/i&gt;. Vol 202. ML Research Press; 2023:24020-24044.</dc:source>
   	<dc:language>eng</dc:language>
   	<dc:relation>info:eu-repo/semantics/altIdentifier/e-issn/2640-3498</dc:relation>
   	<dc:relation>info:eu-repo/semantics/altIdentifier/arxiv/2302.02390</dc:relation>
   	<dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
</oai_dc:dc>
</ListRecords>
</OAI-PMH>
