---
OA_place: publisher
OA_type: diamond
_id: '20037'
abstract:
- lang: eng
  text: 'Disentangling polysemantic neurons is at the core of many current approaches
    to interpretability of large language models. Here we attempt to study how disentanglement
    can be used to understand performance, particularly under weight sparsity, a leading
    post-training optimization technique. We suggest a novel measure for estimating
    neuronal entanglement: the Wasserstein distance of a neuron''s output distribution
    to a Gaussian. Moreover, we show the existence of a small number of highly entangled
    "Wasserstein Neurons" in each linear layer of an LLM, characterized by their highly
    non-Gaussian output distributions, their role in mapping similar inputs to dissimilar
    outputs, and their significant impact on model accuracy. To study these phenomena,
    we propose a new experimental framework for disentangling polysemantic neurons.
    Our framework separates each layer''s inputs to create a mixture of experts where
    each neuron''s output is computed by a mixture of neurons of lower Wasserstein
    distance, each better at maintaining accuracy when sparsified without retraining.
    We provide strong evidence that this is because the mixture of sparse experts
    is effectively disentangling the input-output relationship of individual neurons,
    in particular the difficult Wasserstein neurons.'
acknowledgement: "The authors would like to extend their gratitude to Lori Leu for
  her insightful comments on the\r\napplication of the Wasserstein distance metric.
  We also wish to thank Elias Frantar for his help in\r\nworking with the SparseGPT
  implementation and his advice for the project. Additionally, we would like to thank
  Tony Tong Wang and Thomas Athey for their valuable feedback and constructive discussions.\r\nThis
  work was supported by an NIH Brains CONNECTS U01 grant and AMD’s AI & HPC Fund."
article_processing_charge: No
arxiv: 1
author:
- first_name: Shashata
  full_name: Sawmya, Shashata
  last_name: Sawmya
- first_name: Linghao
  full_name: Kong, Linghao
  last_name: Kong
- first_name: Ilia
  full_name: Markov, Ilia
  id: D0CF4148-C985-11E9-8066-0BDEE5697425
  last_name: Markov
- first_name: Dan-Adrian
  full_name: Alistarh, Dan-Adrian
  id: 4A899BFC-F248-11E8-B48F-1D18A9856A87
  last_name: Alistarh
  orcid: 0000-0003-3650-940X
- first_name: Nir
  full_name: Shavit, Nir
  last_name: Shavit
citation:
  ama: 'Sawmya S, Kong L, Markov I, Alistarh D-A, Shavit N. Wasserstein distances,
    neuronal entanglement, and sparsity. In: <i>13th International Conference on Learning
    Representations</i>. ICLR; 2025:26244-26274.'
  apa: 'Sawmya, S., Kong, L., Markov, I., Alistarh, D.-A., &#38; Shavit, N. (2025).
    Wasserstein distances, neuronal entanglement, and sparsity. In <i>13th International
    Conference on Learning Representations</i> (pp. 26244–26274). Singapore, Singapore:
    ICLR.'
  chicago: Sawmya, Shashata, Linghao Kong, Ilia Markov, Dan-Adrian Alistarh, and Nir
    Shavit. “Wasserstein Distances, Neuronal Entanglement, and Sparsity.” In <i>13th
    International Conference on Learning Representations</i>, 26244–74. ICLR, 2025.
  ieee: S. Sawmya, L. Kong, I. Markov, D.-A. Alistarh, and N. Shavit, “Wasserstein
    distances, neuronal entanglement, and sparsity,” in <i>13th International Conference
    on Learning Representations</i>, Singapore, Singapore, 2025, pp. 26244–26274.
  ista: 'Sawmya S, Kong L, Markov I, Alistarh D-A, Shavit N. 2025. Wasserstein distances,
    neuronal entanglement, and sparsity. 13th International Conference on Learning
    Representations. ICLR: International Conference on Learning Representations, 26244–26274.'
  mla: Sawmya, Shashata, et al. “Wasserstein Distances, Neuronal Entanglement, and
    Sparsity.” <i>13th International Conference on Learning Representations</i>, ICLR,
    2025, pp. 26244–74.
  short: S. Sawmya, L. Kong, I. Markov, D.-A. Alistarh, N. Shavit, in:, 13th International
    Conference on Learning Representations, ICLR, 2025, pp. 26244–26274.
conference:
  end_date: 2025-04-28
  location: Singapore, Singapore
  name: 'ICLR: International Conference on Learning Representations'
  start_date: 2025-04-24
corr_author: '1'
date_created: 2025-07-20T22:02:03Z
date_published: 2025-04-01T00:00:00Z
date_updated: 2025-08-04T08:16:43Z
day: '01'
ddc:
- '000'
department:
- _id: DaAl
external_id:
  arxiv:
  - '2405.15756'
file:
- access_level: open_access
  checksum: 39a8fa7dbdd7029859e156f53f20f6bc
  content_type: application/pdf
  creator: dernst
  date_created: 2025-08-04T08:14:09Z
  date_updated: 2025-08-04T08:14:09Z
  file_id: '20110'
  file_name: 2025_ICLR_Sawmya.pdf
  file_size: 5447177
  relation: main_file
  success: 1
file_date_updated: 2025-08-04T08:14:09Z
has_accepted_license: '1'
language:
- iso: eng
license: https://creativecommons.org/licenses/by/4.0/
month: '04'
oa: 1
oa_version: Published Version
page: 26244-26274
publication: 13th International Conference on Learning Representations
publication_identifier:
  isbn:
  - '9798331320850'
publication_status: published
publisher: ICLR
quality_controlled: '1'
related_material:
  link:
  - relation: software
    url: https://github.com/Shavit-Lab/Sparse-Expansion
scopus_import: '1'
status: public
title: Wasserstein distances, neuronal entanglement, and sparsity
tmp:
  image: /images/cc_by.png
  legal_code_url: https://creativecommons.org/licenses/by/4.0/legalcode
  name: Creative Commons Attribution 4.0 International Public License (CC-BY 4.0)
  short: CC BY (4.0)
type: conference
user_id: 2DF688A6-F248-11E8-B48F-1D18A9856A87
year: '2025'
...
---
OA_place: publisher
OA_type: gold
_id: '20821'
abstract:
- lang: eng
  text: Modern deep neural networks exhibit heterogeneity across numerous layers of
    various types, such as residuals and multi-head attention, due to varying structures
    (dimensions, activation functions, etc.) and distinct representation characteristics,
    which impact predictions. We develop a general layer-wise quantization framework
    with tight variance and code-length bounds, adapting to the heterogeneities over
    the course of training. We then apply a new layer-wise quantization technique
    within distributed variational inequalities (VIs), proposing a novel Quantized
    Optimistic Dual Averaging (QODA) algorithm with adaptive learning rates, which
    achieves competitive convergence rates for monotone VIs. We empirically show that
    QODA achieves up to a 150% speedup over the baselines in end-to-end training time
    for training Wasserstein GAN on 12+ GPUs.
acknowledgement: "This work was supported by Hasler Foundation Program: Hasler Responsible
  AI (project number 21043). The research was also sponsored by the Army Research
  Office and was accomplished under Grant Number W911NF-24-1-0048. This work was further
  funded by the Swiss National Science Foundation (SNSF) under grant number 200021_205011.
  We also acknowledge project A11 of the Swiss National Supercomputing Centre (CSCS)
  for providing computing resources. Dan Alistarh and Ilia Markov were supported in
  part through the ERC Proofof-Concept grant FastML (Grant Agreement 101158077). Ali
  Ramezani-Kebrya was supported by the Research Council of Norway through FRIPRO Grant
  under project number 356103, its Centres of Excellence scheme, Integreat - Norwegian
  Centre for knowledge-driven machine learning under\r\nproject number 332645 - and
  its Centre for Research-based Innovation funding scheme (Visual Intelligence under
  grant no. 309439)."
alternative_title:
- PMLR
article_processing_charge: No
arxiv: 1
author:
- first_name: Anh Duc
  full_name: Nguyen, Anh Duc
  last_name: Nguyen
- first_name: Ilia
  full_name: Markov, Ilia
  id: D0CF4148-C985-11E9-8066-0BDEE5697425
  last_name: Markov
- first_name: Frank Zhengqing
  full_name: Wu, Frank Zhengqing
  last_name: Wu
- first_name: Ali
  full_name: Ramezani-Kebrya, Ali
  last_name: Ramezani-Kebrya
- first_name: Kimon
  full_name: Antonakopoulos, Kimon
  last_name: Antonakopoulos
- first_name: Dan-Adrian
  full_name: Alistarh, Dan-Adrian
  id: 4A899BFC-F248-11E8-B48F-1D18A9856A87
  last_name: Alistarh
  orcid: 0000-0003-3650-940X
- first_name: Volkan
  full_name: Cevher, Volkan
  last_name: Cevher
citation:
  ama: 'Nguyen AD, Markov I, Wu FZ, et al. Layer-wise quantization for quantized optimistic
    dual averaging. In: <i>42nd International Conference on Machine Learning</i>.
    Vol 267. ML Research Press; 2025:46026-46072.'
  apa: 'Nguyen, A. D., Markov, I., Wu, F. Z., Ramezani-Kebrya, A., Antonakopoulos,
    K., Alistarh, D.-A., &#38; Cevher, V. (2025). Layer-wise quantization for quantized
    optimistic dual averaging. In <i>42nd International Conference on Machine Learning</i>
    (Vol. 267, pp. 46026–46072). Vancouver, Canada: ML Research Press.'
  chicago: Nguyen, Anh Duc, Ilia Markov, Frank Zhengqing Wu, Ali Ramezani-Kebrya,
    Kimon Antonakopoulos, Dan-Adrian Alistarh, and Volkan Cevher. “Layer-Wise Quantization
    for Quantized Optimistic Dual Averaging.” In <i>42nd International Conference
    on Machine Learning</i>, 267:46026–72. ML Research Press, 2025.
  ieee: A. D. Nguyen <i>et al.</i>, “Layer-wise quantization for quantized optimistic
    dual averaging,” in <i>42nd International Conference on Machine Learning</i>,
    Vancouver, Canada, 2025, vol. 267, pp. 46026–46072.
  ista: 'Nguyen AD, Markov I, Wu FZ, Ramezani-Kebrya A, Antonakopoulos K, Alistarh
    D-A, Cevher V. 2025. Layer-wise quantization for quantized optimistic dual averaging.
    42nd International Conference on Machine Learning. ICML: International Conference
    on Machine Learning, PMLR, vol. 267, 46026–46072.'
  mla: Nguyen, Anh Duc, et al. “Layer-Wise Quantization for Quantized Optimistic Dual
    Averaging.” <i>42nd International Conference on Machine Learning</i>, vol. 267,
    ML Research Press, 2025, pp. 46026–72.
  short: A.D. Nguyen, I. Markov, F.Z. Wu, A. Ramezani-Kebrya, K. Antonakopoulos, D.-A.
    Alistarh, V. Cevher, in:, 42nd International Conference on Machine Learning, ML
    Research Press, 2025, pp. 46026–46072.
conference:
  end_date: 2025-07-19
  location: Vancouver, Canada
  name: 'ICML: International Conference on Machine Learning'
  start_date: 2025-07-13
date_created: 2025-12-14T23:02:06Z
date_published: 2025-05-01T00:00:00Z
date_updated: 2025-12-16T12:46:54Z
day: '01'
ddc:
- '000'
department:
- _id: DaAl
external_id:
  arxiv:
  - '2505.14371'
file:
- access_level: open_access
  checksum: a7edf0e4304171a3e035842b3aab1704
  content_type: application/pdf
  creator: dernst
  date_created: 2025-12-16T12:45:41Z
  date_updated: 2025-12-16T12:45:41Z
  file_id: '20830'
  file_name: 2025_ICML_Nguyen.pdf
  file_size: 756213
  relation: main_file
  success: 1
file_date_updated: 2025-12-16T12:45:41Z
has_accepted_license: '1'
intvolume: '267'
language:
- iso: eng
month: '05'
oa: 1
oa_version: Published Version
page: 46026-46072
project:
- _id: 8e35c14b-16d5-11f0-9cad-a3fc35339161
  grant_number: '101158077'
  name: 'FastML: Efficient and Cost-Effective Distributed Machine Learning'
publication: 42nd International Conference on Machine Learning
publication_identifier:
  eissn:
  - 2640-3498
publication_status: published
publisher: ML Research Press
quality_controlled: '1'
scopus_import: '1'
status: public
title: Layer-wise quantization for quantized optimistic dual averaging
tmp:
  image: /images/cc_by.png
  legal_code_url: https://creativecommons.org/licenses/by/4.0/legalcode
  name: Creative Commons Attribution 4.0 International Public License (CC-BY 4.0)
  short: CC BY (4.0)
type: conference
user_id: 2DF688A6-F248-11E8-B48F-1D18A9856A87
volume: 267
year: '2025'
...
---
OA_place: publisher
_id: '17490'
abstract:
- lang: eng
  text: "Deep learning is essential in numerous applications nowadays, with many recent
    advancements made possible by training very large models. Despite their broad
    applicability, training neural networks is often time-intensive, and it is usually
    impractical to manage large models and datasets on a single machine. To address
    these issues, distributed deep learning training has become increasingly important.
    However, distributed training requires synchronization among nodes, and the mini-batch
    stochastic gradient descent algorithm places a significant load on network connections.
    A possible solution to tackle the synchronization bottleneck is to reduce a message
    size by lossy compression.\r\n\r\nIn this thesis, we investigate systems and algorithmic
    approaches to communication compression during training. From the systems perspective,
    we demonstrate that a common approach of expensive hardware overprovisioning can
    be replaced through a thorough system design. We introduce a framework that introduces
    efficient software support for compressed communication in machine learning applications,
    applicable to both multi-GPU single-node training and larger-scale multi-node
    training. Our framework integrates with popular ML frameworks, providing up to
    3x speedups for multi-GPU nodes based on commodity hardware and order-of-magnitude
    improvements in the multi-node setting, with negligible impact on accuracy.\r\n\r\nAlso,
    we consider an application of our framework to different communication schemes,
    such as Fully Sharded Data Parallel. We provide strong convergence guarantees
    for the compression in such a setup. Empirical validation shows that our method
    preserves model accuracy for GPT-family models with up to 1.3 billion parameters,
    while completely removing the communication bottlenecks of non-compressed alternatives,
    providing up to 2.2x speedups end-to-end.\r\n\r\nFrom the algorithmic side, we
    propose a general framework that dynamically adjusts the degree of compression
    across a model's layers during training. This approach enhances overall compression
    and results in significant speedups without compromising accuracy. Our algorithm
    utilizes an adaptive algorithm that automatically selects the optimal compression
    parameters for model layers, ensuring the best compression ratio while adhering
    to an error constraint. Our method is effective across all existing families of
    compression methods. It achieves up to 2.5x faster training and up to a 5x improvement
    in compression compared to efficient implementations of current approaches. Additionally,
    LGreCo can complement existing adaptive algorithms.\r\n"
acknowledged_ssus:
- _id: ScienComp
alternative_title:
- ISTA Thesis
article_processing_charge: No
author:
- first_name: Ilia
  full_name: Markov, Ilia
  id: D0CF4148-C985-11E9-8066-0BDEE5697425
  last_name: Markov
citation:
  ama: 'Markov I. Communication-efficient distributed training of deep neural networks :
    An algorithms and systems perspective. 2024. doi:<a href="https://doi.org/10.15479/at:ista:17490">10.15479/at:ista:17490</a>'
  apa: 'Markov, I. (2024). <i>Communication-efficient distributed training of deep
    neural networks : An algorithms and systems perspective</i>. Institute of Science
    and Technology Austria. <a href="https://doi.org/10.15479/at:ista:17490">https://doi.org/10.15479/at:ista:17490</a>'
  chicago: 'Markov, Ilia. “Communication-Efficient Distributed Training of Deep Neural
    Networks : An Algorithms and Systems Perspective.” Institute of Science and Technology
    Austria, 2024. <a href="https://doi.org/10.15479/at:ista:17490">https://doi.org/10.15479/at:ista:17490</a>.'
  ieee: 'I. Markov, “Communication-efficient distributed training of deep neural networks :
    An algorithms and systems perspective,” Institute of Science and Technology Austria,
    2024.'
  ista: 'Markov I. 2024. Communication-efficient distributed training of deep neural
    networks : An algorithms and systems perspective. Institute of Science and Technology
    Austria.'
  mla: 'Markov, Ilia. <i>Communication-Efficient Distributed Training of Deep Neural
    Networks : An Algorithms and Systems Perspective</i>. Institute of Science and
    Technology Austria, 2024, doi:<a href="https://doi.org/10.15479/at:ista:17490">10.15479/at:ista:17490</a>.'
  short: 'I. Markov, Communication-Efficient Distributed Training of Deep Neural Networks :
    An Algorithms and Systems Perspective, Institute of Science and Technology Austria,
    2024.'
corr_author: '1'
date_created: 2024-09-04T08:51:11Z
date_published: 2024-09-04T00:00:00Z
date_updated: 2026-04-07T13:00:54Z
day: '04'
ddc:
- '000'
degree_awarded: PhD
department:
- _id: GradSch
- _id: DaAl
doi: 10.15479/at:ista:17490
ec_funded: 1
file:
- access_level: closed
  checksum: 77609f4835d2730e46fa0d42d9134ed9
  content_type: application/x-zip-compressed
  creator: imarkov
  date_created: 2024-09-04T08:35:35Z
  date_updated: 2024-09-04T08:35:35Z
  file_id: '17491'
  file_name: Thesis.zip
  file_size: 43327753
  relation: source_file
- access_level: open_access
  checksum: 9e68f7217570f756ceb8f70b980938cd
  content_type: application/pdf
  creator: imarkov
  date_created: 2024-09-04T08:36:06Z
  date_updated: 2024-09-04T08:36:06Z
  file_id: '17492'
  file_name: Thesis_final_version_pdfa2.pdf
  file_size: 2756082
  relation: main_file
  success: 1
file_date_updated: 2024-09-04T08:36:06Z
has_accepted_license: '1'
language:
- iso: eng
license: https://creativecommons.org/licenses/by-nc-sa/4.0/
month: '09'
oa: 1
oa_version: Published Version
page: '102'
project:
- _id: 268A44D6-B435-11E9-9278-68D0E5697425
  call_identifier: H2020
  grant_number: '805223'
  name: Elastic Coordination for Scalable Machine Learning
publication_identifier:
  issn:
  - 2663-337X
publication_status: published
publisher: Institute of Science and Technology Austria
related_material:
  record:
  - id: '17456'
    relation: part_of_dissertation
    status: public
  - id: '14461'
    relation: part_of_dissertation
    status: public
  - id: '12780'
    relation: part_of_dissertation
    status: public
status: public
supervisor:
- first_name: Dan-Adrian
  full_name: Alistarh, Dan-Adrian
  id: 4A899BFC-F248-11E8-B48F-1D18A9856A87
  last_name: Alistarh
  orcid: 0000-0003-3650-940X
title: 'Communication-efficient distributed training of deep neural networks : An
  algorithms and systems perspective'
tmp:
  image: /images/cc_by_nc_sa.png
  legal_code_url: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode
  name: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC
    BY-NC-SA 4.0)
  short: CC BY-NC-SA (4.0)
type: dissertation
user_id: ba8df636-2132-11f1-aed0-ed93e2281fdd
year: '2024'
...
---
_id: '17456'
abstract:
- lang: eng
  text: "Data-parallel distributed training of deep neural networks (DNN) has gained
    very widespread adoption, but can still experience communication bottlenecks.
    To address this issue, entire families of compression mechanisms have been developed,
    including quantization, sparsification, and low-rank approximation, some of which
    are seeing significant practical adoption. Despite this progress, almost all known
    compression schemes apply compression uniformly across DNN layers, although layers
    are heterogeneous in terms of parameter count and their impact on model accuracy.In
    this work, we provide a general framework for adapting the degree of compression
    across the model's layers dynamically during training, improving the overall compression,
    while leading to substantial speedups, without sacrificing accuracy. Our framework,
    called L-GreCo, is based on an adaptive algorithm, which automatically picks the
    optimal compression parameters for model layers guaranteeing the best compression
    ratio while satisfying an error constraint. Extensive experiments over image classification
    and language modeling tasks shows that L-GreCo is effective across all existing
    families of compression methods, and achieves up to 2.5\r\n×\r\n training speedup
    and up to 5\r\n×\r\n compression improvement over efficient implementations of
    existing approaches, while recovering full accuracy. Moreover, L-GreCo is complementary
    to existing adaptive algorithms, improving their compression ratio by 50\\% and
    practical throughput by 66\\%. An anonymized implementation is available at https://github.com/LGrCo/L-GreCo."
article_processing_charge: No
arxiv: 1
author:
- first_name: Ilia
  full_name: Markov, Ilia
  id: D0CF4148-C985-11E9-8066-0BDEE5697425
  last_name: Markov
- first_name: Kaveh
  full_name: Alimohammadi, Kaveh
  last_name: Alimohammadi
- first_name: Elias
  full_name: Frantar, Elias
  id: 09a8f98d-ec99-11ea-ae11-c063a7b7fe5f
  last_name: Frantar
- first_name: Dan-Adrian
  full_name: Alistarh, Dan-Adrian
  id: 4A899BFC-F248-11E8-B48F-1D18A9856A87
  last_name: Alistarh
  orcid: 0000-0003-3650-940X
citation:
  ama: 'Markov I, Alimohammadi K, Frantar E, Alistarh D-A. L-GreCo: Layerwise-adaptive
    gradient compression for efficient data-parallel deep learning. In: Gibbons P,
    Pekhimenko G, De Sa C, eds. <i>Proceedings of Machine Learning and Systems </i>.
    Vol 6. Association for Computing Machinery; 2024.'
  apa: 'Markov, I., Alimohammadi, K., Frantar, E., &#38; Alistarh, D.-A. (2024). L-GreCo:
    Layerwise-adaptive gradient compression for efficient data-parallel deep learning.
    In P. Gibbons, G. Pekhimenko, &#38; C. De Sa (Eds.), <i>Proceedings of Machine
    Learning and Systems </i> (Vol. 6). Athens, Greece: Association for Computing
    Machinery.'
  chicago: 'Markov, Ilia, Kaveh Alimohammadi, Elias Frantar, and Dan-Adrian Alistarh.
    “L-GreCo: Layerwise-Adaptive Gradient Compression for Efficient Data-Parallel
    Deep Learning.” In <i>Proceedings of Machine Learning and Systems </i>, edited
    by P. Gibbons, G. Pekhimenko, and C. De Sa, Vol. 6. Association for Computing
    Machinery, 2024.'
  ieee: 'I. Markov, K. Alimohammadi, E. Frantar, and D.-A. Alistarh, “L-GreCo: Layerwise-adaptive
    gradient compression for efficient data-parallel deep learning,” in <i>Proceedings
    of Machine Learning and Systems </i>, Athens, Greece, 2024, vol. 6.'
  ista: 'Markov I, Alimohammadi K, Frantar E, Alistarh D-A. 2024. L-GreCo: Layerwise-adaptive
    gradient compression for efficient data-parallel deep learning. Proceedings of
    Machine Learning and Systems . MLSys: Machine Learning and Systems vol. 6.'
  mla: 'Markov, Ilia, et al. “L-GreCo: Layerwise-Adaptive Gradient Compression for
    Efficient Data-Parallel Deep Learning.” <i>Proceedings of Machine Learning and
    Systems </i>, edited by P. Gibbons et al., vol. 6, Association for Computing Machinery,
    2024.'
  short: I. Markov, K. Alimohammadi, E. Frantar, D.-A. Alistarh, in:, P. Gibbons,
    G. Pekhimenko, C. De Sa (Eds.), Proceedings of Machine Learning and Systems ,
    Association for Computing Machinery, 2024.
conference:
  end_date: 2024-04-22
  location: Athens, Greece
  name: 'MLSys: Machine Learning and Systems'
  start_date: 2024-04-22
corr_author: '1'
date_created: 2024-08-22T08:29:25Z
date_published: 2024-04-01T00:00:00Z
date_updated: 2026-04-07T13:00:54Z
day: '01'
department:
- _id: DaAl
editor:
- first_name: P.
  full_name: Gibbons, P.
  last_name: Gibbons
- first_name: G.
  full_name: Pekhimenko, G.
  last_name: Pekhimenko
- first_name: C.
  full_name: De Sa, C.
  last_name: De Sa
external_id:
  arxiv:
  - '2210.17357'
intvolume: '6'
language:
- iso: eng
main_file_link:
- open_access: '1'
  url: https://proceedings.mlsys.org/paper_files/paper/2024/hash/9069a8976ff06f6443e7f4172990a580-Abstract-Conference.html
month: '04'
oa: 1
oa_version: Published Version
publication: 'Proceedings of Machine Learning and Systems '
publication_status: published
publisher: Association for Computing Machinery
quality_controlled: '1'
related_material:
  record:
  - id: '17490'
    relation: dissertation_contains
    status: public
status: public
title: 'L-GreCo: Layerwise-adaptive gradient compression for efficient data-parallel
  deep learning'
type: conference
user_id: 8b945eb4-e2f2-11eb-945a-df72226e66a9
volume: 6
year: '2024'
...
---
_id: '14461'
abstract:
- lang: eng
  text: 'Communication-reduction techniques are a popular way to improve scalability
    in data-parallel training of deep neural networks (DNNs). The recent emergence
    of large language models such as GPT has created the need for new approaches to
    exploit data-parallelism. Among these, fully-sharded data parallel (FSDP) training
    is highly popular, yet it still encounters scalability bottlenecks. One reason
    is that applying compression techniques to FSDP is challenging: as the vast majority
    of the communication involves the model’s weights, direct compression alters convergence
    and leads to accuracy loss. We present QSDP, a variant of FSDP which supports
    both gradient and weight quantization with theoretical guarantees, is simple to
    implement and has essentially no overheads. To derive QSDP we prove that a natural
    modification of SGD achieves convergence even when we only maintain quantized
    weights, and thus the domain over which we train consists of quantized points
    and is, therefore, highly non-convex. We validate this approach by training GPT-family
    models with up to 1.3 billion parameters on a multi-node cluster. Experiments
    show that QSDP preserves model accuracy, while completely removing the communication
    bottlenecks of FSDP, providing end-to-end speedups of up to 2.2x.'
acknowledged_ssus:
- _id: ScienComp
acknowledgement: The authors gratefully acknowledge funding from the European Research
  Council (ERC) under the European Union’s Horizon 2020 research and innovation programme
  (grant agreement No 805223 ScaleML), as well as experimental support from the IST
  Austria IT department, in particular Stefano Elefante, Andrei Hornoiu, and Alois
  Schloegl. AV acknowledges the support of the French Agence Nationale de la Recherche
  (ANR), under grant ANR-21-CE48-0016 (project COMCOPT), the support of Fondation
  Hadamard with a PRMO grant, and the support of CNRS with a CoopIntEER IEA grant
  (project ALFRED).
alternative_title:
- PMLR
article_processing_charge: No
arxiv: 1
author:
- first_name: Ilia
  full_name: Markov, Ilia
  id: D0CF4148-C985-11E9-8066-0BDEE5697425
  last_name: Markov
- first_name: Adrian
  full_name: Vladu, Adrian
  last_name: Vladu
- first_name: Qi
  full_name: Guo, Qi
  last_name: Guo
- first_name: Dan-Adrian
  full_name: Alistarh, Dan-Adrian
  id: 4A899BFC-F248-11E8-B48F-1D18A9856A87
  last_name: Alistarh
  orcid: 0000-0003-3650-940X
citation:
  ama: 'Markov I, Vladu A, Guo Q, Alistarh D-A. Quantized distributed training of
    large models with convergence guarantees. In: <i>Proceedings of the 40th International
    Conference on Machine Learning</i>. Vol 202. ML Research Press; 2023:24020-24044.'
  apa: 'Markov, I., Vladu, A., Guo, Q., &#38; Alistarh, D.-A. (2023). Quantized distributed
    training of large models with convergence guarantees. In <i>Proceedings of the
    40th International Conference on Machine Learning</i> (Vol. 202, pp. 24020–24044).
    Honolulu, Hawaii, HI, United States: ML Research Press.'
  chicago: Markov, Ilia, Adrian Vladu, Qi Guo, and Dan-Adrian Alistarh. “Quantized
    Distributed Training of Large Models with Convergence Guarantees.” In <i>Proceedings
    of the 40th International Conference on Machine Learning</i>, 202:24020–44. ML
    Research Press, 2023.
  ieee: I. Markov, A. Vladu, Q. Guo, and D.-A. Alistarh, “Quantized distributed training
    of large models with convergence guarantees,” in <i>Proceedings of the 40th International
    Conference on Machine Learning</i>, Honolulu, Hawaii, HI, United States, 2023,
    vol. 202, pp. 24020–24044.
  ista: 'Markov I, Vladu A, Guo Q, Alistarh D-A. 2023. Quantized distributed training
    of large models with convergence guarantees. Proceedings of the 40th International
    Conference on Machine Learning. ICML: International Conference on Machine Learning,
    PMLR, vol. 202, 24020–24044.'
  mla: Markov, Ilia, et al. “Quantized Distributed Training of Large Models with Convergence
    Guarantees.” <i>Proceedings of the 40th International Conference on Machine Learning</i>,
    vol. 202, ML Research Press, 2023, pp. 24020–44.
  short: I. Markov, A. Vladu, Q. Guo, D.-A. Alistarh, in:, Proceedings of the 40th
    International Conference on Machine Learning, ML Research Press, 2023, pp. 24020–24044.
conference:
  end_date: 2023-07-29
  location: Honolulu, Hawaii, HI, United States
  name: 'ICML: International Conference on Machine Learning'
  start_date: 2023-07-23
corr_author: '1'
date_created: 2023-10-29T23:01:17Z
date_published: 2023-07-30T00:00:00Z
date_updated: 2026-04-07T13:00:54Z
day: '30'
department:
- _id: DaAl
ec_funded: 1
external_id:
  arxiv:
  - '2302.02390'
intvolume: '202'
language:
- iso: eng
main_file_link:
- open_access: '1'
  url: https://doi.org/10.48550/arXiv.2302.02390
month: '07'
oa: 1
oa_version: Preprint
page: 24020-24044
project:
- _id: 268A44D6-B435-11E9-9278-68D0E5697425
  call_identifier: H2020
  grant_number: '805223'
  name: Elastic Coordination for Scalable Machine Learning
publication: Proceedings of the 40th International Conference on Machine Learning
publication_identifier:
  eissn:
  - 2640-3498
publication_status: published
publisher: ML Research Press
quality_controlled: '1'
related_material:
  record:
  - id: '17490'
    relation: dissertation_contains
    status: public
scopus_import: '1'
status: public
title: Quantized distributed training of large models with convergence guarantees
type: conference
user_id: 2DF688A6-F248-11E8-B48F-1D18A9856A87
volume: 202
year: '2023'
...
---
_id: '12780'
abstract:
- lang: eng
  text: "The ability to scale out training workloads has been one of the key performance
    enablers of deep learning. The main scaling approach is data-parallel GPU-based
    training, which has been boosted by hardware and software support for highly efficient
    point-to-point communication, and in particular via hardware bandwidth over-provisioning.
    Overprovisioning comes at a cost: there is an order of magnitude price difference
    between \"cloud-grade\" servers with such support, relative to their popular \"consumer-grade\"
    counterparts, although single server-grade and consumer-grade GPUs can have similar
    computational envelopes.\r\n\r\nIn this paper, we show that the costly hardware
    overprovisioning approach can be supplanted via algorithmic and system design,
    and propose a framework called CGX, which provides efficient software support
    for compressed communication in ML applications, for both multi-GPU single-node
    training, as well as larger-scale multi-node training. CGX is based on two technical
    advances: At the system level, it relies on a re-developed communication stack
    for ML frameworks, which provides flexible, highly-efficient support for compressed
    communication. At the application level, it provides seamless, parameter-free
    integration with popular frameworks, so that end-users do not have to modify training
    recipes or make significant changes to training code. This is complemented by a layer-wise adaptive
    compression technique which dynamically balances compression gains with accuracy
    preservation. CGX integrates with popular ML frameworks, providing up to 3X speedups
    for multi-GPU nodes based on commodity hardware, and order-of-magnitude improvements
    in the multi-node setting, with negligible impact on accuracy."
acknowledgement: The authors sincerely thank Nikoli Dryden, Tal Ben-Nun, Torsten Hoefler
  and Bapi Chatterjee for useful discussions throughout the development of this project.
article_processing_charge: Yes (via OA deal)
arxiv: 1
author:
- first_name: Ilia
  full_name: Markov, Ilia
  id: D0CF4148-C985-11E9-8066-0BDEE5697425
  last_name: Markov
- first_name: Hamidreza
  full_name: Ramezanikebrya, Hamidreza
  last_name: Ramezanikebrya
- first_name: Dan-Adrian
  full_name: Alistarh, Dan-Adrian
  id: 4A899BFC-F248-11E8-B48F-1D18A9856A87
  last_name: Alistarh
  orcid: 0000-0003-3650-940X
citation:
  ama: 'Markov I, Ramezanikebrya H, Alistarh D-A. CGX: Adaptive system support for
    communication-efficient deep learning. In: <i>Proceedings of the 23rd ACM/IFIP
    International Middleware Conference</i>. Association for Computing Machinery;
    2022:241-254. doi:<a href="https://doi.org/10.1145/3528535.3565248">10.1145/3528535.3565248</a>'
  apa: 'Markov, I., Ramezanikebrya, H., &#38; Alistarh, D.-A. (2022). CGX: Adaptive
    system support for communication-efficient deep learning. In <i>Proceedings of
    the 23rd ACM/IFIP International Middleware Conference</i> (pp. 241–254). Quebec,
    QC, Canada: Association for Computing Machinery. <a href="https://doi.org/10.1145/3528535.3565248">https://doi.org/10.1145/3528535.3565248</a>'
  chicago: 'Markov, Ilia, Hamidreza Ramezanikebrya, and Dan-Adrian Alistarh. “CGX:
    Adaptive System Support for Communication-Efficient Deep Learning.” In <i>Proceedings
    of the 23rd ACM/IFIP International Middleware Conference</i>, 241–54. Association
    for Computing Machinery, 2022. <a href="https://doi.org/10.1145/3528535.3565248">https://doi.org/10.1145/3528535.3565248</a>.'
  ieee: 'I. Markov, H. Ramezanikebrya, and D.-A. Alistarh, “CGX: Adaptive system support
    for communication-efficient deep learning,” in <i>Proceedings of the 23rd ACM/IFIP
    International Middleware Conference</i>, Quebec, QC, Canada, 2022, pp. 241–254.'
  ista: 'Markov I, Ramezanikebrya H, Alistarh D-A. 2022. CGX: Adaptive system support
    for communication-efficient deep learning. Proceedings of the 23rd ACM/IFIP International
    Middleware Conference. Middleware: International Middleware Conference, 241–254.'
  mla: 'Markov, Ilia, et al. “CGX: Adaptive System Support for Communication-Efficient
    Deep Learning.” <i>Proceedings of the 23rd ACM/IFIP International Middleware Conference</i>,
    Association for Computing Machinery, 2022, pp. 241–54, doi:<a href="https://doi.org/10.1145/3528535.3565248">10.1145/3528535.3565248</a>.'
  short: I. Markov, H. Ramezanikebrya, D.-A. Alistarh, in:, Proceedings of the 23rd
    ACM/IFIP International Middleware Conference, Association for Computing Machinery,
    2022, pp. 241–254.
conference:
  end_date: 2022-11-11
  location: Quebec, QC, Canada
  name: 'Middleware: International Middleware Conference'
  start_date: 2022-11-07
corr_author: '1'
date_created: 2023-03-31T06:17:00Z
date_published: 2022-11-01T00:00:00Z
date_updated: 2026-04-07T13:00:54Z
day: '01'
ddc:
- '000'
department:
- _id: DaAl
doi: 10.1145/3528535.3565248
external_id:
  arxiv:
  - '2111.08617'
  isi:
  - '001061556200024'
file:
- access_level: open_access
  checksum: 1a397746235f245da5468819247ff663
  content_type: application/pdf
  creator: dernst
  date_created: 2023-04-03T06:17:58Z
  date_updated: 2023-04-03T06:17:58Z
  file_id: '12795'
  file_name: 2022_ACMMiddleware_Markov.pdf
  file_size: 1514169
  relation: main_file
  success: 1
file_date_updated: 2023-04-03T06:17:58Z
has_accepted_license: '1'
isi: 1
language:
- iso: eng
month: '11'
oa: 1
oa_version: Published Version
page: 241-254
publication: Proceedings of the 23rd ACM/IFIP International Middleware Conference
publication_identifier:
  isbn:
  - '9781450393409'
publication_status: published
publisher: Association for Computing Machinery
quality_controlled: '1'
related_material:
  record:
  - id: '17490'
    relation: dissertation_contains
    status: public
scopus_import: '1'
status: public
title: 'CGX: Adaptive system support for communication-efficient deep learning'
tmp:
  image: /images/cc_by.png
  legal_code_url: https://creativecommons.org/licenses/by/4.0/legalcode
  name: Creative Commons Attribution 4.0 International Public License (CC-BY 4.0)
  short: CC BY (4.0)
type: conference
user_id: 317138e5-6ab7-11ef-aa6d-ffef3953e345
year: '2022'
...
---
_id: '10432'
abstract:
- lang: eng
  text: One key element behind the recent progress of machine learning has been the
    ability to train machine learning models in large-scale distributed shared-memory
    and message-passing environments. Most of these models are trained employing variants
    of stochastic gradient descent (SGD) based optimization, but most methods involve
    some type of consistency relaxation relative to sequential SGD, to mitigate its
    large communication or synchronization costs at scale. In this paper, we introduce
    a general consistency condition covering communication-reduced and asynchronous
    distributed SGD implementations. Our framework, called elastic consistency, decouples
    the system-specific aspects of the implementation from the SGD convergence requirements,
    giving a general way to obtain convergence bounds for a wide variety of distributed
    SGD methods used in practice. Elastic consistency can be used to re-derive or
    improve several previous convergence bounds in message-passing and shared-memory
    settings, but also to analyze new models and distribution schemes. As a direct
    application, we propose and analyze a new synchronization-avoiding scheduling
    scheme for distributed SGD, and show that it can be used to efficiently train
    deep convolutional models for image classification.
acknowledgement: "We would like to thank Christopher De Sa for his feedback on an
  earlier draft of this paper, as well as the anonymous AAAI reviewers for their useful
  comments. This project has received\r\nfunding from the European Research Council
  (ERC) under the European Union’s Horizon 2020 research and innovation programme
  (grant agreement No 805223 ScaleML). Bapi\r\nChatterjee was supported by the European
  Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie
  grant agreement No. 754411 (ISTPlus)."
article_processing_charge: No
arxiv: 1
author:
- first_name: Giorgi
  full_name: Nadiradze, Giorgi
  id: 3279A00C-F248-11E8-B48F-1D18A9856A87
  last_name: Nadiradze
  orcid: 0000-0001-5634-0731
- first_name: Ilia
  full_name: Markov, Ilia
  id: D0CF4148-C985-11E9-8066-0BDEE5697425
  last_name: Markov
- first_name: Bapi
  full_name: Chatterjee, Bapi
  id: 3C41A08A-F248-11E8-B48F-1D18A9856A87
  last_name: Chatterjee
  orcid: 0000-0002-2742-4028
- first_name: 'Vyacheslav '
  full_name: 'Kungurtsev, Vyacheslav '
  last_name: Kungurtsev
- first_name: Dan-Adrian
  full_name: Alistarh, Dan-Adrian
  id: 4A899BFC-F248-11E8-B48F-1D18A9856A87
  last_name: Alistarh
  orcid: 0000-0003-3650-940X
citation:
  ama: 'Nadiradze G, Markov I, Chatterjee B, Kungurtsev V, Alistarh D-A. Elastic consistency:
    A practical consistency model for distributed stochastic gradient descent. In:
    <i>Proceedings of the AAAI Conference on Artificial Intelligence</i>. Vol 35.
    ; 2021:9037-9045.'
  apa: 'Nadiradze, G., Markov, I., Chatterjee, B., Kungurtsev, V., &#38; Alistarh,
    D.-A. (2021). Elastic consistency: A practical consistency model for distributed
    stochastic gradient descent. In <i>Proceedings of the AAAI Conference on Artificial
    Intelligence</i> (Vol. 35, pp. 9037–9045). Virtual.'
  chicago: 'Nadiradze, Giorgi, Ilia Markov, Bapi Chatterjee, Vyacheslav  Kungurtsev,
    and Dan-Adrian Alistarh. “Elastic Consistency: A Practical Consistency Model for
    Distributed Stochastic Gradient Descent.” In <i>Proceedings of the AAAI Conference
    on Artificial Intelligence</i>, 35:9037–45, 2021.'
  ieee: 'G. Nadiradze, I. Markov, B. Chatterjee, V. Kungurtsev, and D.-A. Alistarh,
    “Elastic consistency: A practical consistency model for distributed stochastic
    gradient descent,” in <i>Proceedings of the AAAI Conference on Artificial Intelligence</i>,
    Virtual, 2021, vol. 35, no. 10, pp. 9037–9045.'
  ista: 'Nadiradze G, Markov I, Chatterjee B, Kungurtsev V, Alistarh D-A. 2021. Elastic
    consistency: A practical consistency model for distributed stochastic gradient
    descent. Proceedings of the AAAI Conference on Artificial Intelligence. AAAI:
    Association for the Advancement of Artificial Intelligence vol. 35, 9037–9045.'
  mla: 'Nadiradze, Giorgi, et al. “Elastic Consistency: A Practical Consistency Model
    for Distributed Stochastic Gradient Descent.” <i>Proceedings of the AAAI Conference
    on Artificial Intelligence</i>, vol. 35, no. 10, 2021, pp. 9037–45.'
  short: G. Nadiradze, I. Markov, B. Chatterjee, V. Kungurtsev, D.-A. Alistarh, in:,
    Proceedings of the AAAI Conference on Artificial Intelligence, 2021, pp. 9037–9045.
conference:
  end_date: 2021-02-09
  location: Virtual
  name: 'AAAI: Association for the Advancement of Artificial Intelligence'
  start_date: 2021-02-02
date_created: 2021-12-09T09:21:35Z
date_published: 2021-05-18T00:00:00Z
date_updated: 2026-04-08T07:00:45Z
day: '18'
department:
- _id: DaAl
ec_funded: 1
external_id:
  arxiv:
  - '2001.05918'
intvolume: '35'
issue: '10'
language:
- iso: eng
main_file_link:
- open_access: '1'
  url: https://ojs.aaai.org/index.php/AAAI/article/view/17092
month: '05'
oa: 1
oa_version: Published Version
page: 9037-9045
project:
- _id: 260C2330-B435-11E9-9278-68D0E5697425
  call_identifier: H2020
  grant_number: '754411'
  name: ISTplus - Postdoctoral Fellowships
- _id: 268A44D6-B435-11E9-9278-68D0E5697425
  call_identifier: H2020
  grant_number: '805223'
  name: Elastic Coordination for Scalable Machine Learning
publication: Proceedings of the AAAI Conference on Artificial Intelligence
publication_status: published
quality_controlled: '1'
related_material:
  record:
  - id: '10429'
    relation: dissertation_contains
    status: public
status: public
title: 'Elastic consistency: A practical consistency model for distributed stochastic
  gradient descent'
type: conference
user_id: 8b945eb4-e2f2-11eb-945a-df72226e66a9
volume: 35
year: '2021'
...
---
_id: '10049'
abstract:
- lang: eng
  text: While messaging systems with strong security guarantees are widely used in
    practice, designing a protocol that scales efficiently to large groups and enjoys
    similar security guarantees remains largely open. The two existing proposals to
    date are ART (Cohn-Gordon et al., CCS18) and TreeKEM (IETF, The Messaging Layer
    Security Protocol, draft). TreeKEM is the currently considered candidate by the
    IETF MLS working group, but dynamic group operations (i.e. adding and removing
    users) can cause efficiency issues. In this paper we formalize and analyze a variant
    of TreeKEM which we term Tainted TreeKEM (TTKEM for short). The basic idea underlying
    TTKEM was suggested by Millican (MLS mailing list, February 2018). This version
    is more efficient than TreeKEM for some natural distributions of group operations,
    we quantify this through simulations.Our second contribution is two security proofs
    for TTKEM which establish post compromise and forward secrecy even against adaptive
    attackers. The security loss (to the underlying PKE) in the Random Oracle Model
    is a polynomial factor, and a quasipolynomial one in the Standard Model. Our proofs
    can be adapted to TreeKEM as well. Before our work no security proof for any TreeKEM-like
    protocol establishing tight security against an adversary who can adaptively choose
    the sequence of operations was known. We also are the first to prove (or even
    formalize) active security, where the server can arbitrarily deviate from the protocol
    specification. Proving fully active security – where the users can also arbitrarily
    deviate – remains open.
acknowledgement: The first three authors contributed equally to this work. Funded
  by the European Research Council (ERC) under the European Union’s Horizon 2020 research
  and innovation programme (682815-TOCNeT). Funded by the European Union’s Horizon
  2020 research and innovation programme under the Marie Skłodowska-Curie Grant Agreement
  No. 665385.
article_processing_charge: No
author:
- first_name: Karen
  full_name: Klein, Karen
  id: 3E83A2F8-F248-11E8-B48F-1D18A9856A87
  last_name: Klein
- first_name: Guillermo
  full_name: Pascual Perez, Guillermo
  id: 2D7ABD02-F248-11E8-B48F-1D18A9856A87
  last_name: Pascual Perez
  orcid: 0000-0001-8630-415X
- first_name: Michael
  full_name: Walter, Michael
  id: 488F98B0-F248-11E8-B48F-1D18A9856A87
  last_name: Walter
  orcid: 0000-0003-3186-2482
- first_name: Chethan
  full_name: Kamath Hosdurg, Chethan
  id: 4BD3F30E-F248-11E8-B48F-1D18A9856A87
  last_name: Kamath Hosdurg
  orcid: 0009-0006-6812-7317
- first_name: Margarita
  full_name: Capretto, Margarita
  last_name: Capretto
- first_name: Miguel
  full_name: Cueto Noval, Miguel
  id: ffc563a3-f6e0-11ea-865d-e3cce03d17cc
  last_name: Cueto Noval
  orcid: 0000-0002-2505-4246
- first_name: Ilia
  full_name: Markov, Ilia
  id: D0CF4148-C985-11E9-8066-0BDEE5697425
  last_name: Markov
- first_name: Michelle X
  full_name: Yeo, Michelle X
  id: 2D82B818-F248-11E8-B48F-1D18A9856A87
  last_name: Yeo
  orcid: 0009-0001-3676-4809
- first_name: Joel F
  full_name: Alwen, Joel F
  id: 2A8DFA8C-F248-11E8-B48F-1D18A9856A87
  last_name: Alwen
- first_name: Krzysztof Z
  full_name: Pietrzak, Krzysztof Z
  id: 3E04A7AA-F248-11E8-B48F-1D18A9856A87
  last_name: Pietrzak
  orcid: 0000-0002-9139-1654
citation:
  ama: 'Klein K, Pascual Perez G, Walter M, et al. Keep the dirt: tainted TreeKEM,
    adaptively and actively secure continuous group key agreement. In: <i>2021 IEEE
    Symposium on Security and Privacy </i>. IEEE; 2021:268-284. doi:<a href="https://doi.org/10.1109/sp40001.2021.00035">10.1109/sp40001.2021.00035</a>'
  apa: 'Klein, K., Pascual Perez, G., Walter, M., Kamath Hosdurg, C., Capretto, M.,
    Cueto Noval, M., … Pietrzak, K. Z. (2021). Keep the dirt: tainted TreeKEM, adaptively
    and actively secure continuous group key agreement. In <i>2021 IEEE Symposium
    on Security and Privacy </i> (pp. 268–284). San Francisco, CA, United States:
    IEEE. <a href="https://doi.org/10.1109/sp40001.2021.00035">https://doi.org/10.1109/sp40001.2021.00035</a>'
  chicago: 'Klein, Karen, Guillermo Pascual Perez, Michael Walter, Chethan Kamath
    Hosdurg, Margarita Capretto, Miguel Cueto Noval, Ilia Markov, Michelle X Yeo,
    Joel F Alwen, and Krzysztof Z Pietrzak. “Keep the Dirt: Tainted TreeKEM, Adaptively
    and Actively Secure Continuous Group Key Agreement.” In <i>2021 IEEE Symposium
    on Security and Privacy </i>, 268–84. IEEE, 2021. <a href="https://doi.org/10.1109/sp40001.2021.00035">https://doi.org/10.1109/sp40001.2021.00035</a>.'
  ieee: 'K. Klein <i>et al.</i>, “Keep the dirt: tainted TreeKEM, adaptively and actively
    secure continuous group key agreement,” in <i>2021 IEEE Symposium on Security
    and Privacy </i>, San Francisco, CA, United States, 2021, pp. 268–284.'
  ista: 'Klein K, Pascual Perez G, Walter M, Kamath Hosdurg C, Capretto M, Cueto Noval
    M, Markov I, Yeo MX, Alwen JF, Pietrzak KZ. 2021. Keep the dirt: tainted TreeKEM,
    adaptively and actively secure continuous group key agreement. 2021 IEEE Symposium
    on Security and Privacy . SP: Symposium on Security and Privacy, 268–284.'
  mla: 'Klein, Karen, et al. “Keep the Dirt: Tainted TreeKEM, Adaptively and Actively
    Secure Continuous Group Key Agreement.” <i>2021 IEEE Symposium on Security and
    Privacy </i>, IEEE, 2021, pp. 268–84, doi:<a href="https://doi.org/10.1109/sp40001.2021.00035">10.1109/sp40001.2021.00035</a>.'
  short: K. Klein, G. Pascual Perez, M. Walter, C. Kamath Hosdurg, M. Capretto, M.
    Cueto Noval, I. Markov, M.X. Yeo, J.F. Alwen, K.Z. Pietrzak, in:, 2021 IEEE Symposium
    on Security and Privacy , IEEE, 2021, pp. 268–284.
conference:
  end_date: 2021-05-27
  location: San Francisco, CA, United States
  name: 'SP: Symposium on Security and Privacy'
  start_date: 2021-05-24
corr_author: '1'
date_created: 2021-09-27T13:46:27Z
date_published: 2021-08-26T00:00:00Z
date_updated: 2026-04-08T07:01:44Z
day: '26'
department:
- _id: KrPi
- _id: DaAl
doi: 10.1109/sp40001.2021.00035
ec_funded: 1
external_id:
  isi:
  - '001316065000016'
isi: 1
language:
- iso: eng
main_file_link:
- open_access: '1'
  url: https://eprint.iacr.org/2019/1489
month: '08'
oa: 1
oa_version: Preprint
page: 268-284
project:
- _id: 2564DBCA-B435-11E9-9278-68D0E5697425
  call_identifier: H2020
  grant_number: '665385'
  name: International IST Doctoral Program
- _id: 258AA5B2-B435-11E9-9278-68D0E5697425
  call_identifier: H2020
  grant_number: '682815'
  name: Teaching Old Crypto New Tricks
publication: '2021 IEEE Symposium on Security and Privacy '
publication_status: published
publisher: IEEE
quality_controlled: '1'
related_material:
  record:
  - id: '18088'
    relation: dissertation_contains
    status: public
  - id: '10035'
    relation: dissertation_contains
    status: public
scopus_import: '1'
status: public
title: 'Keep the dirt: tainted TreeKEM, adaptively and actively secure continuous
  group key agreement'
type: conference
user_id: 317138e5-6ab7-11ef-aa6d-ffef3953e345
year: '2021'
...
---
_id: '15086'
abstract:
- lang: eng
  text: "Many communication-efficient variants of SGD use gradient quantization schemes.
    These schemes are often heuristic and fixed over the course of training. We empirically
    observe that the statistics of gradients of deep models change during the training.
    Motivated by this observation, we introduce two adaptive quantization schemes,
    ALQ and AMQ. In both schemes, processors update their compression schemes in parallel
    by efficiently computing sufficient statistics of a parametric distribution. We
    improve the validation accuracy by almost 2% on CIFAR-10 and 1% on ImageNet in
    challenging low-cost communication setups. Our adaptive methods are also significantly
    more robust to the choice of hyperparameters."
acknowledgement: "The authors would like to thank Blair Bilodeau, David Fleet, Mufan
  Li, and Jeffrey Negrea for\r\nhelpful discussions. FF was supported by OGS Scholarship.
  DA and IM were supported the\r\nEuropean Research Council (ERC) under the European
  Union’s Horizon 2020 research and innovation\r\nprogramme (grant agreement No 805223
  ScaleML). DMR was supported by an NSERC Discovery\r\nGrant. ARK was supported by
  NSERC Postdoctoral Fellowship. Resources used in preparing this research were provided,
  in part, by the Province of Ontario, the Government of Canada through CIFAR, and
  companies sponsoring the Vector Institute."
alternative_title:
- NeurIPS
article_processing_charge: No
arxiv: 1
author:
- first_name: 'Fartash '
  full_name: 'Faghri, Fartash '
  last_name: Faghri
- first_name: 'Iman '
  full_name: 'Tabrizian, Iman '
  last_name: Tabrizian
- first_name: Ilia
  full_name: Markov, Ilia
  id: D0CF4148-C985-11E9-8066-0BDEE5697425
  last_name: Markov
- first_name: Dan-Adrian
  full_name: Alistarh, Dan-Adrian
  id: 4A899BFC-F248-11E8-B48F-1D18A9856A87
  last_name: Alistarh
  orcid: 0000-0003-3650-940X
- first_name: 'Daniel '
  full_name: 'Roy, Daniel '
  last_name: Roy
- first_name: 'Ali '
  full_name: 'Ramezani-Kebrya, Ali '
  last_name: Ramezani-Kebrya
citation:
  ama: 'Faghri F, Tabrizian I, Markov I, Alistarh D-A, Roy D, Ramezani-Kebrya A. Adaptive
    gradient quantization for data-parallel SGD. In: <i>Advances in Neural Information
    Processing Systems</i>. Vol 33. Neural Information Processing Systems Foundation;
    2020.'
  apa: 'Faghri, F., Tabrizian, I., Markov, I., Alistarh, D.-A., Roy, D., &#38; Ramezani-Kebrya,
    A. (2020). Adaptive gradient quantization for data-parallel SGD. In <i>Advances
    in Neural Information Processing Systems</i> (Vol. 33). Vancouver, Canada: Neural
    Information Processing Systems Foundation.'
  chicago: Faghri, Fartash , Iman  Tabrizian, Ilia Markov, Dan-Adrian Alistarh, Daniel  Roy,
    and Ali  Ramezani-Kebrya. “Adaptive Gradient Quantization for Data-Parallel SGD.”
    In <i>Advances in Neural Information Processing Systems</i>, Vol. 33. Neural Information
    Processing Systems Foundation, 2020.
  ieee: F. Faghri, I. Tabrizian, I. Markov, D.-A. Alistarh, D. Roy, and A. Ramezani-Kebrya,
    “Adaptive gradient quantization for data-parallel SGD,” in <i>Advances in Neural
    Information Processing Systems</i>, Vancouver, Canada, 2020, vol. 33.
  ista: 'Faghri F, Tabrizian I, Markov I, Alistarh D-A, Roy D, Ramezani-Kebrya A.
    2020. Adaptive gradient quantization for data-parallel SGD. Advances in Neural
    Information Processing Systems. NeurIPS: Neural Information Processing Systems,
    NeurIPS, vol. 33.'
  mla: Faghri, Fartash, et al. “Adaptive Gradient Quantization for Data-Parallel SGD.”
    <i>Advances in Neural Information Processing Systems</i>, vol. 33, Neural Information
    Processing Systems Foundation, 2020.
  short: F. Faghri, I. Tabrizian, I. Markov, D.-A. Alistarh, D. Roy, A. Ramezani-Kebrya,
    in:, Advances in Neural Information Processing Systems, Neural Information Processing
    Systems Foundation, 2020.
conference:
  end_date: 2020-12-12
  location: Vancouver, Canada
  name: 'NeurIPS: Neural Information Processing Systems'
  start_date: 2020-12-06
date_created: 2024-03-06T08:35:58Z
date_published: 2020-12-10T00:00:00Z
date_updated: 2025-04-14T07:49:16Z
day: '10'
department:
- _id: DaAl
ec_funded: 1
external_id:
  arxiv:
  - '2010.12460'
intvolume: '33'
language:
- iso: eng
main_file_link:
- open_access: '1'
  url: https://doi.org/10.48550/arXiv.2010.12460
month: '12'
oa: 1
oa_version: Preprint
project:
- _id: 268A44D6-B435-11E9-9278-68D0E5697425
  call_identifier: H2020
  grant_number: '805223'
  name: Elastic Coordination for Scalable Machine Learning
publication: Advances in Neural Information Processing Systems
publication_identifier:
  isbn:
  - '9781713829546'
publication_status: published
publisher: Neural Information Processing Systems Foundation
quality_controlled: '1'
status: public
title: Adaptive gradient quantization for data-parallel SGD
type: conference
user_id: 2DF688A6-F248-11E8-B48F-1D18A9856A87
volume: 33
year: '2020'
...
