---
OA_place: publisher
OA_type: diamond
_id: '20034'
abstract:
- lang: eng
  text: We introduce LDAdam, a memory-efficient optimizer for training large models
    that performs adaptive optimization steps within lower-dimensional subspaces,
    while consistently exploring the full parameter space during training. This strategy
    keeps the optimizer's memory footprint to a fraction of the model size. LDAdam
    relies on a new projection-aware update rule for the optimizer states that allows
    for transitioning between subspaces, i.e., estimation of the statistics of the
    projected gradients. To mitigate the errors due to low-rank projection, LDAdam
    integrates a new generalized error feedback mechanism, which explicitly accounts
    for both gradient and optimizer state compression. We prove the convergence of
    LDAdam under standard assumptions, and provide empirical evidence that LDAdam
    allows for efficient fine-tuning and pre-training of language models.
article_processing_charge: No
arxiv: 1
author:
- first_name: Thomas
  full_name: Robert, Thomas
  last_name: Robert
- first_name: Mher
  full_name: Safaryan, Mher
  id: dd546b39-0804-11ed-9c55-ef075c39778d
  last_name: Safaryan
- first_name: Ionut-Vlad
  full_name: Modoranu, Ionut-Vlad
  id: 449f7a18-f128-11eb-9611-9b430c0c6333
  last_name: Modoranu
- first_name: Dan-Adrian
  full_name: Alistarh, Dan-Adrian
  id: 4A899BFC-F248-11E8-B48F-1D18A9856A87
  last_name: Alistarh
  orcid: 0000-0003-3650-940X
citation:
  ama: 'Robert T, Safaryan M, Modoranu I-V, Alistarh D-A. LDAdam: Adaptive optimization
    from low-dimensional gradient statistics. In: <i>13th International Conference
    on Learning Representations</i>. ICLR; 2025:101877-101913.'
  apa: 'Robert, T., Safaryan, M., Modoranu, I.-V., &#38; Alistarh, D.-A. (2025). LDAdam:
    Adaptive optimization from low-dimensional gradient statistics. In <i>13th International
    Conference on Learning Representations</i> (pp. 101877–101913). Singapore, Singapore:
    ICLR.'
  chicago: 'Robert, Thomas, Mher Safaryan, Ionut-Vlad Modoranu, and Dan-Adrian Alistarh.
    “LDAdam: Adaptive Optimization from Low-Dimensional Gradient Statistics.” In <i>13th
    International Conference on Learning Representations</i>, 101877–913. ICLR, 2025.'
  ieee: 'T. Robert, M. Safaryan, I.-V. Modoranu, and D.-A. Alistarh, “LDAdam: Adaptive
    optimization from low-dimensional gradient statistics,” in <i>13th International
    Conference on Learning Representations</i>, Singapore, Singapore, 2025, pp. 101877–101913.'
  ista: 'Robert T, Safaryan M, Modoranu I-V, Alistarh D-A. 2025. LDAdam: Adaptive
    optimization from low-dimensional gradient statistics. 13th International Conference
    on Learning Representations. ICLR: International Conference on Learning Representations,
    101877–101913.'
  mla: 'Robert, Thomas, et al. “LDAdam: Adaptive Optimization from Low-Dimensional
    Gradient Statistics.” <i>13th International Conference on Learning Representations</i>,
    ICLR, 2025, pp. 101877–913.'
  short: T. Robert, M. Safaryan, I.-V. Modoranu, D.-A. Alistarh, in:, 13th International
    Conference on Learning Representations, ICLR, 2025, pp. 101877–101913.
conference:
  end_date: 2025-04-28
  location: Singapore, Singapore
  name: 'ICLR: International Conference on Learning Representations'
  start_date: 2025-04-24
corr_author: '1'
date_created: 2025-07-20T22:02:02Z
date_published: 2025-04-01T00:00:00Z
date_updated: 2025-08-04T08:41:10Z
day: '01'
ddc:
- '000'
department:
- _id: DaAl
external_id:
  arxiv:
  - '2410.16103'
file:
- access_level: open_access
  checksum: 9327d82569358d7bf1c3ec1a9952e721
  content_type: application/pdf
  creator: dernst
  date_created: 2025-08-04T08:39:51Z
  date_updated: 2025-08-04T08:39:51Z
  file_id: '20113'
  file_name: 2025_ICLR_Robert.pdf
  file_size: 1346111
  relation: main_file
  success: 1
file_date_updated: 2025-08-04T08:39:51Z
has_accepted_license: '1'
language:
- iso: eng
month: '04'
oa: 1
oa_version: Published Version
page: 101877-101913
publication: 13th International Conference on Learning Representations
publication_identifier:
  isbn:
  - '9798331320850'
publication_status: published
publisher: ICLR
quality_controlled: '1'
related_material:
  link:
  - relation: software
    url: https://github.com/IST-DASLab/LDAdam
scopus_import: '1'
status: public
title: 'LDAdam: Adaptive optimization from low-dimensional gradient statistics'
tmp:
  image: /images/cc_by.png
  legal_code_url: https://creativecommons.org/licenses/by/4.0/legalcode
  name: Creative Commons Attribution 4.0 International Public License (CC-BY 4.0)
  short: CC BY (4.0)
type: conference
user_id: 2DF688A6-F248-11E8-B48F-1D18A9856A87
year: '2025'
...
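
The LDAdam record above names three moving parts: Adam statistics maintained inside a rank-r subspace, a projection-aware rule for carrying the optimizer state across subspace changes, and a generalized error-feedback buffer for whatever the low-rank projection discards. The following is a minimal NumPy sketch of that recipe for a single weight matrix; the SVD-based subspace choice, the moment re-projection rule, and all names are illustrative assumptions, not the authors' method (their implementation is at https://github.com/IST-DASLab/LDAdam).

```python
# Hedged sketch of the LDAdam ideas described in the abstract of record 20034.
import numpy as np

class LowRankAdamSketch:
    def __init__(self, shape, rank=4, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.rank, self.lr = rank, lr
        self.beta1, self.beta2, self.eps = beta1, beta2, eps
        # current subspace basis (m x r, orthonormal columns)
        self.P = np.linalg.qr(np.random.randn(shape[0], rank))[0]
        self.m = np.zeros((rank, shape[1]))  # first moment, kept in the subspace
        self.v = np.zeros((rank, shape[1]))  # second moment, kept in the subspace
        self.err = np.zeros(shape)           # error-feedback buffer (dense here for readability)
        self.t = 0

    def step(self, param, grad):
        self.t += 1
        corrected = grad + self.err            # fold the previous compression error back in
        # choose a fresh rank-r subspace from the error-corrected gradient (assumed rule)
        U = np.linalg.svd(corrected, full_matrices=False)[0]
        P_new = U[:, :self.rank]
        # projection-aware state transition: re-express the old moments in the new
        # basis; the |R| mapping for v is a crude stand-in for the paper's rule
        R = P_new.T @ self.P
        self.m, self.v = R @ self.m, np.abs(R) @ self.v
        self.P = P_new
        g_low = self.P.T @ corrected           # gradient statistics in the subspace
        self.err = corrected - self.P @ g_low  # remember what the projection discarded
        self.m = self.beta1 * self.m + (1 - self.beta1) * g_low
        self.v = self.beta2 * self.v + (1 - self.beta2) * g_low ** 2
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        # map the adaptive step back to the full parameter space
        return param - self.lr * self.P @ (m_hat / (np.sqrt(v_hat) + self.eps))

# toy usage: one step on a random 64x32 "layer"
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))
opt = LowRankAdamSketch(W.shape, rank=4)
W = opt.step(W, rng.standard_normal(W.shape))
```

With this layout the two Adam moments occupy 2·r·n floats instead of 2·m·n, which is the "fraction of the model size" footprint the abstract refers to.
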
---
OA_place: repository
OA_type: green
_id: '18975'
abstract:
- lang: eng
  text: Leveraging second-order information about the loss at the scale of deep networks
    is one of the main lines of approach for improving the performance of current
    optimizers for deep learning. Yet, existing approaches for accurate full-matrix
    preconditioning, such as Full-Matrix Adagrad (GGT) or Matrix-Free Approximate
    Curvature (M-FAC), suffer from massive storage costs when applied even to small-scale
    models, as they must store a sliding window of gradients, whose memory requirements
    are multiplicative in the model dimension. In this paper, we address this issue
    via a novel and efficient error-feedback technique that can be applied to compress
    preconditioners by up to two orders of magnitude in practice, without loss of
    convergence. Specifically, our approach compresses the gradient information via
    sparsification or low-rank compression before it is fed into the preconditioner,
    feeding the compression error back into future iterations. Extensive experiments
    on deep neural networks show that this approach can compress full-matrix preconditioners
    to up to 99% sparsity without accuracy loss, effectively removing the memory overhead
    of full-matrix preconditioners such as GGT and M-FAC.
acknowledged_ssus:
- _id: CampIT
acknowledgement: The authors thank Adrian Vladu, Razvan Pascanu, Alexandra Peste,
  and Mher Safaryan for their valuable feedback, the IT department of the Institute
  of Science and Technology Austria for the hardware support, and Weights and Biases
  for the infrastructure to track all our experiments.
alternative_title:
- PMLR
article_processing_charge: No
arxiv: 1
author:
- first_name: Ionut-Vlad
  full_name: Modoranu, Ionut-Vlad
  id: 449f7a18-f128-11eb-9611-9b430c0c6333
  last_name: Modoranu
- first_name: Aleksei
  full_name: Kalinov, Aleksei
  id: 44b7120e-eb97-11eb-a6c2-e1557aa81d02
  last_name: Kalinov
  orcid: 0000-0003-2189-3904
- first_name: Eldar
  full_name: Kurtic, Eldar
  id: 47beb3a5-07b5-11eb-9b87-b108ec578218
  last_name: Kurtic
- first_name: Elias
  full_name: Frantar, Elias
  id: 09a8f98d-ec99-11ea-ae11-c063a7b7fe5f
  last_name: Frantar
- first_name: Dan-Adrian
  full_name: Alistarh, Dan-Adrian
  id: 4A899BFC-F248-11E8-B48F-1D18A9856A87
  last_name: Alistarh
  orcid: 0000-0003-3650-940X
citation:
  ama: 'Modoranu I-V, Kalinov A, Kurtic E, Frantar E, Alistarh D-A. Error feedback
    can accurately compress preconditioners. In: <i>41st International Conference
    on Machine Learning</i>. Vol 235. ML Research Press; 2024:35910-35933.'
  apa: 'Modoranu, I.-V., Kalinov, A., Kurtic, E., Frantar, E., &#38; Alistarh, D.-A.
    (2024). Error feedback can accurately compress preconditioners. In <i>41st International
    Conference on Machine Learning</i> (Vol. 235, pp. 35910–35933). Vienna, Austria:
    ML Research Press.'
  chicago: Modoranu, Ionut-Vlad, Aleksei Kalinov, Eldar Kurtic, Elias Frantar, and
    Dan-Adrian Alistarh. “Error Feedback Can Accurately Compress Preconditioners.”
    In <i>41st International Conference on Machine Learning</i>, 235:35910–33. ML
    Research Press, 2024.
  ieee: I.-V. Modoranu, A. Kalinov, E. Kurtic, E. Frantar, and D.-A. Alistarh, “Error
    feedback can accurately compress preconditioners,” in <i>41st International Conference
    on Machine Learning</i>, Vienna, Austria, 2024, vol. 235, pp. 35910–35933.
  ista: 'Modoranu I-V, Kalinov A, Kurtic E, Frantar E, Alistarh D-A. 2024. Error feedback
    can accurately compress preconditioners. 41st International Conference on Machine
    Learning. ICML: International Conference on Machine Learning, PMLR, vol. 235,
    35910–35933.'
  mla: Modoranu, Ionut-Vlad, et al. “Error Feedback Can Accurately Compress Preconditioners.”
    <i>41st International Conference on Machine Learning</i>, vol. 235, ML Research
    Press, 2024, pp. 35910–33.
  short: I.-V. Modoranu, A. Kalinov, E. Kurtic, E. Frantar, D.-A. Alistarh, in:, 41st
    International Conference on Machine Learning, ML Research Press, 2024, pp. 35910–35933.
conference:
  end_date: 2024-07-27
  location: Vienna, Austria
  name: 'ICML: International Conference on Machine Learning'
  start_date: 2024-07-21
corr_author: '1'
date_created: 2025-01-30T07:53:22Z
date_published: 2024-07-30T00:00:00Z
date_updated: 2025-01-30T07:54:16Z
day: '30'
department:
- _id: DaAl
external_id:
  arxiv:
  - '2306.06098'
intvolume: '235'
language:
- iso: eng
main_file_link:
- open_access: '1'
  url: https://doi.org/10.48550/arXiv.2306.06098
month: '07'
oa: 1
oa_version: Preprint
page: 35910-35933
publication: 41st International Conference on Machine Learning
publication_identifier:
  eissn:
  - 2640-3498
publication_status: published
publisher: ML Research Press
quality_controlled: '1'
scopus_import: '1'
status: public
title: Error feedback can accurately compress preconditioners
type: conference
user_id: 2DF688A6-F248-11E8-B48F-1D18A9856A87
volume: 235
year: '2024'
...
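
The core idea in the record above is to top-k sparsify each gradient before it enters the preconditioner's sliding window, while an error-feedback buffer accumulates the discarded mass and re-injects it into later gradients. Below is a hedged NumPy sketch of that pipeline around a GGT-style windowed preconditioner; the class and function names, window size, and ~99% sparsity default are assumptions for illustration, not the paper's API.

```python
# Sketch of error-feedback compression of a windowed preconditioner (record 18975).
import numpy as np

def topk_compress(x, k):
    """Keep the k largest-magnitude entries of x; return (compressed, residual)."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out, x - out

class CompressedGGTSketch:
    def __init__(self, dim, window=20, k=None, damping=1e-4):
        self.window, self.damping = window, damping
        self.k = k if k is not None else max(1, dim // 100)  # ~99% sparsity
        self.G = []                    # sliding window of *compressed* gradients
        self.err = np.zeros(dim)       # error-feedback accumulator

    def precondition(self, grad):
        # compress the error-corrected gradient, keep the residual for next time
        g_c, self.err = topk_compress(grad + self.err, self.k)
        self.G.append(g_c)
        if len(self.G) > self.window:
            self.G.pop(0)
        Gm = np.stack(self.G, axis=1)                    # dim x window matrix
        K = Gm.T @ Gm + self.damping * np.eye(Gm.shape[1])
        s, V = np.linalg.eigh(K)                         # K = G^T G, so s = sigma^2
        inv = s ** -1.5                                  # damping keeps K positive definite
        # (G G^T)^(-1/2) grad, computed via the small window-size eigenproblem
        return Gm @ ((V * inv) @ (V.T @ (Gm.T @ grad)))
```

Because the eigendecomposition acts on the window-by-window Gram matrix, preconditioning stays linear in the model dimension; the sparsification only shrinks what the window must store.
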
---
OA_place: repository
OA_type: green
_id: '19510'
abstract:
- lang: eng
  text: "We propose a new variant of the Adam optimizer [Kingma and Ba, 2014] called\r\nMICROADAM
    that specifically minimizes memory overheads, while maintaining\r\ntheoretical
    convergence guarantees. We achieve this by compressing the gradient\r\ninformation
    before it is fed into the optimizer state, thereby reducing its memory\r\nfootprint
    significantly. We control the resulting compression error via a novel\r\ninstance
    of the classical error feedback mechanism from distributed optimization [Seide
    et al., 2014, Alistarh et al., 2018, Karimireddy et al., 2019] in which\r\nthe
    error correction information is itself compressed to allow for practical memory\r\ngains.
    We prove that the resulting approach maintains theoretical convergence\r\nguarantees
    competitive to those of AMSGrad, while providing good practical performance. Specifically,
    we show that MICROADAM can be implemented efficiently\r\non GPUs: on both million-scale
    (BERT) and billion-scale (LLaMA) models, MICROADAM provides practical convergence
    competitive to that of the uncompressed\r\nAdam baseline, with lower memory usage
    and similar running time. Our code is\r\navailable at https://github.com/IST-DASLab/MicroAdam."
acknowledged_ssus:
- _id: CampIT
acknowledgement: The authors thank Razvan Pascanu, Mahdi Nikdan, and Soroush Tabesh
  for their valuable feedback, the IT department of the Institute of Science and Technology
  Austria for the hardware support, and Weights and Biases for the infrastructure to
  track all our experiments. Mher Safaryan has received funding from the European
  Union’s Horizon 2020 research and innovation program under the Marie Sklodowska-Curie
  grant agreement No 101034413.
alternative_title:
- Advances in Neural Information Processing Systems
article_processing_charge: No
arxiv: 1
author:
- first_name: Ionut-Vlad
  full_name: Modoranu, Ionut-Vlad
  id: 449f7a18-f128-11eb-9611-9b430c0c6333
  last_name: Modoranu
- first_name: Mher
  full_name: Safaryan, Mher
  id: dd546b39-0804-11ed-9c55-ef075c39778d
  last_name: Safaryan
- first_name: Grigory
  full_name: Malinovsky, Grigory
  last_name: Malinovsky
- first_name: Eldar
  full_name: Kurtic, Eldar
  id: 47beb3a5-07b5-11eb-9b87-b108ec578218
  last_name: Kurtic
- first_name: Thomas
  full_name: Robert, Thomas
  id: de632733-1457-11f0-ae22-b5914b8c1c41
  last_name: Robert
- first_name: Peter
  full_name: Richtárik, Peter
  last_name: Richtárik
- first_name: Dan-Adrian
  full_name: Alistarh, Dan-Adrian
  id: 4A899BFC-F248-11E8-B48F-1D18A9856A87
  last_name: Alistarh
  orcid: 0000-0003-3650-940X
citation:
  ama: 'Modoranu I-V, Safaryan M, Malinovsky G, et al. MICROADAM: Accurate adaptive
    optimization with low space overhead and provable convergence. In: <i>38th Conference
    on Neural Information Processing Systems</i>. Vol 37. Neural Information Processing
    Systems Foundation; 2024.'
  apa: 'Modoranu, I.-V., Safaryan, M., Malinovsky, G., Kurtic, E., Robert, T., Richtárik,
    P., &#38; Alistarh, D.-A. (2024). MICROADAM: Accurate adaptive optimization with
    low space overhead and provable convergence. In <i>38th Conference on Neural Information
    Processing Systems</i> (Vol. 37). Neural Information Processing Systems Foundation.'
  chicago: 'Modoranu, Ionut-Vlad, Mher Safaryan, Grigory Malinovsky, Eldar Kurtic,
    Thomas Robert, Peter Richtárik, and Dan-Adrian Alistarh. “MICROADAM: Accurate
    Adaptive Optimization with Low Space Overhead and Provable Convergence.” In <i>38th
    Conference on Neural Information Processing Systems</i>, Vol. 37. Neural Information
    Processing Systems Foundation, 2024.'
  ieee: 'I.-V. Modoranu <i>et al.</i>, “MICROADAM: Accurate adaptive optimization
    with low space overhead and provable convergence,” in <i>38th Conference on Neural
    Information Processing Systems</i>, 2024, vol. 37.'
  ista: 'Modoranu I-V, Safaryan M, Malinovsky G, Kurtic E, Robert T, Richtárik P,
    Alistarh D-A. 2024. MICROADAM: Accurate adaptive optimization with low space overhead
    and provable convergence. 38th Conference on Neural Information Processing Systems.
    Advances in Neural Information Processing Systems, vol. 37.'
  mla: 'Modoranu, Ionut-Vlad, et al. “MICROADAM: Accurate Adaptive Optimization with
    Low Space Overhead and Provable Convergence.” <i>38th Conference on Neural Information
    Processing Systems</i>, vol. 37, Neural Information Processing Systems Foundation,
    2024.'
  short: I.-V. Modoranu, M. Safaryan, G. Malinovsky, E. Kurtic, T. Robert, P. Richtárik,
    D.-A. Alistarh, in:, 38th Conference on Neural Information Processing Systems,
    Neural Information Processing Systems Foundation, 2024.
corr_author: '1'
date_created: 2025-04-06T22:01:32Z
date_published: 2024-12-20T00:00:00Z
date_updated: 2025-05-14T11:32:52Z
day: '20'
department:
- _id: DaAl
ec_funded: 1
external_id:
  arxiv:
  - '2405.15593'
intvolume: '37'
language:
- iso: eng
main_file_link:
- open_access: '1'
  url: https://doi.org/10.48550/arXiv.2405.15593
month: '12'
oa: 1
oa_version: Preprint
project:
- _id: fc2ed2f7-9c52-11eb-aca3-c01059dda49c
  call_identifier: H2020
  grant_number: '101034413'
  name: 'IST-BRIDGE: International postdoctoral program'
publication: 38th Conference on Neural Information Processing Systems
publication_identifier:
  issn:
  - 1049-5258
publication_status: published
publisher: Neural Information Processing Systems Foundation
quality_controlled: '1'
related_material:
  link:
  - relation: software
    url: https://github.com/IST-DASLab/MicroAdam
scopus_import: '1'
status: public
title: 'MICROADAM: Accurate adaptive optimization with low space overhead and provable
  convergence'
type: conference
user_id: 2DF688A6-F248-11E8-B48F-1D18A9856A87
volume: 37
year: '2024'
...
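
The MICROADAM abstract above names two ingredients: the gradient is sparsified before it updates the Adam state, and the error-feedback buffer is itself stored in compressed form. The sketch below keeps dense moments for readability and stands in a crude symmetric int8 quantizer for the buffer; in the authors' system (https://github.com/IST-DASLab/MicroAdam) the state lives in a sparse window and the compression scheme differs. All names and shapes here are assumptions.

```python
# Hedged sketch of the MicroAdam recipe from record 19510.
import numpy as np

def quantize_int8(x):
    """Crude symmetric int8 quantizer standing in for the paper's scheme."""
    scale = float(np.max(np.abs(x))) / 127.0 + 1e-12
    return np.round(x / scale).astype(np.int8), scale

class MicroAdamSketch:
    def __init__(self, dim, k, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.k, self.lr = k, lr
        self.beta1, self.beta2, self.eps = beta1, beta2, eps
        self.m = np.zeros(dim)               # dense here; sparse in the real system
        self.v = np.zeros(dim)
        self.err_q = np.zeros(dim, np.int8)  # error feedback, itself compressed
        self.err_scale = 0.0
        self.t = 0

    def step(self, param, grad):
        self.t += 1
        # decompress the stored error and fold it into the incoming gradient
        acc = grad + self.err_q.astype(np.float64) * self.err_scale
        g_sparse = np.zeros_like(acc)        # top-k sparsify before touching the state
        idx = np.argpartition(np.abs(acc), -self.k)[-self.k:]
        g_sparse[idx] = acc[idx]
        # compress the residual too, so the buffer itself stays cheap
        self.err_q, self.err_scale = quantize_int8(acc - g_sparse)
        self.m = self.beta1 * self.m + (1 - self.beta1) * g_sparse
        self.v = self.beta2 * self.v + (1 - self.beta2) * g_sparse ** 2
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return param - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
```
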
---
OA_place: repository
OA_type: green
_id: '19518'
abstract:
- lang: eng
  text: "The rising footprint of machine learning has led to a focus on imposing model\r\nsparsity
    as a means of reducing computational and memory costs. For deep neural\r\nnetworks
    (DNNs), the state-of-the-art accuracy-vs-sparsity is achieved by heuristics\r\ninspired
    by the classical Optimal Brain Surgeon (OBS) framework [LeCun et al.,\r\n1989,
    Hassibi and Stork, 1992, Hassibi et al., 1993], which leverages loss curvature\r\ninformation
    to make better pruning decisions. Yet, these results still lack a solid\r\ntheoretical
    understanding, and it is unclear whether they can be improved by\r\nleveraging
    connections to the wealth of work on sparse recovery algorithms. In this\r\npaper,
    we draw new connections between these two areas and present new sparse\r\nrecovery
    algorithms inspired by the OBS framework that comes with theoretical\r\nguarantees
    under reasonable assumptions and have strong practical performance.\r\nSpecifically,
    our work starts from the observation that we can leverage curvature\r\ninformation
    in OBS-like fashion upon the projection step of classic iterative sparse\r\nrecovery
    algorithms such as IHT. We show for the first time that this leads both\r\nto
    improved convergence bounds under standard assumptions. Furthermore, we\r\npresent
    extensions of this approach to the practical task of obtaining accurate sparse\r\nDNNs,
    and validate it experimentally at scale for Transformer-based models on\r\nvision
    and language tasks."
acknowledged_ssus:
- _id: CampIT
acknowledgement: The authors thank the anonymous NeurIPS reviewers for their useful
  comments and feedback, the IT department from the Institute of Science and Technology
  Austria for the hardware support, and Weights and Biases for the infrastructure
  to track all our experiments. Mher Safaryan has received funding from the European
  Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie
  grant agreement No 101034413.
alternative_title:
- Advances in Neural Information Processing Systems
article_processing_charge: No
arxiv: 1
author:
- first_name: Diyuan
  full_name: Wu, Diyuan
  id: 1a5914c2-896a-11ed-bdf8-fb80621a0635
  last_name: Wu
- first_name: Ionut-Vlad
  full_name: Modoranu, Ionut-Vlad
  id: 449f7a18-f128-11eb-9611-9b430c0c6333
  last_name: Modoranu
- first_name: Mher
  full_name: Safaryan, Mher
  id: dd546b39-0804-11ed-9c55-ef075c39778d
  last_name: Safaryan
- first_name: Denis
  full_name: Kuznedelev, Denis
  last_name: Kuznedelev
- first_name: Dan-Adrian
  full_name: Alistarh, Dan-Adrian
  id: 4A899BFC-F248-11E8-B48F-1D18A9856A87
  last_name: Alistarh
  orcid: 0000-0003-3650-940X
citation:
  ama: 'Wu D, Modoranu I-V, Safaryan M, Kuznedelev D, Alistarh D-A. The iterative
    optimal brain surgeon: Faster sparse recovery by leveraging second-order information.
    In: <i>38th Conference on Neural Information Processing Systems</i>. Vol 37. Neural
    Information Processing Systems Foundation; 2024.'
  apa: 'Wu, D., Modoranu, I.-V., Safaryan, M., Kuznedelev, D., &#38; Alistarh, D.-A.
    (2024). The iterative optimal brain surgeon: Faster sparse recovery by leveraging
    second-order information. In <i>38th Conference on Neural Information Processing
    Systems</i> (Vol. 37). Vancouver, Canada: Neural Information Processing Systems
    Foundation.'
  chicago: 'Wu, Diyuan, Ionut-Vlad Modoranu, Mher Safaryan, Denis Kuznedelev, and
    Dan-Adrian Alistarh. “The Iterative Optimal Brain Surgeon: Faster Sparse Recovery
    by Leveraging Second-Order Information.” In <i>38th Conference on Neural Information
    Processing Systems</i>, Vol. 37. Neural Information Processing Systems Foundation,
    2024.'
  ieee: 'D. Wu, I.-V. Modoranu, M. Safaryan, D. Kuznedelev, and D.-A. Alistarh, “The
    iterative optimal brain surgeon: Faster sparse recovery by leveraging second-order
    information,” in <i>38th Conference on Neural Information Processing Systems</i>,
    Vancouver, Canada, 2024, vol. 37.'
  ista: 'Wu D, Modoranu I-V, Safaryan M, Kuznedelev D, Alistarh D-A. 2024. The iterative
    optimal brain surgeon: Faster sparse recovery by leveraging second-order information.
    38th Conference on Neural Information Processing Systems. NeurIPS: Neural Information
    Processing Systems, Advances in Neural Information Processing Systems, vol. 37.'
  mla: 'Wu, Diyuan, et al. “The Iterative Optimal Brain Surgeon: Faster Sparse Recovery
    by Leveraging Second-Order Information.” <i>38th Conference on Neural Information
    Processing Systems</i>, vol. 37, Neural Information Processing Systems Foundation,
    2024.'
  short: D. Wu, I.-V. Modoranu, M. Safaryan, D. Kuznedelev, D.-A. Alistarh, in:, 38th
    Conference on Neural Information Processing Systems, Neural Information Processing
    Systems Foundation, 2024.
conference:
  end_date: 2024-12-15
  location: Vancouver, Canada
  name: 'NeurIPS: Neural Information Processing Systems'
  start_date: 2024-12-09
corr_author: '1'
date_created: 2025-04-06T22:01:32Z
date_published: 2024-12-20T00:00:00Z
date_updated: 2025-05-14T11:37:10Z
day: '20'
department:
- _id: DaAl
- _id: MaMo
ec_funded: 1
external_id:
  arxiv:
  - '2408.17163'
intvolume: '37'
language:
- iso: eng
main_file_link:
- open_access: '1'
  url: https://doi.org/10.48550/arXiv.2408.17163
month: '12'
oa: 1
oa_version: Preprint
project:
- _id: fc2ed2f7-9c52-11eb-aca3-c01059dda49c
  call_identifier: H2020
  grant_number: '101034413'
  name: 'IST-BRIDGE: International postdoctoral program'
publication: 38th Conference on Neural Information Processing Systems
publication_identifier:
  issn:
  - 1049-5258
publication_status: published
publisher: Neural Information Processing Systems Foundation
quality_controlled: '1'
scopus_import: '1'
status: public
title: 'The iterative optimal brain surgeon: Faster sparse recovery by leveraging
  second-order information'
type: conference
user_id: 2DF688A6-F248-11E8-B48F-1D18A9856A87
volume: 37
year: '2024'
...
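
The abstract above hinges on one substitution: in the projection step of iterative hard thresholding (IHT), rank coordinates by an OBS-style saliency, i.e., the estimated loss increase from pruning them, instead of by plain magnitude. A minimal sketch of both steps follows, assuming a diagonal Hessian approximation and no weight compensation after pruning; both simplifications are illustrative, not the paper's full method.

```python
# Hedged sketch of the curvature-aware projection idea from record 19518.
import numpy as np

def iht_step(x, grad, lr, k):
    """Plain IHT: gradient step, then keep the k largest-magnitude entries."""
    z = x - lr * grad
    out = np.zeros_like(z)
    idx = np.argpartition(np.abs(z), -k)[-k:]
    out[idx] = z[idx]
    return out

def obs_iht_step(x, grad, hess_diag, lr, k):
    """IHT with an OBS-style projection: rank coordinates by the loss increase
    0.5 * h_ii * w_i^2 incurred when pruning them, instead of by |w_i|."""
    z = x - lr * grad
    saliency = 0.5 * hess_diag * z ** 2
    out = np.zeros_like(z)
    idx = np.argpartition(saliency, -k)[-k:]   # keep the k most costly to prune
    out[idx] = z[idx]
    return out
```

The two functions differ only in the scoring rule, which is exactly where the abstract says curvature information enters the classic sparse recovery loop.
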
