---
OA_place: publisher
OA_type: diamond
_id: '20032'
abstract:
- lang: eng
  text: We propose Scalable Mechanistic Neural Network (S-MNN), an enhanced neural
    network framework designed for scientific machine learning applications involving
    long temporal sequences. By reformulating the original Mechanistic Neural Network
    (MNN) (Pervez et al., 2024), we reduce the computational time and space complexities
    from cubic and quadratic with respect to the sequence length, respectively, to
    linear. This significant improvement enables efficient modeling of long-term dynamics
    without sacrificing accuracy or interpretability. Extensive experiments demonstrate
    that S-MNN matches the original MNN in precision while substantially reducing
    computational resources. Consequently, S-MNN can serve as a drop-in replacement for the
    original MNN in applications, providing a practical and efficient tool for integrating
    mechanistic bottlenecks into neural network models of complex dynamical systems.
    Source code is available at https://github.com/IST-DASLab/ScalableMNN.
article_processing_charge: No
arxiv: 1
author:
- first_name: Jiale
  full_name: Chen, Jiale
  id: 4d0a9064-1ff6-11ee-9fa6-ec046c604785
  last_name: Chen
  orcid: 0000-0001-5337-5875
- first_name: Dingling
  full_name: Yao, Dingling
  id: d3e02e50-48a8-11ee-8f62-c108061797fa
  last_name: Yao
- first_name: Adeel A
  full_name: Pervez, Adeel A
  id: fca6d90c-d47f-11ee-bc87-93ff51604981
  last_name: Pervez
- first_name: Dan-Adrian
  full_name: Alistarh, Dan-Adrian
  id: 4A899BFC-F248-11E8-B48F-1D18A9856A87
  last_name: Alistarh
  orcid: 0000-0003-3650-940X
- first_name: Francesco
  full_name: Locatello, Francesco
  id: 26cfd52f-2483-11ee-8040-88983bcc06d4
  last_name: Locatello
  orcid: 0000-0002-4850-0683
citation:
  ama: 'Chen J, Yao D, Pervez AA, Alistarh D-A, Locatello F. Scalable mechanistic
    neural networks. In: <i>13th International Conference on Learning Representations</i>.
    ICLR; 2025:63716-63737.'
  apa: 'Chen, J., Yao, D., Pervez, A. A., Alistarh, D.-A., &#38; Locatello, F. (2025).
    Scalable mechanistic neural networks. In <i>13th International Conference on Learning
    Representations</i> (pp. 63716–63737). Singapore, Singapore: ICLR.'
  chicago: Chen, Jiale, Dingling Yao, Adeel A Pervez, Dan-Adrian Alistarh, and Francesco
    Locatello. “Scalable Mechanistic Neural Networks.” In <i>13th International Conference
    on Learning Representations</i>, 63716–37. ICLR, 2025.
  ieee: J. Chen, D. Yao, A. A. Pervez, D.-A. Alistarh, and F. Locatello, “Scalable
    mechanistic neural networks,” in <i>13th International Conference on Learning
    Representations</i>, Singapore, Singapore, 2025, pp. 63716–63737.
  ista: 'Chen J, Yao D, Pervez AA, Alistarh D-A, Locatello F. 2025. Scalable mechanistic
    neural networks. 13th International Conference on Learning Representations. ICLR:
    International Conference on Learning Representations, 63716–63737.'
  mla: Chen, Jiale, et al. “Scalable Mechanistic Neural Networks.” <i>13th International
    Conference on Learning Representations</i>, ICLR, 2025, pp. 63716–37.
  short: J. Chen, D. Yao, A.A. Pervez, D.-A. Alistarh, F. Locatello, in:, 13th International
    Conference on Learning Representations, ICLR, 2025, pp. 63716–63737.
conference:
  end_date: 2025-04-28
  location: Singapore, Singapore
  name: 'ICLR: International Conference on Learning Representations'
  start_date: 2025-04-24
corr_author: '1'
date_created: 2025-07-20T22:02:01Z
date_published: 2025-04-01T00:00:00Z
date_updated: 2025-08-04T08:03:11Z
day: '01'
ddc:
- '000'
department:
- _id: DaAl
- _id: FrLo
external_id:
  arxiv:
  - '2410.06074'
file:
- access_level: open_access
  checksum: 64cfdb12ae3e4e8ba57b1403e1066776
  content_type: application/pdf
  creator: dernst
  date_created: 2025-07-22T07:58:22Z
  date_updated: 2025-07-22T07:58:22Z
  file_id: '20065'
  file_name: 2025_ICLR_Chen.pdf
  file_size: 732745
  relation: main_file
  success: 1
file_date_updated: 2025-07-22T07:58:22Z
has_accepted_license: '1'
language:
- iso: eng
license: https://creativecommons.org/licenses/by/4.0/
month: '04'
oa: 1
oa_version: Published Version
page: 63716-63737
publication: 13th International Conference on Learning Representations
publication_identifier:
  isbn:
  - '9798331320850'
publication_status: published
publisher: ICLR
quality_controlled: '1'
related_material:
  link:
  - relation: software
    url: https://github.com/IST-DASLab/ScalableMNN
scopus_import: '1'
status: public
title: Scalable mechanistic neural networks
tmp:
  image: /images/cc_by.png
  legal_code_url: https://creativecommons.org/licenses/by/4.0/legalcode
  name: Creative Commons Attribution 4.0 International Public License (CC-BY 4.0)
  short: CC BY (4.0)
type: conference
user_id: 2DF688A6-F248-11E8-B48F-1D18A9856A87
year: '2025'
...
---
OA_place: publisher
OA_type: diamond
_id: '20033'
abstract:
- lang: eng
  text: 'A growing number of machine learning scenarios rely on knowledge distillation
    where one uses the output of a surrogate model as labels to supervise the training
    of a target model. In this work, we provide a sharp characterization of this process
    for ridgeless, high-dimensional regression, under two settings: (i) model shift,
    where the surrogate model is arbitrary, and (ii) distribution shift, where the
    surrogate model is the solution of empirical risk minimization with out-of-distribution
    data. In both cases, we characterize the precise risk of the target model through
    non-asymptotic bounds in terms of sample size and data distribution under mild
    conditions. As a consequence, we identify the form of the optimal surrogate model,
    which reveals the benefits and limitations of discarding weak features in a data-dependent
    fashion. In the context of weak-to-strong (W2S) generalization, this has the interpretation
    that (i) W2S training, with the surrogate as the weak model, can provably outperform
    training with strong labels under the same data budget, but (ii) it is unable
    to improve the data scaling law. We validate our results on numerical experiments
    both on ridgeless regression and on neural network architectures.'
acknowledgement: M.E.I., H.A.G., E.O.T., S.O. are supported by the NSF grants CCF-2046816,
  CCF-2403075, the Office of Naval Research grant N000142412289, an OpenAI Agentic
  AI Systems grant, and gifts by Open Philanthropy and Google Research. M. M. is funded
  by the European Union (ERC, INF2, project number 101161364). Views and opinions
  expressed are however those of the author(s) only and do not necessarily reflect
  those of the European Union or the European Research Council Executive Agency. Neither
  the European Union nor the granting authority can be held responsible for them.
article_processing_charge: No
arxiv: 1
author:
- first_name: M.
  full_name: Emrullah Ildiz, M.
  last_name: Emrullah Ildiz
- first_name: Halil Alperen
  full_name: Gozeten, Halil Alperen
  last_name: Gozeten
- first_name: Ege Onur
  full_name: Taga, Ege Onur
  last_name: Taga
- first_name: Marco
  full_name: Mondelli, Marco
  id: 27EB676C-8706-11E9-9510-7717E6697425
  last_name: Mondelli
  orcid: 0000-0002-3242-7020
- first_name: Samet
  full_name: Oymak, Samet
  last_name: Oymak
citation:
  ama: 'Emrullah Ildiz M, Gozeten HA, Taga EO, Mondelli M, Oymak S. High-dimensional
    analysis of knowledge distillation: Weak-to-Strong generalization and scaling
    laws. In: <i>13th International Conference on Learning Representations</i>. ICLR;
    2025:2967-3006.'
  apa: 'Emrullah Ildiz, M., Gozeten, H. A., Taga, E. O., Mondelli, M., &#38; Oymak,
    S. (2025). High-dimensional analysis of knowledge distillation: Weak-to-Strong
    generalization and scaling laws. In <i>13th International Conference on Learning
    Representations</i> (pp. 2967–3006). Singapore, Singapore: ICLR.'
  chicago: 'Emrullah Ildiz, M., Halil Alperen Gozeten, Ege Onur Taga, Marco Mondelli,
    and Samet Oymak. “High-Dimensional Analysis of Knowledge Distillation: Weak-to-Strong
    Generalization and Scaling Laws.” In <i>13th International Conference on Learning
    Representations</i>, 2967–3006. ICLR, 2025.'
  ieee: 'M. Emrullah Ildiz, H. A. Gozeten, E. O. Taga, M. Mondelli, and S. Oymak,
    “High-dimensional analysis of knowledge distillation: Weak-to-Strong generalization
    and scaling laws,” in <i>13th International Conference on Learning Representations</i>,
    Singapore, Singapore, 2025, pp. 2967–3006.'
  ista: 'Emrullah Ildiz M, Gozeten HA, Taga EO, Mondelli M, Oymak S. 2025. High-dimensional
    analysis of knowledge distillation: Weak-to-Strong generalization and scaling
    laws. 13th International Conference on Learning Representations. ICLR: International
    Conference on Learning Representations, 2967–3006.'
  mla: 'Emrullah Ildiz, M., et al. “High-Dimensional Analysis of Knowledge Distillation:
    Weak-to-Strong Generalization and Scaling Laws.” <i>13th International Conference
    on Learning Representations</i>, ICLR, 2025, pp. 2967–3006.'
  short: M. Emrullah Ildiz, H.A. Gozeten, E.O. Taga, M. Mondelli, S. Oymak, in:, 13th
    International Conference on Learning Representations, ICLR, 2025, pp. 2967–3006.
conference:
  end_date: 2025-04-28
  location: Singapore, Singapore
  name: 'ICLR: International Conference on Learning Representations'
  start_date: 2025-04-24
date_created: 2025-07-20T22:02:02Z
date_published: 2025-04-01T00:00:00Z
date_updated: 2025-08-04T08:33:58Z
day: '01'
ddc:
- '000'
department:
- _id: MaMo
external_id:
  arxiv:
  - '2410.18837'
file:
- access_level: open_access
  checksum: 5a38b093ebb4ee4eb662ea142621a5ca
  content_type: application/pdf
  creator: dernst
  date_created: 2025-08-04T08:32:38Z
  date_updated: 2025-08-04T08:32:38Z
  file_id: '20112'
  file_name: 2025_ICLR_Ildiz.pdf
  file_size: 528171
  relation: main_file
  success: 1
file_date_updated: 2025-08-04T08:32:38Z
has_accepted_license: '1'
language:
- iso: eng
month: '04'
oa: 1
oa_version: Published Version
page: 2967-3006
project:
- _id: 911e6d1f-16d5-11f0-9cad-c5c68c6a1cdf
  grant_number: '101161364'
  name: 'Inference in High Dimensions: Light-speed Algorithms and Information Limits'
publication: 13th International Conference on Learning Representations
publication_identifier:
  isbn:
  - '9798331320850'
publication_status: published
publisher: ICLR
quality_controlled: '1'
scopus_import: '1'
status: public
title: 'High-dimensional analysis of knowledge distillation: Weak-to-Strong generalization
  and scaling laws'
tmp:
  image: /images/cc_by.png
  legal_code_url: https://creativecommons.org/licenses/by/4.0/legalcode
  name: Creative Commons Attribution 4.0 International Public License (CC-BY 4.0)
  short: CC BY (4.0)
type: conference
user_id: 2DF688A6-F248-11E8-B48F-1D18A9856A87
year: '2025'
...
---
OA_place: publisher
OA_type: diamond
_id: '20034'
abstract:
- lang: eng
  text: We introduce LDAdam, a memory-efficient optimizer for training large models,
    that performs adaptive optimization steps within lower dimensional subspaces,
    while consistently exploring the full parameter space during training. This strategy
    keeps the optimizer's memory footprint to a fraction of the model size. LDAdam
    relies on a new projection-aware update rule for the optimizer states that allows
    for transitioning between subspaces, i.e., estimation of the statistics of the
    projected gradients. To mitigate the errors due to low-rank projection, LDAdam
    integrates a new generalized error feedback mechanism, which explicitly accounts
    for both gradient and optimizer state compression. We prove the convergence of
    LDAdam under standard assumptions, and provide empirical evidence that LDAdam
    allows for efficient fine-tuning and pre-training of language models.
article_processing_charge: No
arxiv: 1
author:
- first_name: Thomas
  full_name: Robert, Thomas
  last_name: Robert
- first_name: Mher
  full_name: Safaryan, Mher
  id: dd546b39-0804-11ed-9c55-ef075c39778d
  last_name: Safaryan
- first_name: Ionut-Vlad
  full_name: Modoranu, Ionut-Vlad
  id: 449f7a18-f128-11eb-9611-9b430c0c6333
  last_name: Modoranu
- first_name: Dan-Adrian
  full_name: Alistarh, Dan-Adrian
  id: 4A899BFC-F248-11E8-B48F-1D18A9856A87
  last_name: Alistarh
  orcid: 0000-0003-3650-940X
citation:
  ama: 'Robert T, Safaryan M, Modoranu I-V, Alistarh D-A. LDAdam: Adaptive optimization
    from low-dimensional gradient statistics. In: <i>13th International Conference
    on Learning Representations</i>. ICLR; 2025:101877-101913.'
  apa: 'Robert, T., Safaryan, M., Modoranu, I.-V., &#38; Alistarh, D.-A. (2025). LDAdam:
    Adaptive optimization from low-dimensional gradient statistics. In <i>13th International
    Conference on Learning Representations</i> (pp. 101877–101913). Singapore, Singapore:
    ICLR.'
  chicago: 'Robert, Thomas, Mher Safaryan, Ionut-Vlad Modoranu, and Dan-Adrian Alistarh.
    “LDAdam: Adaptive Optimization from Low-Dimensional Gradient Statistics.” In <i>13th
    International Conference on Learning Representations</i>, 101877–913. ICLR, 2025.'
  ieee: 'T. Robert, M. Safaryan, I.-V. Modoranu, and D.-A. Alistarh, “LDAdam: Adaptive
    optimization from low-dimensional gradient statistics,” in <i>13th International
    Conference on Learning Representations</i>, Singapore, Singapore, 2025, pp. 101877–101913.'
  ista: 'Robert T, Safaryan M, Modoranu I-V, Alistarh D-A. 2025. LDAdam: Adaptive
    optimization from low-dimensional gradient statistics. 13th International Conference
    on Learning Representations. ICLR: International Conference on Learning Representations,
    101877–101913.'
  mla: 'Robert, Thomas, et al. “LDAdam: Adaptive Optimization from Low-Dimensional
    Gradient Statistics.” <i>13th International Conference on Learning Representations</i>,
    ICLR, 2025, pp. 101877–913.'
  short: T. Robert, M. Safaryan, I.-V. Modoranu, D.-A. Alistarh, in:, 13th International
    Conference on Learning Representations, ICLR, 2025, pp. 101877–101913.
conference:
  end_date: 2025-04-28
  location: Singapore, Singapore
  name: 'ICLR: International Conference on Learning Representations'
  start_date: 2025-04-24
corr_author: '1'
date_created: 2025-07-20T22:02:02Z
date_published: 2025-04-01T00:00:00Z
date_updated: 2025-08-04T08:41:10Z
day: '01'
ddc:
- '000'
department:
- _id: DaAl
external_id:
  arxiv:
  - '2410.16103'
file:
- access_level: open_access
  checksum: 9327d82569358d7bf1c3ec1a9952e721
  content_type: application/pdf
  creator: dernst
  date_created: 2025-08-04T08:39:51Z
  date_updated: 2025-08-04T08:39:51Z
  file_id: '20113'
  file_name: 2025_ICLR_Robert.pdf
  file_size: 1346111
  relation: main_file
  success: 1
file_date_updated: 2025-08-04T08:39:51Z
has_accepted_license: '1'
language:
- iso: eng
month: '04'
oa: 1
oa_version: Published Version
page: 101877-101913
publication: 13th International Conference on Learning Representations
publication_identifier:
  isbn:
  - '9798331320850'
publication_status: published
publisher: ICLR
quality_controlled: '1'
related_material:
  link:
  - relation: software
    url: https://github.com/IST-DASLab/LDAdam
scopus_import: '1'
status: public
title: 'LDAdam: Adaptive optimization from low-dimensional gradient statistics'
tmp:
  image: /images/cc_by.png
  legal_code_url: https://creativecommons.org/licenses/by/4.0/legalcode
  name: Creative Commons Attribution 4.0 International Public License (CC-BY 4.0)
  short: CC BY (4.0)
type: conference
user_id: 2DF688A6-F248-11E8-B48F-1D18A9856A87
year: '2025'
...
---
OA_place: publisher
OA_type: diamond
_id: '20035'
abstract:
- lang: eng
  text: "Deep neural networks (DNNs) at convergence consistently represent the training
    data in the last layer via a geometric structure referred to as neural collapse.
    This empirical evidence has spurred a line of theoretical research aimed at proving
    the emergence of neural collapse, mostly focusing on the unconstrained features
    model. Here, the features of the penultimate layer are free variables, which makes
    the model data-agnostic and puts into question its ability to capture DNN training.
    Our work addresses the issue, moving away from unconstrained features and studying
    DNNs that end with at least two linear layers. We first prove generic guarantees
    on neural collapse that assume (i) low training error and balancedness of linear
    layers (for within-class variability collapse), and (ii) bounded conditioning
    of the features before the linear part (for orthogonality of class-means, and
    their alignment with weight matrices). The balancedness refers to the fact that
    W_{ℓ+1}^⊤ W_{ℓ+1} ≈ W_ℓ W_ℓ^⊤ for any pair of consecutive weight matrices of the linear part,
    and the bounded conditioning requires a well-behaved ratio between largest and
    smallest non-zero singular values of the features. We then show that such assumptions
    hold for gradient descent training with weight decay: (i) for networks with a
    wide first layer, we prove low training error and balancedness, and (ii) for solutions
    that are either nearly optimal or stable under large learning rates, we additionally
    prove the bounded conditioning. Taken together, our results are the first to show
    neural collapse in the end-to-end training of DNNs."
acknowledgement: M. M. and P. S. are funded by the European Union (ERC, INF2, project
  number 101161364). Views and opinions expressed are however those of the author(s)
  only and do not necessarily reflect those of the European Union or the European
  Research Council Executive Agency. Neither the European Union nor the granting authority
  can be held responsible for them.
article_processing_charge: No
arxiv: 1
author:
- first_name: Arthur
  full_name: Jacot, Arthur
  last_name: Jacot
- first_name: Peter
  full_name: Súkeník, Peter
  id: d64d6a8d-eb8e-11eb-b029-96fd216dec3c
  last_name: Súkeník
- first_name: Zihan
  full_name: Wang, Zihan
  last_name: Wang
- first_name: Marco
  full_name: Mondelli, Marco
  id: 27EB676C-8706-11E9-9510-7717E6697425
  last_name: Mondelli
  orcid: 0000-0002-3242-7020
citation:
  ama: 'Jacot A, Súkeník P, Wang Z, Mondelli M. Wide neural networks trained with
    weight decay provably exhibit neural collapse. In: <i>13th International Conference
    on Learning Representations</i>. ICLR; 2025:1905-1931.'
  apa: 'Jacot, A., Súkeník, P., Wang, Z., &#38; Mondelli, M. (2025). Wide neural networks
    trained with weight decay provably exhibit neural collapse. In <i>13th International
    Conference on Learning Representations</i> (pp. 1905–1931). Singapore, Singapore:
    ICLR.'
  chicago: Jacot, Arthur, Peter Súkeník, Zihan Wang, and Marco Mondelli. “Wide Neural
    Networks Trained with Weight Decay Provably Exhibit Neural Collapse.” In <i>13th
    International Conference on Learning Representations</i>, 1905–31. ICLR, 2025.
  ieee: A. Jacot, P. Súkeník, Z. Wang, and M. Mondelli, “Wide neural networks trained
    with weight decay provably exhibit neural collapse,” in <i>13th International
    Conference on Learning Representations</i>, Singapore, Singapore, 2025, pp. 1905–1931.
  ista: 'Jacot A, Súkeník P, Wang Z, Mondelli M. 2025. Wide neural networks trained
    with weight decay provably exhibit neural collapse. 13th International Conference
    on Learning Representations. ICLR: International Conference on Learning Representations,
    1905–1931.'
  mla: Jacot, Arthur, et al. “Wide Neural Networks Trained with Weight Decay Provably
    Exhibit Neural Collapse.” <i>13th International Conference on Learning Representations</i>,
    ICLR, 2025, pp. 1905–31.
  short: A. Jacot, P. Súkeník, Z. Wang, M. Mondelli, in:, 13th International Conference
    on Learning Representations, ICLR, 2025, pp. 1905–1931.
conference:
  end_date: 2025-04-28
  location: Singapore, Singapore
  name: 'ICLR: International Conference on Learning Representations'
  start_date: 2025-04-24
corr_author: '1'
date_created: 2025-07-20T22:02:02Z
date_published: 2025-04-01T00:00:00Z
date_updated: 2025-08-04T08:47:00Z
day: '01'
ddc:
- '000'
department:
- _id: MaMo
external_id:
  arxiv:
  - '2410.04887'
file:
- access_level: open_access
  checksum: 59c48c173887139647cc9839c0801136
  content_type: application/pdf
  creator: dernst
  date_created: 2025-08-04T08:45:43Z
  date_updated: 2025-08-04T08:45:43Z
  file_id: '20114'
  file_name: 2025_ICLR_Jacot.pdf
  file_size: 1337236
  relation: main_file
  success: 1
file_date_updated: 2025-08-04T08:45:43Z
has_accepted_license: '1'
language:
- iso: eng
month: '04'
oa: 1
oa_version: Published Version
page: 1905-1931
project:
- _id: 911e6d1f-16d5-11f0-9cad-c5c68c6a1cdf
  grant_number: '101161364'
  name: 'Inference in High Dimensions: Light-speed Algorithms and Information Limits'
publication: 13th International Conference on Learning Representations
publication_identifier:
  isbn:
  - '9798331320850'
publication_status: published
publisher: ICLR
quality_controlled: '1'
scopus_import: '1'
status: public
title: Wide neural networks trained with weight decay provably exhibit neural collapse
tmp:
  image: /images/cc_by.png
  legal_code_url: https://creativecommons.org/licenses/by/4.0/legalcode
  name: Creative Commons Attribution 4.0 International Public License (CC-BY 4.0)
  short: CC BY (4.0)
type: conference
user_id: 2DF688A6-F248-11E8-B48F-1D18A9856A87
year: '2025'
...
---
OA_place: publisher
OA_type: diamond
_id: '20036'
abstract:
- lang: eng
  text: 'We introduce NeCo: Patch Neighbor Consistency, a novel self-supervised training
    loss that enforces patch-level nearest neighbor consistency across a student and
    teacher model. Compared to contrastive approaches that only yield binary learning
    signals, i.e. "attract" and "repel", this approach benefits from the more fine-grained
    learning signal of sorting spatially dense features relative to reference patches.
    Our method leverages differentiable sorting applied on top of pretrained representations,
    such as DINOv2-registers, to bootstrap the learning signal and further improve
    upon them. This dense post-pretraining leads to superior performance across various
    models and datasets, despite requiring only 19 hours on a single GPU. This method
    generates high-quality dense feature encoders and establishes several new state-of-the-art
    results such as +2.3% and +4.2% for non-parametric in-context semantic segmentation
    on ADE20k and Pascal VOC, +1.6% and +4.8% for linear segmentation evaluations
    on COCO-Things and -Stuff and improvements in the 3D understanding of multi-view
    consistency on SPair-71k, by more than 1.5%.'
article_processing_charge: No
arxiv: 1
author:
- first_name: Valentinos
  full_name: Pariza, Valentinos
  last_name: Pariza
- first_name: Mohammadreza
  full_name: Salehi, Mohammadreza
  last_name: Salehi
- first_name: Gertjan
  full_name: Burghouts, Gertjan
  last_name: Burghouts
- first_name: Francesco
  full_name: Locatello, Francesco
  id: 26cfd52f-2483-11ee-8040-88983bcc06d4
  last_name: Locatello
  orcid: 0000-0002-4850-0683
- first_name: Yuki M.
  full_name: Asano, Yuki M.
  last_name: Asano
citation:
  ama: 'Pariza V, Salehi M, Burghouts G, Locatello F, Asano YM. Near, far: Patch-ordering
    enhances vision foundation models’ scene understanding. In: <i>13th International
    Conference on Learning Representations</i>. ICLR; 2025:72303-72330.'
  apa: 'Pariza, V., Salehi, M., Burghouts, G., Locatello, F., &#38; Asano, Y. M. (2025).
    Near, far: Patch-ordering enhances vision foundation models’ scene understanding.
    In <i>13th International Conference on Learning Representations</i> (pp. 72303–72330).
    Singapore, Singapore: ICLR.'
  chicago: 'Pariza, Valentinos, Mohammadreza Salehi, Gertjan Burghouts, Francesco
    Locatello, and Yuki M. Asano. “Near, Far: Patch-Ordering Enhances Vision Foundation
    Models’ Scene Understanding.” In <i>13th International Conference on Learning
    Representations</i>, 72303–30. ICLR, 2025.'
  ieee: 'V. Pariza, M. Salehi, G. Burghouts, F. Locatello, and Y. M. Asano, “Near,
    far: Patch-ordering enhances vision foundation models’ scene understanding,” in
    <i>13th International Conference on Learning Representations</i>, Singapore, Singapore,
    2025, pp. 72303–72330.'
  ista: 'Pariza V, Salehi M, Burghouts G, Locatello F, Asano YM. 2025. Near, far:
    Patch-ordering enhances vision foundation models’ scene understanding. 13th International
    Conference on Learning Representations. ICLR: International Conference on Learning
    Representations, 72303–72330.'
  mla: 'Pariza, Valentinos, et al. “Near, Far: Patch-Ordering Enhances Vision Foundation
    Models’ Scene Understanding.” <i>13th International Conference on Learning Representations</i>,
    ICLR, 2025, pp. 72303–30.'
  short: V. Pariza, M. Salehi, G. Burghouts, F. Locatello, Y.M. Asano, in:, 13th International
    Conference on Learning Representations, ICLR, 2025, pp. 72303–72330.
conference:
  end_date: 2025-04-28
  location: Singapore, Singapore
  name: 'ICLR: International Conference on Learning Representations'
  start_date: 2025-04-24
date_created: 2025-07-20T22:02:03Z
date_published: 2025-04-01T00:00:00Z
date_updated: 2025-08-04T08:10:55Z
day: '01'
ddc:
- '000'
department:
- _id: FrLo
external_id:
  arxiv:
  - '2408.11054'
file:
- access_level: open_access
  checksum: ddbe981f3ad3f6cb6daf12c954822eb8
  content_type: application/pdf
  creator: dernst
  date_created: 2025-08-04T08:09:43Z
  date_updated: 2025-08-04T08:09:43Z
  file_id: '20109'
  file_name: 2025_ICLR_Pariza.pdf
  file_size: 37788223
  relation: main_file
  success: 1
file_date_updated: 2025-08-04T08:09:43Z
has_accepted_license: '1'
language:
- iso: eng
month: '04'
oa: 1
oa_version: Published Version
page: 72303-72330
publication: 13th International Conference on Learning Representations
publication_identifier:
  isbn:
  - '9798331320850'
publication_status: published
publisher: ICLR
quality_controlled: '1'
scopus_import: '1'
status: public
title: 'Near, far: Patch-ordering enhances vision foundation models'' scene understanding'
tmp:
  image: /images/cc_by.png
  legal_code_url: https://creativecommons.org/licenses/by/4.0/legalcode
  name: Creative Commons Attribution 4.0 International Public License (CC-BY 4.0)
  short: CC BY (4.0)
type: conference
user_id: 2DF688A6-F248-11E8-B48F-1D18A9856A87
year: '2025'
...
---
OA_place: publisher
OA_type: diamond
_id: '20037'
abstract:
- lang: eng
  text: 'Disentangling polysemantic neurons is at the core of many current approaches
    to interpretability of large language models. Here we attempt to study how disentanglement
    can be used to understand performance, particularly under weight sparsity, a leading
    post-training optimization technique. We suggest a novel measure for estimating
    neuronal entanglement: the Wasserstein distance of a neuron''s output distribution
    to a Gaussian. Moreover, we show the existence of a small number of highly entangled
    "Wasserstein Neurons" in each linear layer of an LLM, characterized by their highly
    non-Gaussian output distributions, their role in mapping similar inputs to dissimilar
    outputs, and their significant impact on model accuracy. To study these phenomena,
    we propose a new experimental framework for disentangling polysemantic neurons.
    Our framework separates each layer''s inputs to create a mixture of experts where
    each neuron''s output is computed by a mixture of neurons of lower Wasserstein
    distance, each better at maintaining accuracy when sparsified without retraining.
    We provide strong evidence that this is because the mixture of sparse experts
    is effectively disentangling the input-output relationship of individual neurons,
    in particular the difficult Wasserstein neurons.'
acknowledgement: "The authors would like to extend their gratitude to Lori Leu for
  her insightful comments on the application of the Wasserstein distance metric.
  We also wish to thank Elias Frantar for his help in working with the SparseGPT
  implementation and his advice for the project. Additionally, we would like to thank
  Tony Tong Wang and Thomas Athey for their valuable feedback and constructive discussions. This
  work was supported by an NIH Brains CONNECTS U01 grant and AMD’s AI & HPC Fund."
article_processing_charge: No
arxiv: 1
author:
- first_name: Shashata
  full_name: Sawmya, Shashata
  last_name: Sawmya
- first_name: Linghao
  full_name: Kong, Linghao
  last_name: Kong
- first_name: Ilia
  full_name: Markov, Ilia
  id: D0CF4148-C985-11E9-8066-0BDEE5697425
  last_name: Markov
- first_name: Dan-Adrian
  full_name: Alistarh, Dan-Adrian
  id: 4A899BFC-F248-11E8-B48F-1D18A9856A87
  last_name: Alistarh
  orcid: 0000-0003-3650-940X
- first_name: Nir
  full_name: Shavit, Nir
  last_name: Shavit
citation:
  ama: 'Sawmya S, Kong L, Markov I, Alistarh D-A, Shavit N. Wasserstein distances,
    neuronal entanglement, and sparsity. In: <i>13th International Conference on Learning
    Representations</i>. ICLR; 2025:26244-26274.'
  apa: 'Sawmya, S., Kong, L., Markov, I., Alistarh, D.-A., &#38; Shavit, N. (2025).
    Wasserstein distances, neuronal entanglement, and sparsity. In <i>13th International
    Conference on Learning Representations</i> (pp. 26244–26274). Singapore, Singapore:
    ICLR.'
  chicago: Sawmya, Shashata, Linghao Kong, Ilia Markov, Dan-Adrian Alistarh, and Nir
    Shavit. “Wasserstein Distances, Neuronal Entanglement, and Sparsity.” In <i>13th
    International Conference on Learning Representations</i>, 26244–74. ICLR, 2025.
  ieee: S. Sawmya, L. Kong, I. Markov, D.-A. Alistarh, and N. Shavit, “Wasserstein
    distances, neuronal entanglement, and sparsity,” in <i>13th International Conference
    on Learning Representations</i>, Singapore, Singapore, 2025, pp. 26244–26274.
  ista: 'Sawmya S, Kong L, Markov I, Alistarh D-A, Shavit N. 2025. Wasserstein distances,
    neuronal entanglement, and sparsity. 13th International Conference on Learning
    Representations. ICLR: International Conference on Learning Representations, 26244–26274.'
  mla: Sawmya, Shashata, et al. “Wasserstein Distances, Neuronal Entanglement, and
    Sparsity.” <i>13th International Conference on Learning Representations</i>, ICLR,
    2025, pp. 26244–74.
  short: S. Sawmya, L. Kong, I. Markov, D.-A. Alistarh, N. Shavit, in:, 13th International
    Conference on Learning Representations, ICLR, 2025, pp. 26244–26274.
conference:
  end_date: 2025-04-28
  location: Singapore, Singapore
  name: 'ICLR: International Conference on Learning Representations'
  start_date: 2025-04-24
corr_author: '1'
date_created: 2025-07-20T22:02:03Z
date_published: 2025-04-01T00:00:00Z
date_updated: 2025-08-04T08:16:43Z
day: '01'
ddc:
- '000'
department:
- _id: DaAl
external_id:
  arxiv:
  - '2405.15756'
file:
- access_level: open_access
  checksum: 39a8fa7dbdd7029859e156f53f20f6bc
  content_type: application/pdf
  creator: dernst
  date_created: 2025-08-04T08:14:09Z
  date_updated: 2025-08-04T08:14:09Z
  file_id: '20110'
  file_name: 2025_ICLR_Sawmya.pdf
  file_size: 5447177
  relation: main_file
  success: 1
file_date_updated: 2025-08-04T08:14:09Z
has_accepted_license: '1'
language:
- iso: eng
month: '04'
oa: 1
oa_version: Published Version
page: 26244-26274
publication: 13th International Conference on Learning Representations
publication_identifier:
  isbn:
  - '9798331320850'
publication_status: published
publisher: ICLR
quality_controlled: '1'
related_material:
  link:
  - relation: software
    url: https://github.com/Shavit-Lab/Sparse-Expansion
scopus_import: '1'
status: public
title: Wasserstein distances, neuronal entanglement, and sparsity
tmp:
  image: /images/cc_by.png
  legal_code_url: https://creativecommons.org/licenses/by/4.0/legalcode
  name: Creative Commons Attribution 4.0 International Public License (CC-BY 4.0)
  short: CC BY (4.0)
type: conference
user_id: 2DF688A6-F248-11E8-B48F-1D18A9856A87
year: '2025'
...
---
OA_place: publisher
OA_type: diamond
_id: '20038'
abstract:
- lang: eng
  text: Pruning eliminates unnecessary parameters in neural networks; it offers a
    promising solution to the growing computational demands of large language models
    (LLMs). While many focus on post-training pruning, sparse pre-training--which
    combines pruning and pre-training into a single phase--provides a simpler alternative.
    In this work, we present the first systematic exploration of optimal sparse pre-training
    configurations for LLMs through an examination of 80 unique pruning schedules
    across different sparsity levels and training durations. We find that initiating
    pruning at 25% of total training compute and concluding at 75% achieves near-optimal
    final evaluation loss. These findings provide valuable insights for efficient
    and effective sparse pre-training of LLMs. Furthermore, we propose a new scaling
    law that modifies the Chinchilla scaling law to use the average parameter count
    over pre-training. Through empirical and theoretical validation, we demonstrate
    that this modified scaling law accurately models evaluation loss for both sparsely
    and densely pre-trained LLMs, unifying scaling laws across pre-training paradigms.
    Our findings indicate that while sparse pre-training achieves the same final model
    quality as dense pre-training for equivalent compute budgets, it provides substantial
    benefits through reduced model size, enabling significant potential computational
    savings during inference.
acknowledgement: "We are deeply grateful to Elias Frantar, Naveen Kumar, Sanjiv Kumar,
  Daniel\r\nM. Roy, and Clemens Schaefer for their valuable feedback and thoughtful
  review of this paper.\r\nWe also acknowledge the critical support provided by the
  Google CoreML Performance Team, and Google Research during this project. We further
  recognize the extended team at Google DeepMind, who enabled and supported this research
  direction.\r\nThis work was in part supported by the Sloan Foundation, the MIT-IBM
  Watson AI Lab, Apple, and SRC JUMP 2.0 (CoCoSys)."
article_processing_charge: No
arxiv: 1
author:
- first_name: Tian
  full_name: Jin, Tian
  last_name: Jin
- first_name: Ahmed Imtiaz
  full_name: Humayun, Ahmed Imtiaz
  last_name: Humayun
- first_name: Utku
  full_name: Evci, Utku
  last_name: Evci
- first_name: Suvinay
  full_name: Subramanian, Suvinay
  last_name: Subramanian
- first_name: Amir
  full_name: Yazdanbakhsh, Amir
  last_name: Yazdanbakhsh
- first_name: Dan-Adrian
  full_name: Alistarh, Dan-Adrian
  id: 4A899BFC-F248-11E8-B48F-1D18A9856A87
  last_name: Alistarh
  orcid: 0000-0003-3650-940X
- first_name: Gintare Karolina
  full_name: Dziugaite, Gintare Karolina
  last_name: Dziugaite
citation:
  ama: 'Jin T, Humayun AI, Evci U, et al. The journey matters: Average parameter count
    over pre-training unifies sparse and dense scaling laws. In: <i>13th International
    Conference on Learning Representations</i>. ICLR; 2025:85165-85181.'
  apa: 'Jin, T., Humayun, A. I., Evci, U., Subramanian, S., Yazdanbakhsh, A., Alistarh,
    D.-A., &#38; Dziugaite, G. K. (2025). The journey matters: Average parameter count
    over pre-training unifies sparse and dense scaling laws. In <i>13th International
    Conference on Learning Representations</i> (pp. 85165–85181). Singapore, Singapore:
    ICLR.'
  chicago: 'Jin, Tian, Ahmed Imtiaz Humayun, Utku Evci, Suvinay Subramanian, Amir
    Yazdanbakhsh, Dan-Adrian Alistarh, and Gintare Karolina Dziugaite. “The Journey
    Matters: Average Parameter Count over Pre-Training Unifies Sparse and Dense Scaling
    Laws.” In <i>13th International Conference on Learning Representations</i>, 85165–81.
    ICLR, 2025.'
  ieee: 'T. Jin <i>et al.</i>, “The journey matters: Average parameter count over
    pre-training unifies sparse and dense scaling laws,” in <i>13th International
    Conference on Learning Representations</i>, Singapore, Singapore, 2025, pp. 85165–85181.'
  ista: 'Jin T, Humayun AI, Evci U, Subramanian S, Yazdanbakhsh A, Alistarh D-A, Dziugaite
    GK. 2025. The journey matters: Average parameter count over pre-training unifies
    sparse and dense scaling laws. 13th International Conference on Learning Representations.
    ICLR: International Conference on Learning Representations, 85165–85181.'
  mla: 'Jin, Tian, et al. “The Journey Matters: Average Parameter Count over Pre-Training
    Unifies Sparse and Dense Scaling Laws.” <i>13th International Conference on Learning
    Representations</i>, ICLR, 2025, pp. 85165–81.'
  short: T. Jin, A.I. Humayun, U. Evci, S. Subramanian, A. Yazdanbakhsh, D.-A. Alistarh,
    G.K. Dziugaite, in:, 13th International Conference on Learning Representations,
    ICLR, 2025, pp. 85165–85181.
conference:
  end_date: 2025-04-28
  location: Singapore, Singapore
  name: 'ICLR: International Conference on Learning Representations'
  start_date: 2025-04-24
date_created: 2025-07-20T22:02:03Z
date_published: 2025-04-01T00:00:00Z
date_updated: 2025-08-04T08:24:59Z
day: '01'
ddc:
- '000'
department:
- _id: DaAl
external_id:
  arxiv:
  - '2501.12486'
file:
- access_level: open_access
  checksum: dbc27120e9aba67dffbd9e5d513a6803
  content_type: application/pdf
  creator: dernst
  date_created: 2025-08-04T08:23:47Z
  date_updated: 2025-08-04T08:23:47Z
  file_id: '20111'
  file_name: 2025_ICLR_Jin.pdf
  file_size: 704989
  relation: main_file
  success: 1
file_date_updated: 2025-08-04T08:23:47Z
has_accepted_license: '1'
language:
- iso: eng
month: '04'
oa: 1
oa_version: Published Version
page: 85165-85181
publication: 13th International Conference on Learning Representations
publication_identifier:
  isbn:
  - '9798331320850'
publication_status: published
publisher: ICLR
quality_controlled: '1'
scopus_import: '1'
status: public
title: 'The journey matters: Average parameter count over pre-training unifies sparse
  and dense scaling laws'
tmp:
  image: /images/cc_by.png
  legal_code_url: https://creativecommons.org/licenses/by/4.0/legalcode
  name: Creative Commons Attribution 4.0 International Public License (CC-BY 4.0)
  short: CC BY (4.0)
type: conference
user_id: 2DF688A6-F248-11E8-B48F-1D18A9856A87
year: '2025'
...
