The journey matters: Average parameter count over pre-training unifies sparse and dense scaling laws

Jin T, Humayun AI, Evci U, Subramanian S, Yazdanbakhsh A, Alistarh D-A, Dziugaite GK. 2025. The journey matters: Average parameter count over pre-training unifies sparse and dense scaling laws. 13th International Conference on Learning Representations. ICLR: International Conference on Learning Representations, 85165–85181.

Download
2025_ICLR_Jin.pdf, 704.99 KB [Published Version, Open Access]
Conference Paper | Published | English

Scopus indexed
Author
Jin, Tian; Humayun, Ahmed Imtiaz; Evci, Utku; Subramanian, Suvinay; Yazdanbakhsh, Amir; Alistarh, Dan-Adrian (ISTA); Dziugaite, Gintare Karolina
Department
Abstract
Pruning eliminates unnecessary parameters in neural networks; it offers a promising solution to the growing computational demands of large language models (LLMs). While many approaches focus on post-training pruning, sparse pre-training--which combines pruning and pre-training into a single phase--provides a simpler alternative. In this work, we present the first systematic exploration of optimal sparse pre-training configurations for LLMs through an examination of 80 unique pruning schedules across different sparsity levels and training durations. We find that initiating pruning at 25% of total training compute and concluding at 75% achieves near-optimal final evaluation loss. These findings provide valuable insights for efficient and effective sparse pre-training of LLMs. Furthermore, we propose a new scaling law that modifies the Chinchilla scaling law to use the average parameter count over pre-training. Through empirical and theoretical validation, we demonstrate that this modified scaling law accurately models evaluation loss for both sparsely and densely pre-trained LLMs, unifying scaling laws across pre-training paradigms. Our findings indicate that while sparse pre-training achieves the same final model quality as dense pre-training for equivalent compute budgets, it provides substantial benefits through reduced model size, enabling significant potential computational savings during inference.
Publishing Year
2025
Date Published
2025-04-01
Proceedings Title
13th International Conference on Learning Representations
Publisher
ICLR
Acknowledgement
We are deeply grateful to Elias Frantar, Naveen Kumar, Sanjiv Kumar, Daniel M. Roy, and Clemens Schaefer for their valuable feedback and thoughtful review of this paper. We also acknowledge the critical support provided by the Google CoreML Performance Team, and Google Research during this project. We further recognize the extended team at Google DeepMind, who enabled and supported this research direction. This work was in part supported by the Sloan Foundation, the MIT-IBM Watson AI Lab, Apple, and SRC JUMP 2.0 (CoCoSys).
Page
85165-85181
Conference
ICLR: International Conference on Learning Representations
Conference Location
Singapore, Singapore
Conference Date
2025-04-24 – 2025-04-28
IST-REx-ID

Cite this

Jin T, Humayun AI, Evci U, et al. The journey matters: Average parameter count over pre-training unifies sparse and dense scaling laws. In: 13th International Conference on Learning Representations. ICLR; 2025:85165-85181.
Jin, T., Humayun, A. I., Evci, U., Subramanian, S., Yazdanbakhsh, A., Alistarh, D.-A., & Dziugaite, G. K. (2025). The journey matters: Average parameter count over pre-training unifies sparse and dense scaling laws. In 13th International Conference on Learning Representations (pp. 85165–85181). Singapore, Singapore: ICLR.
Jin, Tian, Ahmed Imtiaz Humayun, Utku Evci, Suvinay Subramanian, Amir Yazdanbakhsh, Dan-Adrian Alistarh, and Gintare Karolina Dziugaite. “The Journey Matters: Average Parameter Count over Pre-Training Unifies Sparse and Dense Scaling Laws.” In 13th International Conference on Learning Representations, 85165–81. ICLR, 2025.
T. Jin et al., “The journey matters: Average parameter count over pre-training unifies sparse and dense scaling laws,” in 13th International Conference on Learning Representations, Singapore, Singapore, 2025, pp. 85165–85181.
Jin T, Humayun AI, Evci U, Subramanian S, Yazdanbakhsh A, Alistarh D-A, Dziugaite GK. 2025. The journey matters: Average parameter count over pre-training unifies sparse and dense scaling laws. 13th International Conference on Learning Representations. ICLR: International Conference on Learning Representations, 85165–85181.
Jin, Tian, et al. “The Journey Matters: Average Parameter Count over Pre-Training Unifies Sparse and Dense Scaling Laws.” 13th International Conference on Learning Representations, ICLR, 2025, pp. 85165–81.
All files available under the following license(s):
Creative Commons Attribution 4.0 International Public License (CC-BY 4.0)
Main File(s)
File Name: 2025_ICLR_Jin.pdf
Access Level: Open Access
Date Uploaded: 2025-08-04
MD5 Checksum: dbc27120e9aba67dffbd9e5d513a6803



Sources

arXiv 2501.12486
