Sparse Fine-Tuning for Inference Acceleration of Large Language Models
Kurtic E, Kuznedelev D, Frantar E, Goin M, Pandit S, Agarwalla A, Nguyen T, Marques A, Kurtz M, Alistarh D-A. 2025. Sparse Fine-Tuning for Inference Acceleration of Large Language Models. In: Enhancing LLM Performance. Efficacy, Fine-Tuning, and Inference Techniques. Machine Translation: Technologies and Applications, 83–97.
Book Chapter | Published | English
Author
Kurtic, Eldar (ISTA);
Kuznedelev, Denis;
Frantar, Elias (ISTA);
Goin, Michael;
Pandit, Shubhra;
Agarwalla, Abhinav;
Nguyen, Tuan;
Marques, Alexandre;
Kurtz, Mark;
Alistarh, Dan-Adrian (ISTA)
Book Editor
Passban, Peyman;
Way, Andy;
Rezagholizadeh, Mehdi
Corresponding author has ISTA affiliation
Series Title
Machine Translation: Technologies and Applications
Abstract
We investigate the problem of accurate sparse fine-tuning of large language models (LLMs), that is, fine-tuning pre-trained LLMs on specialized tasks, while inducing sparsity in their weights. Our work is motivated by experiments showing that standard loss-based fine-tuning methods are not able to achieve high accuracy in this setting, especially at high sparsity targets. To address this issue, we perform a detailed study of knowledge distillation losses for fine-tuning of sparse models. We determine an L2-based distillation approach that we term ‘SquareHead’, which enables accurate recovery even at higher sparsities. Investigating the question of efficient inference, we show that sparse LLMs can be executed faster by taking advantage of sparsity. Specifically, we exhibit end-to-end results showing speedups enabled by sparsity, while recovering accuracy, on the following models and tasks, respectively: T5 for language translation, Whisper for speech translation, and open GPT-type models such as the Mosaic Pre-Trained Transformer (MPT) and Llama-2 models for text generation. In particular, for popular generative tasks, we show for the first time that sparse fine-tuning can reach 75% sparsity without drops in accuracy, and provide notable end-to-end speedups for inference on CPUs. Moreover, we also highlight that sparsity is compatible with other compression approaches, such as quantization.
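For illustration, the 'SquareHead' objective described in the abstract is, following the arXiv version of this work (arXiv 2310.06927), a per-layer L2 distillation between the dense teacher's and the sparse student's feature maps, with each layer's error normalized by the teacher's feature norm. The sketch below is a minimal PyTorch rendering under that assumption; the function name, the eps constant, and the uniform averaging over layers are illustrative choices rather than details taken from the chapter, and the full training objective also combines this term with the task loss and a logit-distillation term.

    import torch
    import torch.nn.functional as F

    def squarehead_feature_loss(student_feats: list[torch.Tensor],
                                teacher_feats: list[torch.Tensor],
                                eps: float = 1e-6) -> torch.Tensor:
        """Per-layer normalized L2 feature distillation (SquareHead-style).

        student_feats / teacher_feats: lists of per-layer hidden states,
        each of shape (batch, seq_len, hidden_dim), from the sparse
        student and the dense teacher respectively.
        """
        total = 0.0
        for fs, ft in zip(student_feats, teacher_feats):
            ft = ft.detach()  # teacher activations are fixed targets
            # Dividing by the teacher's mean squared activation keeps every
            # layer's contribution on a comparable scale, regardless of the
            # magnitude of that layer's representations.
            total = total + F.mse_loss(fs, ft) / (ft.pow(2).mean() + eps)
        return total / len(student_feats)

In a fine-tuning loop, the two feature lists would typically be obtained by running teacher and student with hidden-state outputs enabled (e.g. output_hidden_states=True on Hugging Face transformer models) and adding the result to the cross-entropy task loss.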
Publishing Year
2025
Date Published
2025-07-05
Book Title
Enhancing LLM Performance. Efficacy, Fine-Tuning, and Inference Techniques
Publisher
Springer Nature
Acknowledgement
We would like to thank Eugenia Iofinova for useful comments on an earlier version of this draft, and Artur Niederfahrenhorst for useful suggestions regarding fine-tuning on the GSM8k dataset.
Page
83-97
Cite this
Kurtic E, Kuznedelev D, Frantar E, et al. Sparse Fine-Tuning for Inference Acceleration of Large Language Models. In: Passban P, Way A, Rezagholizadeh M, eds. Enhancing LLM Performance. Efficacy, Fine-Tuning, and Inference Techniques. Springer Nature; 2025:83-97. doi:10.1007/978-3-031-85747-8_6
Kurtic, E., Kuznedelev, D., Frantar, E., Goin, M., Pandit, S., Agarwalla, A., … Alistarh, D.-A. (2025). Sparse Fine-Tuning for Inference Acceleration of Large Language Models. In P. Passban, A. Way, & M. Rezagholizadeh (Eds.), Enhancing LLM Performance. Efficacy, Fine-Tuning, and Inference Techniques (pp. 83–97). Springer Nature. https://doi.org/10.1007/978-3-031-85747-8_6
Kurtic, Eldar, Denis Kuznedelev, Elias Frantar, Michael Goin, Shubhra Pandit, Abhinav Agarwalla, Tuan Nguyen, Alexandre Marques, Mark Kurtz, and Dan-Adrian Alistarh. “Sparse Fine-Tuning for Inference Acceleration of Large Language Models.” In Enhancing LLM Performance. Efficacy, Fine-Tuning, and Inference Techniques, edited by Peyman Passban, Andy Way, and Mehdi Rezagholizadeh, 83–97. Springer Nature, 2025. https://doi.org/10.1007/978-3-031-85747-8_6.
E. Kurtic et al., “Sparse Fine-Tuning for Inference Acceleration of Large Language Models,” in Enhancing LLM Performance. Efficacy, Fine-Tuning, and Inference Techniques, P. Passban, A. Way, and M. Rezagholizadeh, Eds. Springer Nature, 2025, pp. 83–97.
Kurtic E, Kuznedelev D, Frantar E, Goin M, Pandit S, Agarwalla A, Nguyen T, Marques A, Kurtz M, Alistarh D-A. 2025. Sparse Fine-Tuning for Inference Acceleration of Large Language Models. In: Enhancing LLM Performance. Efficacy, Fine-Tuning, and Inference Techniques. Machine Translation: Technologies and Applications, 83–97.
Kurtic, Eldar, et al. “Sparse Fine-Tuning for Inference Acceleration of Large Language Models.” Enhancing LLM Performance. Efficacy, Fine-Tuning, and Inference Techniques, edited by Peyman Passban et al., Springer Nature, 2025, pp. 83–97, doi:10.1007/978-3-031-85747-8_6.
All files available under the following license(s):
Copyright Statement:
This Item is protected by copyright and/or related rights. [...]
Link(s) to Main File(s)
Access Level
Open Access
Sources
arXiv 2310.06927
