Sparse Fine-Tuning for Inference Acceleration of Large Language Models
Kurtic E, Kuznedelev D, Frantar E, Goin M, Pandit S, Agarwalla A, Nguyen T, Marques A, Kurtz M, Alistarh D-A. 2025. Sparse Fine-Tuning for Inference Acceleration of Large Language Models. In: Enhancing LLM Performance. Efficacy, Fine-Tuning, and Inference Techniques. Machine Translation: Technologies and Applications, 83–97.
Book Chapter | Published | English
Author
Kurtic, Eldar (ISTA);
Kuznedelev, Denis;
Frantar, Elias (ISTA);
Goin, Michael;
Pandit, Shubhra;
Agarwalla, Abhinav;
Nguyen, Tuan;
Marques, Alexandre;
Kurtz, Mark;
Alistarh, Dan-Adrian (ISTA)
Book Editor
Passban, Peyman;
Way, Andy;
Rezagholizadeh, Mehdi
Corresponding author has ISTA affiliation
Series Title
Machine Translation: Technologies and Applications
Abstract
We investigate the problem of accurate sparse fine-tuning of large language models (LLMs), that is, fine-tuning pre-trained LLMs on specialized tasks, while inducing sparsity in their weights. Our work is motivated by experiments showing that standard loss-based fine-tuning methods are not able to achieve high accuracy in this setting, especially at high sparsity targets. To address this issue, we perform a detailed study of knowledge distillation losses for fine-tuning of sparse models. We determine an L2-based distillation approach that we term ‘SquareHead’, which enables accurate recovery even at higher sparsities. Investigating the question of efficient inference, we show that sparse LLMs can be executed faster by taking advantage of sparsity. Specifically, we exhibit end-to-end results showing speedups enabled by sparsity, while recovering accuracy, on the following models and tasks, respectively: T5 for language translation, Whisper for speech translation, and open GPT-type models such as the Mosaic Pre-Trained Transformer (MPT) and Llama-2 models for text generation. In particular, for popular generative tasks, we show for the first time that sparse fine-tuning can reach 75% sparsity without drops in accuracy, and provide notable end-to-end speedups for inference on CPUs. Moreover, we also highlight that sparsity is compatible with other compression approaches, such as quantization.
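For illustration, the 'SquareHead' objective described in the abstract is, following the arXiv version of this work (arXiv 2310.06927), a per-layer L2 distillation between the dense teacher's and the sparse student's feature maps, with each layer's error normalized by the teacher's feature norm. The sketch below is a minimal PyTorch rendering under that assumption; the function name, the eps constant, and the uniform averaging over layers are illustrative choices rather than details taken from the chapter, and the full training objective also combines this term with the task loss and a logit-distillation term.

    import torch
    import torch.nn.functional as F

    def squarehead_feature_loss(student_feats: list[torch.Tensor],
                                teacher_feats: list[torch.Tensor],
                                eps: float = 1e-6) -> torch.Tensor:
        """Per-layer normalized L2 feature distillation (SquareHead-style).

        student_feats / teacher_feats: lists of per-layer hidden states,
        each of shape (batch, seq_len, hidden_dim), from the sparse
        student and the dense teacher respectively.
        """
        total = 0.0
        for fs, ft in zip(student_feats, teacher_feats):
            ft = ft.detach()  # teacher activations are fixed targets
            # Dividing by the teacher's mean squared activation keeps every
            # layer's contribution on a comparable scale, regardless of the
            # magnitude of that layer's representations.
            total = total + F.mse_loss(fs, ft) / (ft.pow(2).mean() + eps)
        return total / len(student_feats)

In a fine-tuning loop, the two feature lists would typically be obtained by running teacher and student with hidden-state outputs enabled (e.g. output_hidden_states=True on Hugging Face transformer models) and adding the result to the cross-entropy task loss.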
Publishing Year
2025
Date Published
2025-07-05
Book Title
Enhancing LLM Performance. Efficacy, Fine-Tuning, and Inference Techniques
Publisher
Springer Nature
Acknowledgement
We would like to thank Eugenia Iofinova for useful comments on an earlier version of this draft, and Artur Niederfahrenhorst for useful suggestions regarding fine-tuning on the GSM8k dataset.
Page
83-97
Cite this
Kurtic E, Kuznedelev D, Frantar E, et al. Sparse Fine-Tuning for Inference Acceleration of Large Language Models. In: Passban P, Way A, Rezagholizadeh M, eds. Enhancing LLM Performance. Efficacy, Fine-Tuning, and Inference Techniques. Springer Nature; 2025:83-97. doi:10.1007/978-3-031-85747-8_6
Kurtic, E., Kuznedelev, D., Frantar, E., Goin, M., Pandit, S., Agarwalla, A., … Alistarh, D.-A. (2025). Sparse Fine-Tuning for Inference Acceleration of Large Language Models. In P. Passban, A. Way, & M. Rezagholizadeh (Eds.), Enhancing LLM Performance. Efficacy, Fine-Tuning, and Inference Techniques (pp. 83–97). Springer Nature. https://doi.org/10.1007/978-3-031-85747-8_6
Kurtic, Eldar, Denis Kuznedelev, Elias Frantar, Michael Goin, Shubhra Pandit, Abhinav Agarwalla, Tuan Nguyen, Alexandre Marques, Mark Kurtz, and Dan-Adrian Alistarh. “Sparse Fine-Tuning for Inference Acceleration of Large Language Models.” In Enhancing LLM Performance. Efficacy, Fine-Tuning, and Inference Techniques, edited by Peyman Passban, Andy Way, and Mehdi Rezagholizadeh, 83–97. Springer Nature, 2025. https://doi.org/10.1007/978-3-031-85747-8_6.
E. Kurtic et al., “Sparse Fine-Tuning for Inference Acceleration of Large Language Models,” in Enhancing LLM Performance. Efficacy, Fine-Tuning, and Inference Techniques, P. Passban, A. Way, and M. Rezagholizadeh, Eds. Springer Nature, 2025, pp. 83–97.
Kurtic E, Kuznedelev D, Frantar E, Goin M, Pandit S, Agarwalla A, Nguyen T, Marques A, Kurtz M, Alistarh D-A. 2025. Sparse Fine-Tuning for Inference Acceleration of Large Language Models. In: Enhancing LLM Performance. Efficacy, Fine-Tuning, and Inference Techniques. Machine Translation: Technologies and Applications, 83–97.
Kurtic, Eldar, et al. “Sparse Fine-Tuning for Inference Acceleration of Large Language Models.” Enhancing LLM Performance. Efficacy, Fine-Tuning, and Inference Techniques, edited by Peyman Passban et al., Springer Nature, 2025, pp. 83–97, doi:10.1007/978-3-031-85747-8_6.
All files available under the following license(s):
Copyright Statement:
This Item is protected by copyright and/or related rights. [...]
Link(s) to Main File(s)
Access Level
Open Access
Sources
arXiv 2310.06927
