PV-tuning: Beyond straight-through estimation for extreme LLM compression
Malinovskii V, Mazur D, Ilin I, Kuznedelev D, Burlachenko K, Yi K, Alistarh D-A, Richtarik P. 2024. PV-tuning: Beyond straight-through estimation for extreme LLM compression. 38th Conference on Neural Information Processing Systems. NeurIPS: Neural Information Processing Systems, Advances in Neural Information Processing Systems, vol. 37.
Conference Paper | Published | English | Scopus indexed
Author
Malinovskii, Vladimir;
Mazur, Denis;
Ilin, Ivan;
Kuznedelev, Denis;
Burlachenko, Konstantin;
Yi, Kai;
Alistarh, Dan-Adrian (ISTA);
Richtarik, Peter

Series Title
Advances in Neural Information Processing Systems
Abstract
There has been significant interest in "extreme" compression of large language models (LLMs), i.e., to 1-2 bits per parameter, which allows such models to be executed efficiently on resource-constrained devices. Existing work has focused on improved one-shot quantization techniques and weight representations; yet, purely post-training approaches are reaching diminishing returns in terms of the accuracy-vs-bit-width trade-off. State-of-the-art quantization methods such as QuIP# and AQLM include fine-tuning (part of) the compressed parameters over a limited amount of calibration data; however, such fine-tuning techniques over compressed weights often make exclusive use of straight-through estimators (STE), whose performance is not well understood in this setting. In this work, we question the use of STE for extreme LLM compression, showing that it can be sub-optimal, and perform a systematic study of quantization-aware fine-tuning strategies for LLMs. We propose PV-Tuning, a representation-agnostic framework that generalizes and improves upon existing fine-tuning strategies and provides convergence guarantees in restricted cases. On the practical side, when used for 1-2 bit vector quantization, PV-Tuning outperforms prior techniques for highly performant models such as Llama and Mistral. Using PV-Tuning, we achieve the first Pareto-optimal quantization for Llama-2 family models at 2 bits per parameter.
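For readers unfamiliar with the straight-through estimator the abstract questions, the following is a minimal, illustrative PyTorch sketch. Scalar rounding stands in for the paper's vector-quantized representation, and the class and variable names are hypothetical, not taken from the authors' codebase:

```python
import torch

class STEQuantize(torch.autograd.Function):
    """Straight-through estimator (sketch): quantize on the forward
    pass, pass gradients through unchanged on the backward pass."""

    @staticmethod
    def forward(ctx, weight, scale):
        # Round weights to a coarse grid (a scalar stand-in for the
        # vector quantization used in the paper).
        return torch.round(weight / scale) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # STE: treat the non-differentiable rounding as the identity,
        # which is exactly the approximation the paper shows can be
        # sub-optimal for extreme (1-2 bit) compression.
        return grad_output, None

# Usage: fine-tune a continuous latent weight through its quantized view.
w = torch.randn(4, 4, requires_grad=True)
w_q = STEQuantize.apply(w, 0.5)  # forward uses the quantized weights
loss = (w_q ** 2).sum()
loss.backward()                  # backward ignores the rounding
print(w.grad)
```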
Publishing Year
2024
Date Published
2024-12-20
Proceedings Title
38th Conference on Neural Information Processing Systems
Publisher
Neural Information Processing Systems Foundation
Acknowledgement
The authors would like to thank Vage Egiazarian, Andrei Panferov, and Ruslan Svirschevski for their help and advice on the AQLM codebase and running large-scale experiments. We also thank Philip Zmushko and Artem Fedorov for helpful discussions during the early stages of our research. The research of Kai Yi, Konstantin Burlachenko, and Peter Richtárik reported in this publication was supported by funding from King Abdullah University of Science and Technology (KAUST) – Center of Excellence for Generative AI, under award number 5940. We would also like to thank our NeurIPS reviewers for their helpful suggestions; we specifically highlight p3Lv's suggestions to consider smaller codebook sizes and to evaluate PV-Tuning with QuIP#, both of which produced interesting findings. Finally, we thank the open-source contributors from llama.cpp and the LocalLlama community for discussions and inspiration on practical use cases of quantized language models, and in particular Yalda Shabanzadeh and Arthur Aardvark for their help with improving the codebase.
Volume
37
Conference
NeurIPS: Neural Information Processing Systems
Conference Location
Vancouver, Canada
Conference Date
2024-12-10 – 2024-12-15
Cite this
Malinovskii V, Mazur D, Ilin I, et al. PV-tuning: Beyond straight-through estimation for extreme LLM compression. In: 38th Conference on Neural Information Processing Systems. Vol 37. Neural Information Processing Systems Foundation; 2024.
Malinovskii, V., Mazur, D., Ilin, I., Kuznedelev, D., Burlachenko, K., Yi, K., … Richtarik, P. (2024). PV-tuning: Beyond straight-through estimation for extreme LLM compression. In 38th Conference on Neural Information Processing Systems (Vol. 37). Vancouver, Canada: Neural Information Processing Systems Foundation.
Malinovskii, Vladimir, Denis Mazur, Ivan Ilin, Denis Kuznedelev, Konstantin Burlachenko, Kai Yi, Dan-Adrian Alistarh, and Peter Richtarik. “PV-Tuning: Beyond Straight-through Estimation for Extreme LLM Compression.” In 38th Conference on Neural Information Processing Systems, Vol. 37. Neural Information Processing Systems Foundation, 2024.
V. Malinovskii et al., “PV-tuning: Beyond straight-through estimation for extreme LLM compression,” in 38th Conference on Neural Information Processing Systems, Vancouver, Canada, 2024, vol. 37.
Malinovskii V, Mazur D, Ilin I, Kuznedelev D, Burlachenko K, Yi K, Alistarh D-A, Richtarik P. 2024. PV-tuning: Beyond straight-through estimation for extreme LLM compression. 38th Conference on Neural Information Processing Systems. NeurIPS: Neural Information Processing Systems, Advances in Neural Information Processing Systems, vol. 37.
Malinovskii, Vladimir, et al. “PV-Tuning: Beyond Straight-through Estimation for Extreme LLM Compression.” 38th Conference on Neural Information Processing Systems, vol. 37, Neural Information Processing Systems Foundation, 2024.
All files available under the following license(s):
Copyright Statement:
This Item is protected by copyright and/or related rights. [...]
Main File(s)
File Name
2024_NeurIPS_Malinovskii.pdf
939.71 KB
Date Uploaded
2025-04-07
MD5 Checksum
54d36f947887e26d0e568b512167001a
Sources
arXiv 2405.14852