PV-tuning: Beyond straight-through estimation for extreme LLM compression
Malinovskii V, Mazur D, Ilin I, Kuznedelev D, Burlachenko K, Yi K, Alistarh D-A, Richtarik P. 2024. PV-tuning: Beyond straight-through estimation for extreme LLM compression. 38th Conference on Neural Information Processing Systems. NeurIPS: Neural Information Processing Systems, Advances in Neural Information Processing Systems, vol. 37.
Conference Paper | Published | English | Scopus indexed
Author
Malinovskii, Vladimir;
Mazur, Denis;
Ilin, Ivan;
Kuznedelev, Denis;
Burlachenko, Konstantin;
Yi, Kai;
Alistarh, Dan-Adrian (ISTA);
Richtarik, Peter

Series Title
Advances in Neural Information Processing Systems
Abstract
There has been significant interest in "extreme" compression of large language models (LLMs), i.e., to 1-2 bits per parameter, which allows such models to be executed efficiently on resource-constrained devices. Existing work has focused on improved one-shot quantization techniques and weight representations; yet, purely post-training approaches are reaching diminishing returns in terms of the accuracy-vs-bit-width trade-off. State-of-the-art quantization methods such as QuIP# and AQLM include fine-tuning (part of) the compressed parameters over a limited amount of calibration data; however, such fine-tuning techniques over compressed weights often make exclusive use of straight-through estimators (STE), whose performance is not well understood in this setting. In this work, we question the use of STE for extreme LLM compression, showing that it can be sub-optimal, and perform a systematic study of quantization-aware fine-tuning strategies for LLMs. We propose PV-Tuning, a representation-agnostic framework that generalizes and improves upon existing fine-tuning strategies and provides convergence guarantees in restricted cases. On the practical side, when used for 1-2 bit vector quantization, PV-Tuning outperforms prior techniques for highly performant models such as Llama and Mistral. Using PV-Tuning, we achieve the first Pareto-optimal quantization for Llama-2 family models at 2 bits per parameter.
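For readers unfamiliar with the straight-through estimator the abstract questions, the following is a minimal, illustrative PyTorch sketch. Scalar rounding stands in for the paper's vector-quantized representation, and the class and variable names are hypothetical, not taken from the authors' codebase:

```python
import torch

class STEQuantize(torch.autograd.Function):
    """Straight-through estimator (sketch): quantize on the forward
    pass, pass gradients through unchanged on the backward pass."""

    @staticmethod
    def forward(ctx, weight, scale):
        # Round weights to a coarse grid (a scalar stand-in for the
        # vector quantization used in the paper).
        return torch.round(weight / scale) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # STE: treat the non-differentiable rounding as the identity,
        # which is exactly the approximation the paper shows can be
        # sub-optimal for extreme (1-2 bit) compression.
        return grad_output, None

# Usage: fine-tune a continuous latent weight through its quantized view.
w = torch.randn(4, 4, requires_grad=True)
w_q = STEQuantize.apply(w, 0.5)  # forward uses the quantized weights
loss = (w_q ** 2).sum()
loss.backward()                  # backward ignores the rounding
print(w.grad)
```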
Publishing Year
2024
Date Published
2024-12-20
Proceedings Title
38th Conference on Neural Information Processing Systems
Publisher
Neural Information Processing Systems Foundation
Acknowledgement
The authors would like to thank Vage Egiazarian, Andrei Panferov, and Ruslan Svirschevski for their help and advice on the AQLM codebase and running large-scale experiments. We also thank Philip Zmushko and Artem Fedorov for helpful discussions during the early stages of our research. The research of Kai Yi, Konstantin Burlachenko, and Peter Richtárik reported in this publication was supported by funding from King Abdullah University of Science and Technology (KAUST) – Center of Excellence for Generative AI, under award number 5940. We would also like to thank our NeurIPS reviewers for their helpful suggestions; we specifically highlight p3Lv's suggestions to consider smaller codebook sizes and to evaluate PV-Tuning with QuIP#, both of which produced interesting findings. Finally, we thank the open-source contributors from llama.cpp and the LocalLlama community for discussions and inspiration on practical use cases of quantized language models, and in particular Yalda Shabanzadeh and Arthur Aardvark for their help with improving the codebase.
Volume
37
Conference
NeurIPS: Neural Information Processing Systems
Conference Location
Vancouver, Canada
Conference Date
2024-12-10 – 2024-12-15
Cite this
Malinovskii V, Mazur D, Ilin I, et al. PV-tuning: Beyond straight-through estimation for extreme LLM compression. In: 38th Conference on Neural Information Processing Systems. Vol 37. Neural Information Processing Systems Foundation; 2024.
Malinovskii, V., Mazur, D., Ilin, I., Kuznedelev, D., Burlachenko, K., Yi, K., … Richtarik, P. (2024). PV-tuning: Beyond straight-through estimation for extreme LLM compression. In 38th Conference on Neural Information Processing Systems (Vol. 37). Vancouver, Canada: Neural Information Processing Systems Foundation.
Malinovskii, Vladimir, Denis Mazur, Ivan Ilin, Denis Kuznedelev, Konstantin Burlachenko, Kai Yi, Dan-Adrian Alistarh, and Peter Richtarik. “PV-Tuning: Beyond Straight-through Estimation for Extreme LLM Compression.” In 38th Conference on Neural Information Processing Systems, Vol. 37. Neural Information Processing Systems Foundation, 2024.
V. Malinovskii et al., “PV-tuning: Beyond straight-through estimation for extreme LLM compression,” in 38th Conference on Neural Information Processing Systems, Vancouver, Canada, 2024, vol. 37.
Malinovskii V, Mazur D, Ilin I, Kuznedelev D, Burlachenko K, Yi K, Alistarh D-A, Richtarik P. 2024. PV-tuning: Beyond straight-through estimation for extreme LLM compression. 38th Conference on Neural Information Processing Systems. NeurIPS: Neural Information Processing Systems, Advances in Neural Information Processing Systems, vol. 37.
Malinovskii, Vladimir, et al. “PV-Tuning: Beyond Straight-through Estimation for Extreme LLM Compression.” 38th Conference on Neural Information Processing Systems, vol. 37, Neural Information Processing Systems Foundation, 2024.
All files available under the following license(s):
Copyright Statement:
This Item is protected by copyright and/or related rights. [...]
Main File(s)
File Name
2024_NeurIPS_Malinovskii.pdf
939.71 KB
Date Uploaded
2025-04-07
MD5 Checksum
54d36f947887e26d0e568b512167001a
Sources
arXiv 2405.14852