SpQR: A sparse-quantized representation for near-lossless LLM weight compression
Dettmers T, Svirschevski RA, Egiazarian V, Kuznedelev D, Frantar E, Ashkboos S, Borzunov A, Hoefler T, Alistarh D-A. 2024. SpQR: A sparse-quantized representation for near-lossless LLM weight compression. 12th International Conference on Learning Representations. ICLR: International Conference on Learning Representations.
Conference Paper | Published | English
Scopus indexed
Author
Dettmers, Tim;
Svirschevski, Ruslan A.;
Egiazarian, Vage;
Kuznedelev, Denis;
Frantar, Elias (ISTA);
Ashkboos, Saleh;
Borzunov, Alexander;
Hoefler, Torsten;
Alistarh, Dan-Adrian (ISTA)

Department
Abstract
Recent advances in large language model (LLM) pretraining have led to high-quality LLMs with impressive abilities. By compressing such LLMs via quantization to 3-4 bits per parameter, they can fit into memory-limited devices such as laptops and mobile phones, enabling personalized use. However, quantizing models to 3-4 bits per parameter can lead to moderate-to-high accuracy losses, especially for smaller models (1-10B parameters), which are well suited for edge deployment. To address this accuracy issue, we introduce the Sparse-Quantized Representation (SpQR), a new compressed format and quantization technique that, for the first time, enables near-lossless compression of LLMs across model scales while reaching compression levels similar to previous methods. SpQR works by identifying and isolating outlier weights, which cause particularly large quantization errors, storing them in higher precision while compressing all other weights to 3-4 bits, and achieves relative accuracy losses of less than 1% in perplexity for highly accurate LLaMA and Falcon LLMs. This makes it possible to run a 33B-parameter LLM on a single 24 GB consumer GPU without performance degradation, at a 15% speedup, thus making powerful LLMs available to consumers without any downsides. SpQR comes with efficient algorithms both for encoding weights into its format and for decoding them at runtime. Specifically, we provide an efficient GPU inference algorithm for SpQR that yields faster inference than 16-bit baselines at similar accuracy while enabling memory compression gains of more than 4x.
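For context on the format described in the abstract, the following is a minimal, hypothetical sketch of the sparse-plus-quantized idea: quantize weights in small groups to 3-4 bits, then keep the weights that incur the largest quantization error in full precision as a sparse set of outliers. The group size, outlier fraction, error-based selection rule, and function names are illustrative assumptions, not the paper's actual encoding or GPU decoding algorithms.

```python
import numpy as np

def quantize_dequantize(w, bits=3):
    """Uniform quantize-dequantize of a 1-D weight group (illustrative only)."""
    levels = 2 ** bits - 1
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.round((w - lo) / scale)
    return q * scale + lo

def sparse_plus_quantized(W, bits=3, group_size=16, outlier_frac=0.01):
    """Toy sparse-plus-quantized split: quantize per group, then store the
    weights with the largest quantization error exactly as sparse outliers.
    Hypothetical illustration of the format, not the paper's algorithm."""
    flat = W.astype(np.float64).ravel()
    deq = np.empty_like(flat)
    # Per-group low-bit quantization of all weights.
    for start in range(0, flat.size, group_size):
        g = flat[start:start + group_size]
        deq[start:start + group_size] = quantize_dequantize(g, bits)
    # Pick the ~outlier_frac fraction of weights with the worst error.
    err = np.abs(flat - deq)
    k = max(1, int(outlier_frac * flat.size))
    outlier_idx = np.argpartition(err, -k)[-k:]
    reconstructed = deq.copy()
    reconstructed[outlier_idx] = flat[outlier_idx]  # keep outliers exactly
    return reconstructed.reshape(W.shape), outlier_idx

# Example: keeping ~1% of weights exactly lowers the reconstruction error.
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
W_hat, idx = sparse_plus_quantized(W, bits=3)
print(f"outliers kept: {idx.size}, MSE: {np.mean((W - W_hat) ** 2):.6f}")
```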
Publishing Year
2024
Date Published
2024-05-15
Proceedings Title
12th International Conference on Learning Representations
Publisher
OpenReview
Acknowledgement
Denis Kuznedelev acknowledges the support from the Russian Ministry of Science and Higher Education, grant No. 075-10-2021-068. Ruslan Svirschevski, Vage Egiazarian, and Denis Kuznedelev were supported by the grant for research centers in the field of AI provided by the Analytical Center for the Government of the Russian Federation (ACRF) in accordance with the agreement on the provision of subsidies (identifier of the agreement 000000D730321P5Q0002) and the agreement with HSE University No. 70-2021-00139.
Conference
ICLR: International Conference on Learning Representations
Conference Location
Vienna, Austria
Conference Date
2024-05-07 – 2024-05-11
IST-REx-ID
Cite this
Dettmers T, Svirschevski RA, Egiazarian V, et al. SpQR: A sparse-quantized representation for near-lossless LLM weight compression. In: 12th International Conference on Learning Representations. OpenReview; 2024.
Dettmers, T., Svirschevski, R. A., Egiazarian, V., Kuznedelev, D., Frantar, E., Ashkboos, S., … Alistarh, D.-A. (2024). SpQR: A sparse-quantized representation for near-lossless LLM weight compression. In 12th International Conference on Learning Representations. Vienna, Austria: OpenReview.
Dettmers, Tim, Ruslan A. Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, and Dan-Adrian Alistarh. “SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression.” In 12th International Conference on Learning Representations. OpenReview, 2024.
T. Dettmers et al., “SpQR: A sparse-quantized representation for near-lossless LLM weight compression,” in 12th International Conference on Learning Representations, Vienna, Austria, 2024.
Dettmers T, Svirschevski RA, Egiazarian V, Kuznedelev D, Frantar E, Ashkboos S, Borzunov A, Hoefler T, Alistarh D-A. 2024. SpQR: A sparse-quantized representation for near-lossless LLM weight compression. 12th International Conference on Learning Representations. ICLR: International Conference on Learning Representations.
Dettmers, Tim, et al. “SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression.” 12th International Conference on Learning Representations, OpenReview, 2024.
All files available under the following license(s):
Copyright Statement:
This Item is protected by copyright and/or related rights. [...]
Link(s) to Main File(s)
Access Level

Sources
arXiv 2306.03078