{"date_published":"2024-09-01T00:00:00Z","author":[{"full_name":"Egiazarian, Vage","last_name":"Egiazarian","first_name":"Vage"},{"last_name":"Panferov","full_name":"Panferov, Andrei","first_name":"Andrei","id":"2c18daae-4dbe-11ef-8491-98ce2d960f09"},{"full_name":"Kuznedelev, Denis","last_name":"Kuznedelev","first_name":"Denis"},{"full_name":"Frantar, Elias","last_name":"Frantar","id":"09a8f98d-ec99-11ea-ae11-c063a7b7fe5f","first_name":"Elias"},{"full_name":"Babenko, Artem","last_name":"Babenko","first_name":"Artem"},{"orcid":"0000-0003-3650-940X","id":"4A899BFC-F248-11E8-B48F-1D18A9856A87","first_name":"Dan-Adrian","full_name":"Alistarh, Dan-Adrian","last_name":"Alistarh"}],"publication":"Proceedings of the 41st International Conference on Machine Learning","department":[{"_id":"DaAl"},{"_id":"GradSch"}],"corr_author":"1","month":"09","conference":{"end_date":"2024-07-27","location":"Vienna, Austria","name":"ICML: International Conference on Machine Learning","start_date":"2024-07-21"},"title":"Extreme compression of large language models via additive quantization","year":"2024","abstract":[{"text":"The emergence of accurate open large language models (LLMs) has led to a race towards performant quantization techniques which can enable their execution on end-user devices. In this paper, we revisit the problem of “extreme” LLM compression—defined as targeting extremely low bit counts, such as 2 to 3 bits per parameter—from the point of view of classic methods in Multi-Codebook Quantization (MCQ). Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval to advance the state-of-the-art in LLM compression, via two innovations: 1) learned additive quantization of weight matrices in input-adaptive fashion, and 2) joint optimization of codebook parameters across each transformer blocks. Broadly, AQLM is the first scheme that is Pareto optimal in terms of accuracy-vs-model-size when compressing to less than 3 bits per parameter, and significantly improves upon all known schemes in the extreme compression (2bit) regime. In addition, AQLM is practical: we provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed, while executing in a much smaller memory footprint.","lang":"eng"}],"language":[{"iso":"eng"}],"type":"conference","date_updated":"2024-10-01T08:13:05Z","page":"12284-12303","oa_version":"Preprint","date_created":"2024-09-22T22:01:43Z","intvolume":" 235","_id":"18113","scopus_import":"1","publication_status":"published","day":"01","alternative_title":["PMLR"],"article_processing_charge":"No","status":"public","quality_controlled":"1","oa":1,"acknowledgement":"Authors would like to thank Ruslan Svirschevski for his help in solving technical issues with AQLM and baselines. We also thank Tim Dettmers for helpful discussions on the structure of weights in modern LLMs and size-accuracy trade-offs. The authors would also like to thank Daniil Pavlov for his assistance with CPU benchmarking. Finally, authors would like to thank the communities of ML enthusiasts known as LocalLLaMA5 and Petals community on discord6\r\nfor the crowd wisdom about running LLMs on consumer devices. 
Vage Egiazarian, Denis Kuznedelev, and Andrei Panferov were supported by the grant for research centers in the field of AI provided by the Analytical Center for the Government of the Russian Federation (ACRF) in accordance with the agreement on the provision of subsidies (identifier of the agreement 000000D730321P5Q0002) and the agreement with HSE University No. 70-2021-00139.","external_id":{"arxiv":["2401.06118"]},"main_file_link":[{"url":"https://doi.org/10.48550/arXiv.2401.06118","open_access":"1"}],"publication_identifier":{"eissn":["2640-3498"]},"user_id":"2DF688A6-F248-11E8-B48F-1D18A9856A87","volume":235,"citation":{"apa":"Egiazarian, V., Panferov, A., Kuznedelev, D., Frantar, E., Babenko, A., & Alistarh, D.-A. (2024). Extreme compression of large language models via additive quantization. In Proceedings of the 41st International Conference on Machine Learning (Vol. 235, pp. 12284–12303). Vienna, Austria: ML Research Press.","ista":"Egiazarian V, Panferov A, Kuznedelev D, Frantar E, Babenko A, Alistarh D-A. 2024. Extreme compression of large language models via additive quantization. Proceedings of the 41st International Conference on Machine Learning. ICML: International Conference on Machine Learning, PMLR, vol. 235, 12284–12303.","mla":"Egiazarian, Vage, et al. “Extreme Compression of Large Language Models via Additive Quantization.” Proceedings of the 41st International Conference on Machine Learning, vol. 235, ML Research Press, 2024, pp. 12284–303.","short":"V. Egiazarian, A. Panferov, D. Kuznedelev, E. Frantar, A. Babenko, D.-A. Alistarh, in: Proceedings of the 41st International Conference on Machine Learning, ML Research Press, 2024, pp. 12284–12303.","ama":"Egiazarian V, Panferov A, Kuznedelev D, Frantar E, Babenko A, Alistarh D-A. Extreme compression of large language models via additive quantization. In: Proceedings of the 41st International Conference on Machine Learning. Vol 235. ML Research Press; 2024:12284-12303.","ieee":"V. Egiazarian, A. Panferov, D. Kuznedelev, E. Frantar, A. Babenko, and D.-A. Alistarh, “Extreme compression of large language models via additive quantization,” in Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria, 2024, vol. 235, pp. 12284–12303.","chicago":"Egiazarian, Vage, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, and Dan-Adrian Alistarh. “Extreme Compression of Large Language Models via Additive Quantization.” In Proceedings of the 41st International Conference on Machine Learning, 235:12284–303. ML Research Press, 2024."},"publisher":"ML Research Press"}
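
Below is a minimal sketch of the multi-codebook (additive) quantization idea described in the abstract, not the paper's AQLM implementation: each group of d weights is stored as M code indices, one per learned codebook, and reconstructed as the sum of the selected codewords. The group size d, number of codebooks M, codebook size K, and the greedy residual encoder are illustrative assumptions; AQLM instead optimizes codes and codebooks jointly, in an input-adaptive fashion, across each transformer block.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not the paper's settings):
# groups of d=8 weights, M=2 codebooks of K=256 entries each
# -> M * log2(K) / d = 2 bits per weight, the "extreme" regime.
d, M, K = 8, 2, 256
codebooks = rng.normal(size=(M, K, d)).astype(np.float32)

def encode(group, codebooks):
    # Greedy residual assignment: for each codebook in turn, pick the
    # codeword closest to the remaining residual. This only illustrates
    # the data layout; AQLM learns codes and codebooks jointly against
    # a calibration objective.
    residual = group.copy()
    codes = np.empty(M, dtype=np.int64)
    for m in range(M):
        dists = np.linalg.norm(codebooks[m] - residual, axis=1)
        codes[m] = int(np.argmin(dists))
        residual = residual - codebooks[m, codes[m]]
    return codes

def decode(codes, codebooks):
    # Reconstruction is simply the sum of the selected codewords.
    return sum(codebooks[m, codes[m]] for m in range(M))

group = rng.normal(size=d).astype(np.float32)
codes = encode(group, codebooks)
approx = decode(codes, codebooks)
print("bits/weight:", M * np.log2(K) / d)
print("reconstruction error:", float(np.linalg.norm(group - approx)))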