“Give me BF16 or give me death”? Accuracy-performance trade-offs in LLM quantization

Kurtic E, Marques A, Pandit S, Kurtz M, Alistarh D-A. 2025. “Give me BF16 or give me death”? Accuracy-performance trade-offs in LLM quantization. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics. ACL: Meeting of the Association for Computational Linguistics, 26872–26886.

Download
OA 2025_ACL_Kurtic.pdf 417.45 KB [Published Version]
Conference Paper | Published | English

Scopus indexed
Author
Kurtic, Eldar (ISTA); Marques, Alexandre; Pandit, Shubhra; Kurtz, Mark; Alistarh, Dan-Adrian (ISTA)

Corresponding author has ISTA affiliation

Abstract
Quantization is a powerful tool for accelerating large language model (LLM) inference, but the accuracy-performance trade-offs across different formats remain unclear. In this paper, we conduct the most comprehensive empirical study to date, evaluating FP8, INT8, and INT4 quantization across academic benchmarks and real-world tasks on the entire Llama-3.1 model family. Through over 500,000 evaluations, our investigation yields several key findings: (1) FP8 (W8A8-FP) is effectively lossless across all model scales, (2) well-tuned INT8 (W8A8-INT) achieves surprisingly low (1-3%) accuracy degradation, and (3) INT4 weight-only (W4A16-INT) is more competitive than expected, rivaling 8-bit quantization. Further, we investigate the optimal quantization format for different deployments by analyzing inference performance through the popular vLLM framework. Our analysis provides clear deployment recommendations: W4A16 is the most cost-efficient for synchronous setups, while W8A8 dominates in asynchronous continuous batching. For mixed workloads, the optimal choice depends on the specific use case. Our findings offer practical, data-driven guidelines for deploying quantized LLMs at scale, ensuring the best balance between speed, efficiency, and accuracy.
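The W8A8 and W4A16 labels in the abstract encode the bit-widths of weights (W) and activations (A). As a rough illustration of the integer side of these formats (a minimal sketch, not code from the paper), the snippet below performs a symmetric per-tensor INT8 quantize/dequantize round trip in NumPy; production pipelines such as those behind vLLM additionally use per-channel scales and calibration data.

```python
import numpy as np

# Illustrative sketch only: symmetric per-tensor INT8 quantization,
# the basic idea behind the W8A8-INT format discussed in the abstract.
# Real implementations use per-channel scales and calibration.

def quantize_int8(x: np.ndarray):
    """Map float values to int8 with a single symmetric scale."""
    scale = np.abs(x).max() / 127.0  # largest magnitude maps to +/-127
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float values from int8 codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)  # toy "weight" tensor
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
print("max round-trip error:", float(np.abs(w - w_hat).max()))
```

The maximum round-trip error of such a scheme is bounded by half the scale, which is why 8-bit integer formats can stay within the 1-3% accuracy degradation the paper reports when scales are well tuned.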
Publishing Year
2025
Date Published
2025-08-01
Proceedings Title
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics
Publisher
Association for Computational Linguistics
Page
26872-26886
Conference
ACL: Meeting of the Association for Computational Linguistics
Conference Location
Vienna, Austria
Conference Date
2025-07-27 – 2025-08-01

Cite this

Kurtic E, Marques A, Pandit S, Kurtz M, Alistarh D-A. “Give me BF16 or give me death”? Accuracy-performance trade-offs in LLM quantization. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics; 2025:26872-26886.
Kurtic, E., Marques, A., Pandit, S., Kurtz, M., & Alistarh, D.-A. (2025). “Give me BF16 or give me death”? Accuracy-performance trade-offs in LLM quantization. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (pp. 26872–26886). Vienna, Austria: Association for Computational Linguistics.
Kurtic, Eldar, Alexandre Marques, Shubhra Pandit, Mark Kurtz, and Dan-Adrian Alistarh. “‘Give Me BF16 or Give Me Death’? Accuracy-Performance Trade-Offs in LLM Quantization.” In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 26872–86. Association for Computational Linguistics, 2025.
E. Kurtic, A. Marques, S. Pandit, M. Kurtz, and D.-A. Alistarh, “‘Give me BF16 or give me death’? Accuracy-performance trade-offs in LLM quantization,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, Vienna, Austria, 2025, pp. 26872–26886.
Kurtic E, Marques A, Pandit S, Kurtz M, Alistarh D-A. 2025. “Give me BF16 or give me death”? Accuracy-performance trade-offs in LLM quantization. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics. ACL: Meeting of the Association for Computational Linguistics, 26872–26886.
Kurtic, Eldar, et al. “‘Give Me BF16 or Give Me Death’? Accuracy-Performance Trade-Offs in LLM Quantization.” Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2025, pp. 26872–86.
All files available under the following license(s):
Creative Commons Attribution 4.0 International Public License (CC-BY 4.0):
Main File(s)
File Name
OA 2025_ACL_Kurtic.pdf
Access Level
OA Open Access
Date Uploaded
2025-11-26
MD5 Checksum
4c066ee20f9ab17619c95652c0eb75f1


Sources

arXiv 2411.02355
