“Give me BF16 or give me death”? Accuracy-performance trade-offs in LLM quantization

Kurtic E, Marques A, Pandit S, Kurtz M, Alistarh D-A. 2025. “Give me BF16 or give me death”? Accuracy-performance trade-offs in LLM quantization. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics. ACL: Meeting of the Association for Computational Linguistics, 26872–26886.

Download
OA 2025_ACL_Kurtic.pdf 417.45 KB [Published Version]
Conference Paper | Published | English

Scopus indexed
Author
Kurtic, Eldar (ISTA); Marques, Alexandre; Pandit, Shubhra; Kurtz, Mark; Alistarh, Dan-Adrian (ISTA)

Corresponding author has ISTA affiliation

Abstract
Quantization is a powerful tool for accelerating large language model (LLM) inference, but the accuracy-performance trade-offs across different formats remain unclear. In this paper, we conduct the most comprehensive empirical study to date, evaluating FP8, INT8, and INT4 quantization across academic benchmarks and real-world tasks on the entire Llama-3.1 model family. Through over 500,000 evaluations, our investigation yields several key findings: (1) FP8 (W8A8-FP) is effectively lossless across all model scales, (2) well-tuned INT8 (W8A8-INT) achieves surprisingly low (1-3%) accuracy degradation, and (3) INT4 weight-only (W4A16-INT) is more competitive than expected, rivaling 8-bit quantization. Further, we investigate the optimal quantization format for different deployments by analyzing inference performance through the popular vLLM framework. Our analysis provides clear deployment recommendations: W4A16 is the most cost-efficient for synchronous setups, while W8A8 dominates in asynchronous continuous batching. For mixed workloads, the optimal choice depends on the specific use case. Our findings offer practical, data-driven guidelines for deploying quantized LLMs at scale, ensuring the best balance between speed, efficiency, and accuracy.
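The W8A8 and W4A16 labels in the abstract encode the bit-widths of weights (W) and activations (A). As a rough illustration of the integer side of these formats (a minimal sketch, not code from the paper), the snippet below performs a symmetric per-tensor INT8 quantize/dequantize round trip in NumPy; production pipelines such as those behind vLLM additionally use per-channel scales and calibration data.

```python
import numpy as np

# Illustrative sketch only: symmetric per-tensor INT8 quantization,
# the basic idea behind the W8A8-INT format discussed in the abstract.
# Real implementations use per-channel scales and calibration.

def quantize_int8(x: np.ndarray):
    """Map float values to int8 with a single symmetric scale."""
    scale = np.abs(x).max() / 127.0  # largest magnitude maps to +/-127
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float values from int8 codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)  # toy "weight" tensor
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
print("max round-trip error:", float(np.abs(w - w_hat).max()))
```

The maximum round-trip error of such a scheme is bounded by half the scale, which is why 8-bit integer formats can stay within the 1-3% accuracy degradation the paper reports when scales are well tuned.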
Publishing Year
2025
Date Published
2025-08-01
Proceedings Title
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics
Publisher
Association for Computational Linguistics
Page
26872-26886
Conference
ACL: Meeting of the Association for Computational Linguistics
Conference Location
Vienna, Austria
Conference Date
2025-07-27 – 2025-08-01

Cite this

Kurtic E, Marques A, Pandit S, Kurtz M, Alistarh D-A. “Give me BF16 or give me death”? Accuracy-performance trade-offs in LLM quantization. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics; 2025:26872-26886.
Kurtic, E., Marques, A., Pandit, S., Kurtz, M., & Alistarh, D.-A. (2025). “Give me BF16 or give me death”? Accuracy-performance trade-offs in LLM quantization. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (pp. 26872–26886). Vienna, Austria: Association for Computational Linguistics.
Kurtic, Eldar, Alexandre Marques, Shubhra Pandit, Mark Kurtz, and Dan-Adrian Alistarh. “‘Give Me BF16 or Give Me Death’? Accuracy-Performance Trade-Offs in LLM Quantization.” In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 26872–86. Association for Computational Linguistics, 2025.
E. Kurtic, A. Marques, S. Pandit, M. Kurtz, and D.-A. Alistarh, “‘Give me BF16 or give me death’? Accuracy-performance trade-offs in LLM quantization,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, Vienna, Austria, 2025, pp. 26872–26886.
Kurtic E, Marques A, Pandit S, Kurtz M, Alistarh D-A. 2025. “Give me BF16 or give me death”? Accuracy-performance trade-offs in LLM quantization. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics. ACL: Meeting of the Association for Computational Linguistics, 26872–26886.
Kurtic, Eldar, et al. “‘Give Me BF16 or Give Me Death’? Accuracy-Performance Trade-Offs in LLM Quantization.” Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2025, pp. 26872–86.
All files available under the following license(s):
Creative Commons Attribution 4.0 International Public License (CC-BY 4.0):
Main File(s)
File Name
OA 2025_ACL_Kurtic.pdf
Access Level
OA Open Access
Date Uploaded
2025-11-26
MD5 Checksum
4c066ee20f9ab17619c95652c0eb75f1


Sources

arXiv 2411.02355
