{"citation":{"short":"E. Kurtic, A. Marques, S. Pandit, M. Kurtz, D.-A. Alistarh, in:, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2025, pp. 26872–26886.","chicago":"Kurtic, Eldar, Alexandre Marques, Shubhra Pandit, Mark Kurtz, and Dan-Adrian Alistarh. “‘Give Me BF16 or Give Me Death’? Accuracy-Performance Trade-Offs in LLM Quantization.” In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 26872–86. Association for Computational Linguistics, 2025.","ista":"Kurtic E, Marques A, Pandit S, Kurtz M, Alistarh D-A. 2025. “Give me BF16 or give me death”? Accuracy-performance trade-offs in LLM quantization. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics. ACL: Meeting of the Association for Computational Linguistics, 26872–26886.","mla":"Kurtic, Eldar, et al. “‘Give Me BF16 or Give Me Death’? Accuracy-Performance Trade-Offs in LLM Quantization.” Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2025, pp. 26872–86.","ama":"Kurtic E, Marques A, Pandit S, Kurtz M, Alistarh D-A. “Give me BF16 or give me death”? Accuracy-performance trade-offs in LLM quantization. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics; 2025:26872-26886.","apa":"Kurtic, E., Marques, A., Pandit, S., Kurtz, M., & Alistarh, D.-A. (2025). “Give me BF16 or give me death”? Accuracy-performance trade-offs in LLM quantization. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (pp. 26872–26886). Vienna, Austria: Association for Computational Linguistics.","ieee":"E. Kurtic, A. Marques, S. Pandit, M. Kurtz, and D.-A. Alistarh, “‘Give me BF16 or give me death’? Accuracy-performance trade-offs in LLM quantization,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, Vienna, Austria, 2025, pp. 26872–26886."},"OA_type":"gold","scopus_import":"1","date_created":"2025-11-24T14:20:46Z","date_published":"2025-08-01T00:00:00Z","ddc":["000"],"corr_author":"1","publisher":"Association for Computational Linguistics","type":"conference","status":"public","user_id":"2DF688A6-F248-11E8-B48F-1D18A9856A87","date_updated":"2025-11-26T11:15:11Z","OA_place":"publisher","file":[{"relation":"main_file","content_type":"application/pdf","access_level":"open_access","file_size":417450,"file_id":"20698","date_created":"2025-11-26T11:06:57Z","date_updated":"2025-11-26T11:06:57Z","file_name":"2025_ACL_Kurtic.pdf","success":1,"creator":"dernst","checksum":"4c066ee20f9ab17619c95652c0eb75f1"}],"publication_status":"published","month":"08","year":"2025","day":"01","publication_identifier":{"isbn":["9798891762510"],"issn":["0736-587X"]},"external_id":{"arxiv":["2411.02355"]},"department":[{"_id":"DaAl"}],"page":"26872-26886","title":"“Give me BF16 or give me death”? Accuracy-performance trade-offs in LLM quantization","_id":"20684","author":[{"first_name":"Eldar","full_name":"Kurtic, Eldar","last_name":"Kurtic","id":"47beb3a5-07b5-11eb-9b87-b108ec578218"},{"first_name":"Alexandre","full_name":"Marques, Alexandre","last_name":"Marques"},{"full_name":"Pandit, Shubhra","last_name":"Pandit","first_name":"Shubhra"},{"first_name":"Mark","full_name":"Kurtz, Mark","last_name":"Kurtz"},{"first_name":"Dan-Adrian","orcid":"0000-0003-3650-940X","id":"4A899BFC-F248-11E8-B48F-1D18A9856A87","last_name":"Alistarh","full_name":"Alistarh, Dan-Adrian"}],"publication":"Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics","tmp":{"short":"CC BY (4.0)","name":"Creative Commons Attribution 4.0 International Public License (CC-BY 4.0)","legal_code_url":"https://creativecommons.org/licenses/by/4.0/legalcode","image":"/images/cc_by.png"},"abstract":[{"lang":"eng","text":"Quantization is a powerful tool for accelerating large language model (LLM) inference, but the accuracy-performance trade-offs across different formats remain unclear. In this paper, we conduct the most comprehensive empirical study to date, evaluating FP8, INT8, and INT4 quantization across academic benchmarks and real-world tasks on the entire Llama-3.1 model family. Through over 500,000 evaluations, our investigation yields several key findings: (1) FP8 (W8A8-FP) is effectively lossless across all model scales, (2) well-tuned INT8 (W8A8-INT) achieves surprisingly low (1-3%) accuracy degradation, and (3) INT4 weight-only (W4A16-INT) is more competitive than expected, rivaling 8-bit quantization. Further, we investigate the optimal quantization format for different deployments by analyzing inference performance through the popular vLLM framework. Our analysis provides clear deployment recommendations: W4A16 is the most cost-efficient for synchronous setups, while W8A8 dominates in asynchronous continuous batching. For mixed workloads, the optimal choice depends on the specific use case. Our findings offer practical, data-driven guidelines for deploying quantized LLMs at scale—ensuring the best balance between speed, efficiency, and accuracy."}],"has_accepted_license":"1","oa":1,"article_processing_charge":"No","language":[{"iso":"eng"}],"quality_controlled":"1","oa_version":"Published Version","arxiv":1,"conference":{"start_date":"2025-07-27","end_date":"2025-08-01","name":"ACL: Meeting of the Association for Computational Linguistics","location":"Vienna, Austria"},"file_date_updated":"2025-11-26T11:06:57Z"}