MARLIN: Mixed-precision auto-regressive parallel inference on Large Language Models

Download
2025_PPoPP_Frantar.pdf (1.33 MB) [Published Version, Open Access]

Conference Paper | Published | English

Scopus indexed
Author
Frantar, Elias (ISTA); Castro, Roberto L.; Chen, Jiale (ISTA); Hoefler, Torsten; Alistarh, Dan-Adrian (ISTA)

Corresponding author has ISTA affiliation

Abstract
As inference on Large Language Models (LLMs) emerges as an important workload in machine learning applications, model weight quantization has become a standard technique for efficient GPU deployment. Quantization not only reduces model size, but has also been shown to yield substantial speedups for single-user inference, due to reduced memory movement, with low accuracy impact. Yet, it remains a key open question whether speedups are also achievable in batched settings with multiple parallel clients, which are highly relevant for practical serving. It is unclear whether GPU kernels can be designed to remain practically memory-bound while supporting the substantially increased compute requirements of batched workloads. In this paper, we resolve this question positively by introducing a new design for Mixed-precision Auto-Regressive LINear kernels, called MARLIN. Concretely, given a model whose weights are compressed via quantization to, e.g., 4 bits per element, MARLIN shows that batch sizes of up to 16-32 can be practically supported with close to the maximum (4×) quantization speedup, and larger batch sizes of up to 64-128 with gradually decreasing, but still significant, acceleration. MARLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling and pipelining, and bespoke quantization support. Our experiments show that MARLIN's near-optimal performance on individual LLM layers across different scenarios can also lead to significant end-to-end LLM inference speedups (of up to 2.8×) when integrated with the popular vLLM open-source serving engine. Finally, we show that MARLIN is extensible to further compression techniques, like NVIDIA 2:4 sparsity, leading to additional speedups.
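The batch-size behavior described in the abstract follows from a simple roofline argument: 4-bit weights cut weight traffic roughly 4×, and a kernel remains memory-bound until per-token compute, which grows linearly with batch size, overtakes memory-transfer time. The Python sketch below is not from the paper; the bandwidth and FLOP figures are assumptions chosen only to illustrate where that crossover lands for a single large linear layer.

# Back-of-the-envelope roofline model for one quantized LLM linear layer,
# y = x @ W, with W of shape (k, n) and FP16 activations x of shape (batch, k).
# The GPU figures below are hypothetical placeholders, not measurements.
MEM_BW = 600e9        # assumed DRAM bandwidth, bytes/s
PEAK_FLOPS = 125e12   # assumed FP16 tensor-core throughput, FLOP/s

def layer_time(batch, k, n, weight_bits):
    """Roofline estimate: the layer takes max(memory time, compute time)."""
    weight_bytes = k * n * weight_bits / 8      # dominant traffic at small batch
    act_bytes = batch * (k + n) * 2             # FP16 inputs and outputs
    t_mem = (weight_bytes + act_bytes) / MEM_BW
    t_compute = 2 * batch * k * n / PEAK_FLOPS  # 2 FLOPs per multiply-accumulate
    return max(t_mem, t_compute)

k = n = 8192  # a large transformer linear layer
for batch in (1, 16, 32, 64, 128, 256):
    speedup = layer_time(batch, k, n, 16) / layer_time(batch, k, n, 4)
    print(f"batch={batch:4d}  ideal 4-bit speedup ~ {speedup:.2f}x")

Under these assumed figures, the ideal speedup stays near 4× through batch size 32 and tapers toward 1× beyond batch size 128, mirroring the regimes the abstract describes; MARLIN's contribution is a kernel design that approaches this bound in practice.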
Publishing Year
2025
Date Published
2025-02-28
Proceedings Title
Proceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming
Publisher
Association for Computing Machinery
Acknowledgement
The authors would like to thank the Neural Magic team, in particular Michael Goin, Alexander Matveev, and Rob Shaw, for support with the vLLM integration. This research was supported in part by generous grants from NVIDIA and Google.
Page
239-251
Conference
PPoPP: Symposium on Principles and Practice of Parallel Programming
Conference Location
Las Vegas, NV, United States
Conference Date
2025-03-01 – 2025-03-05

Cite this

Frantar E, Castro RL, Chen J, Hoefler T, Alistarh D-A. MARLIN: Mixed-precision auto-regressive parallel inference on Large Language Models. In: Proceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming. Association for Computing Machinery; 2025:239-251. doi:10.1145/3710848.3710871
Frantar, E., Castro, R. L., Chen, J., Hoefler, T., & Alistarh, D.-A. (2025). MARLIN: Mixed-precision auto-regressive parallel inference on Large Language Models. In Proceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (pp. 239–251). Las Vegas, NV, United States: Association for Computing Machinery. https://doi.org/10.1145/3710848.3710871
Frantar, Elias, Roberto L. Castro, Jiale Chen, Torsten Hoefler, and Dan-Adrian Alistarh. “MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models.” In Proceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 239–51. Association for Computing Machinery, 2025. https://doi.org/10.1145/3710848.3710871.
E. Frantar, R. L. Castro, J. Chen, T. Hoefler, and D.-A. Alistarh, “MARLIN: Mixed-precision auto-regressive parallel inference on Large Language Models,” in Proceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, Las Vegas, NV, United States, 2025, pp. 239–251, doi: 10.1145/3710848.3710871.
Frantar E, Castro RL, Chen J, Hoefler T, Alistarh D-A. 2025. MARLIN: Mixed-precision auto-regressive parallel inference on Large Language Models. Proceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming. PPoPP: Symposium on Principles and Practice of Parallel Programming, 239–251.
Frantar, Elias, et al. “MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models.” Proceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, Association for Computing Machinery, 2025, pp. 239–51, doi:10.1145/3710848.3710871.
All files available under the following license(s):
Creative Commons Attribution 4.0 International Public License (CC-BY 4.0)
Main File(s)
File Name
2025_PPoPP_Frantar.pdf
Access Level
Open Access
Date Uploaded
2025-06-24
MD5 Checksum
a0566ea3c168e8273501a5eb7d767cf8



Sources

arXiv 2408.11743
