QMoE: Sub-1-bit compression of trillion parameter models

Frantar, Elias; Alistarh, Dan-Adrian

Frantar E, Alistarh D-A. 2024. QMoE: Sub-1-bit compression of trillion parameter models. Proceedings of Machine Learning and Systems. MLSys: Machine Learning and Systems vol. 6.

Conference Paper | Published | English

Author

Frantar, Elias (ISTA); Alistarh, Dan-Adrian (ISTA)

Editor

Gibbons, P.; Pekhimenko, G.; De Sa, C.

Corresponding author has ISTA affiliation

Department

Alistarh Group

Abstract

Mixture-of-Experts (MoE) architectures offer a general solution to the high inference costs of large language models (LLMs) via sparse routing, bringing faster and more accurate models, at the cost of massive parameter counts. For example, the SwitchTransformer-c2048 model has 1.6 trillion parameters, requiring 3.2TB of accelerator memory to run efficiently, which makes practical deployment challenging and expensive. In this paper, we present a solution to this memory problem, in the form of a new compression and execution framework called QMoE. Specifically, QMoE consists of a scalable algorithm which accurately compresses trillion-parameter MoEs to less than 1 bit per parameter, in a custom format co-designed with bespoke GPU decoding kernels to facilitate efficient end-to-end compressed inference, with minor runtime overheads relative to uncompressed execution. Concretely, QMoE can compress the 1.6 trillion parameter SwitchTransformer-c2048 model to less than 160GB (20x compression, 0.8 bits per parameter) at only minor accuracy loss, in less than a day on a single GPU. This enables, for the first time, the execution of a trillion-parameter model on affordable commodity hardware, like a single server with 4x NVIDIA A6000 or 8x NVIDIA 3090 GPUs, at less than 5% runtime overhead relative to ideal uncompressed inference. The anonymized code is available at: github.com/mlsys24-qmoe/qmoe.
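The memory figures quoted in the abstract can be checked with simple arithmetic. The sketch below is illustrative only and is not taken from the QMoE code. It assumes that the 3.2TB figure corresponds to 16 bits per parameter (inferred from 3.2TB / 1.6 trillion parameters), that 0.8 bits per parameter applies to the whole model, and that each A6000 and 3090 has 48GB and 24GB of memory respectively (card capacities are not stated in this record). The function name `model_size_bytes` is hypothetical.

```python
def model_size_bytes(num_params: float, bits_per_param: float) -> float:
    """Storage needed for num_params parameters at the given bit width."""
    return num_params * bits_per_param / 8  # 8 bits per byte


NUM_PARAMS = 1.6e12        # SwitchTransformer-c2048, from the abstract
UNCOMPRESSED_BITS = 16     # assumption: inferred from 3.2TB / 1.6T params
COMPRESSED_BITS = 0.8      # from the abstract

# Decimal units (1 TB = 1e12 bytes, 1 GB = 1e9 bytes)
uncompressed_tb = model_size_bytes(NUM_PARAMS, UNCOMPRESSED_BITS) / 1e12
compressed_gb = model_size_bytes(NUM_PARAMS, COMPRESSED_BITS) / 1e9
compression_ratio = UNCOMPRESSED_BITS / COMPRESSED_BITS

# Assumed per-card memory in GB (not stated in this record)
a6000_total_gb = 4 * 48
rtx3090_total_gb = 8 * 24

print(f"uncompressed: {uncompressed_tb:.1f} TB")
print(f"compressed:   {compressed_gb:.0f} GB")
print(f"ratio:        {compression_ratio:.0f}x")
print(f"fits on 4x A6000: {compressed_gb <= a6000_total_gb}")
print(f"fits on 8x 3090:  {compressed_gb <= rtx3090_total_gb}")
```

Under these assumptions the numbers are mutually consistent: 16 bits down to 0.8 bits is a 20x reduction, and the compressed weights leave some headroom within the combined memory of either GPU configuration. Activations and runtime buffers are not accounted for here.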

Publishing Year

2024

Date Published

2024-05-01

Proceedings Title

Proceedings of Machine Learning and Systems

Volume

6

Conference

MLSys: Machine Learning and Systems

Conference Location

Santa Clara, CA, USA

Conference Date

2024-05-13 – 2024-05-16

IST-REx-ID

18061

Cite this

Frantar E, Alistarh D-A. QMoE: Sub-1-bit compression of trillion parameter models. In: Gibbons P, Pekhimenko G, De Sa C, eds. Proceedings of Machine Learning and Systems. Vol 6. 2024.

Frantar, E., & Alistarh, D.-A. (2024). QMoE: Sub-1-bit compression of trillion parameter models. In P. Gibbons, G. Pekhimenko, & C. De Sa (Eds.), Proceedings of Machine Learning and Systems (Vol. 6). Santa Clara, CA, USA.

Frantar, Elias, and Dan-Adrian Alistarh. “QMoE: Sub-1-Bit Compression of Trillion Parameter Models.” In Proceedings of Machine Learning and Systems, edited by P. Gibbons, G. Pekhimenko, and C. De Sa, Vol. 6, 2024.

E. Frantar and D.-A. Alistarh, “QMoE: Sub-1-bit compression of trillion parameter models,” in Proceedings of Machine Learning and Systems, Santa Clara, CA, USA, 2024, vol. 6.

Frantar E, Alistarh D-A. 2024. QMoE: Sub-1-bit compression of trillion parameter models. Proceedings of Machine Learning and Systems. MLSys: Machine Learning and Systems vol. 6.

Frantar, Elias, and Dan-Adrian Alistarh. “QMoE: Sub-1-Bit Compression of Trillion Parameter Models.” Proceedings of Machine Learning and Systems, edited by P. Gibbons et al., vol. 6, 2024.

All files available under the following license(s):

Copyright Statement:

This Item is protected by copyright and/or related rights. [...]

Link(s) to Main File(s)

URL

https://proceedings.mlsys.org/paper_files/paper/2024/hash/c74b624843218d9b6713fcf299d6d5e4-Abstract-Conference.html

Access Level

Open Access

Material in ISTA:

Dissertation containing ISTA record

Compressing large neural networks : Algorithms, systems and scaling laws
