OPTQ: Accurate post-training quantization for generative pre-trained transformers
Frantar E, Ashkboos S, Hoefler T, Alistarh D-A. 2023. OPTQ: Accurate post-training quantization for generative pre-trained transformers. 11th International Conference on Learning Representations. ICLR: International Conference on Learning Representations.
Conference Paper | Published | English
        Scopus indexed
Author
        Corresponding author has ISTA affiliation
Department
    Abstract
    Generative Pre-trained Transformer models, known as GPT or OPT, set themselves apart through breakthrough performance across complex language modelling tasks, but also by their extremely high computational and storage costs. Specifically, due to their massive size, even inference for large, highly accurate GPT models may require multiple performant GPUs, which limits the usability of such models. While there is emerging work on relieving this pressure via model compression, the applicability and performance of existing compression techniques are limited by the scale and complexity of GPT models. In this paper, we address this challenge and propose OPTQ, a new one-shot weight quantization method based on approximate second-order information that is both highly accurate and highly efficient. Specifically, OPTQ can quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bitwidth down to 3 or 4 bits per weight, with negligible accuracy degradation relative to the uncompressed baseline. Our method more than doubles the compression gains relative to previously proposed one-shot quantization methods while preserving accuracy, allowing us for the first time to execute a 175-billion-parameter model inside a single GPU for generative inference. Moreover, we show that our method can still provide reasonable accuracy in the extreme quantization regime, in which weights are quantized to 2-bit or even ternary quantization levels. We show experimentally that these improvements can be leveraged for end-to-end inference speedups over FP16 of around 3.25x when using high-end GPUs (NVIDIA A100) and 4.5x when using more cost-effective ones (NVIDIA A6000). The implementation is available at https://github.com/IST-DASLab/gptq.
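    To make the abstract's bit-width figures concrete, the following is a minimal sketch, not the OPTQ algorithm itself. It computes the storage needed for the weights of a 175-billion-parameter model at 16, 4, and 3 bits per weight, and then applies plain round-to-nearest quantization on a uniform min-max grid, the simple baseline that second-order methods aim to improve on. The function names, the sample weights, and the choice of a min-max grid are illustrative assumptions; the footprint ignores activations, caches, and the per-group scale and offset metadata a real scheme would store.

```python
# Illustrative sketch only: simple round-to-nearest (RTN) quantization and
# weight-storage arithmetic. This is NOT the OPTQ method from the paper.


def weight_memory_gb(num_params: int, bits_per_weight: float) -> float:
    """Storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def quantize_rtn(weights, bits):
    """Map floats to integer codes in [0, 2**bits - 1] on a uniform min-max grid.

    Returns (codes, scale, lo), where a code c reconstructs to lo + c * scale.
    """
    levels = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    # Guard against a constant row, where hi == lo would give a zero scale.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Reconstruct approximate float weights from integer codes."""
    return [lo + c * scale for c in codes]


if __name__ == "__main__":
    num_params = 175_000_000_000
    for bits in (16, 4, 3):
        print(f"{bits:>2} bits/weight: {weight_memory_gb(num_params, bits):.1f} GB")

    # Hypothetical weight row, chosen only for illustration.
    row = [-0.8, -0.15, 0.0, 0.25, 0.6]
    codes, scale, lo = quantize_rtn(row, bits=3)
    restored = dequantize(codes, scale, lo)
    worst = max(abs(w - r) for w, r in zip(row, restored))
    print("codes:", codes)
    print("max reconstruction error:", worst)
```

    Under these assumptions the weights take 350.0 GB at 16 bits, 87.5 GB at 4 bits, and about 65.6 GB at 3 bits, which illustrates why low-bit quantization is what brings a model of this size within reach of a single large-memory GPU. Round-to-nearest bounds the per-weight error by half a grid step but ignores how errors interact across weights.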
    
  Publishing Year
    2023
  Date Published
    2023-05-01
  Proceedings Title
    11th International Conference on Learning Representations 
  Publisher
    International Conference on Learning Representations
  Acknowledgement
    Elias Frantar and Dan Alistarh gratefully acknowledge funding from the European Research Council (ERC) under the European Union’s Horizon 2020 programme (grant agreement No. 805223 ScaleML), as well as experimental support from Eldar Kurtic and from the IST Austria IT department, in particular Stefano Elefante, Andrei Hornoiu, and Alois Schloegl. The work of Saleh Ashkboos and Torsten Hoefler was supported by the PASC DaCeMI project and received EuroHPC-JU funding under grant MAELSTROM, No. 955513. We thank the Swiss National Supercomputing Center (CSCS) for supporting us with compute infrastructure.
  Acknowledged SSUs
  Conference
    ICLR: International Conference on Learning Representations
  Conference Location
    Kigali, Rwanda
  Conference Date
    2023-05-01 – 2023-05-05
  IST-REx-ID
  Cite this
    Frantar E, Ashkboos S, Hoefler T, Alistarh D-A. OPTQ: Accurate post-training quantization for generative pre-trained transformers. In: 11th International Conference on Learning Representations. International Conference on Learning Representations; 2023.
    Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D.-A. (2023). OPTQ: Accurate post-training quantization for generative pre-trained transformers. In 11th International Conference on Learning Representations. Kigali, Rwanda: International Conference on Learning Representations.
    Frantar, Elias, Saleh Ashkboos, Torsten Hoefler, and Dan-Adrian Alistarh. “OPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers.” In 11th International Conference on Learning Representations. International Conference on Learning Representations, 2023.
    E. Frantar, S. Ashkboos, T. Hoefler, and D.-A. Alistarh, “OPTQ: Accurate post-training quantization for generative pre-trained transformers,” in 11th International Conference on Learning Representations, Kigali, Rwanda, 2023.
    Frantar E, Ashkboos S, Hoefler T, Alistarh D-A. 2023. OPTQ: Accurate post-training quantization for generative pre-trained transformers. 11th International Conference on Learning Representations. ICLR: International Conference on Learning Representations.
    Frantar, Elias, et al. “OPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers.” 11th International Conference on Learning Representations, International Conference on Learning Representations, 2023.
  
      All files available under the following license(s):
      Copyright Statement: This Item is protected by copyright and/or related rights. [...]
      Main File(s)
    
  File Name
    2023_ICLR_Frantar.pdf (437.49 KB)
  Access Level
    Open Access
  Date Uploaded
    2024-08-05
  MD5 Checksum
    aacbf11dbd8b02a3e0bfd942a33e0593
  
      Material in ISTA:
      Dissertation containing ISTA record
  
