{"has_accepted_license":"1","file":[{"file_id":"17570","creator":"efrantar","date_created":"2024-09-05T12:04:11Z","content_type":"application/zip","file_name":"thesis-final.zip","file_size":1615167,"checksum":"5d785645805a78c5b4ce7cc3df557b09","relation":"source_file","date_updated":"2024-09-05T12:04:11Z","access_level":"closed"},{"access_level":"open_access","date_updated":"2024-09-06T16:24:59Z","success":1,"relation":"main_file","file_size":2376611,"file_name":"frantar_thesis_final.pdf","checksum":"a9dd1c2d23734986924eb44ebb55fd8f","content_type":"application/pdf","date_created":"2024-09-06T16:24:59Z","creator":"efrantar","file_id":"17880"}],"publication_identifier":{"issn":["2663-337X"]},"supervisor":[{"full_name":"Alistarh, Dan-Adrian","id":"4A899BFC-F248-11E8-B48F-1D18A9856A87","first_name":"Dan-Adrian","orcid":"0000-0003-3650-940X","last_name":"Alistarh"}],"oa":1,"page":"129","ddc":["000"],"type":"dissertation","acknowledged_ssus":[{"_id":"ScienComp"}],"project":[{"name":"Elastic Coordination for Scalable Machine Learning","grant_number":"805223","call_identifier":"H2020","_id":"268A44D6-B435-11E9-9278-68D0E5697425"}],"alternative_title":["ISTA Thesis"],"date_created":"2024-09-02T11:01:48Z","user_id":"8b945eb4-e2f2-11eb-945a-df72226e66a9","file_date_updated":"2024-09-06T16:24:59Z","degree_awarded":"PhD","department":[{"_id":"GradSch"},{"_id":"DaAl"}],"doi":"10.15479/at:ista:17485","oa_version":"Published Version","day":"05","publisher":"Institute of Science and Technology Austria","related_material":{"record":[{"id":"17087","status":"public","relation":"part_of_dissertation"},{"relation":"part_of_dissertation","status":"public","id":"17378"},{"relation":"part_of_dissertation","status":"public","id":"18062"},{"id":"18061","relation":"part_of_dissertation","status":"public"},{"id":"14458","status":"public","relation":"part_of_dissertation"}]},"corr_author":"1","year":"2024","ec_funded":1,"abstract":[{"text":"Large language models (LLMs) have made tremendous progress in the past few years, from being able to generate coherent text to matching or surpassing humans in a wide variety of creative, knowledge or reasoning tasks. Much of this can be attributed to massively increased scale, both in the size of the model as well as the amount of training data, from 100s of millions to 100s of billions, or even trillions. This trend is expected to continue, which, although exciting, also raises major practical concerns. Already today's 100+ billion parameter LLMs require top-of-the-line hardware just to run. Hence, it is clear that sustaining these developments will require significant efficiency advances.\r\n\r\nHistorically, one of the most practical ways of improving model efficiency has been compression, especially in the form of sparsity or quantization. While this has been studied extensively in the past, existing accurate methods are all designed for models around 100 million parameters; scaling them up to ones literally 1000x larger is highly challenging. In this thesis, we introduce a new unified sparsification and quantization approach OBC, which through additional algorithmic enhancements leads to GPTQ and SparseGPT, the first techniques fast and accurate enough to compress 100+ billion parameter models to 4- or even 3-bit precision and 50% weight-sparsity, respectively. 
Additionally, we show that weight-only quantization brings not just space savings but also up to 4.5x faster generation, via custom GPU kernels.\r\n\r\nIn fact, we show for the first time that it is possible to develop an FP16×INT4 mixed-precision matrix multiplication kernel, called Marlin, which comes close to simultaneously maximizing both memory and compute utilization, making weight-only quantization highly practical even for multi-user serving. Further, we demonstrate that GPTQ can be scaled to vastly overparametrized trillion-parameter models, where extreme sub-1-bit compression rates can be achieved without any inference slow-down, by co-designing a bespoke entropy coding scheme together with an efficient kernel.\r\n\r\nFinally, we also study compression from the perspective of a practitioner with access to massive amounts of compute resources for training large models completely from scratch. Here, the key questions revolve around the joint scaling behavior between compression, model size, and the amount of training data used. Based on extensive experimental results for both vision and text models, we introduce the first scaling law that accurately captures the relationship between weight sparsity, the number of non-zero weights, and the amount of training data. This further allows us to characterize the optimal sparsity, which we find increases the longer a fixed-cost model is trained.\r\n\r\nOverall, this thesis presents contributions along three different angles of large model efficiency: affordable but accurate algorithms, highly efficient systems implementations, and fundamental scaling laws for compressed training.","lang":"eng"}],"title":"Compressing large neural networks : Algorithms, systems and scaling laws","_id":"17485","date_updated":"2024-10-09T21:07:11Z","article_processing_charge":"No","month":"09","citation":{"short":"E. Frantar, Compressing Large Neural Networks : Algorithms, Systems and Scaling Laws, Institute of Science and Technology Austria, 2024.","chicago":"Frantar, Elias. “Compressing Large Neural Networks : Algorithms, Systems and Scaling Laws.” Institute of Science and Technology Austria, 2024. https://doi.org/10.15479/at:ista:17485.","ama":"Frantar E. Compressing large neural networks : Algorithms, systems and scaling laws. 2024. doi:10.15479/at:ista:17485","apa":"Frantar, E. (2024). Compressing large neural networks : Algorithms, systems and scaling laws. Institute of Science and Technology Austria. https://doi.org/10.15479/at:ista:17485","mla":"Frantar, Elias. Compressing Large Neural Networks : Algorithms, Systems and Scaling Laws. Institute of Science and Technology Austria, 2024, doi:10.15479/at:ista:17485.","ista":"Frantar E. 2024. Compressing large neural networks : Algorithms, systems and scaling laws. Institute of Science and Technology Austria.","ieee":"E. Frantar, “Compressing large neural networks : Algorithms, systems and scaling laws,” Institute of Science and Technology Austria, 2024."},"status":"public","date_published":"2024-09-05T00:00:00Z","publication_status":"published","language":[{"iso":"eng"}],"author":[{"full_name":"Frantar, Elias","id":"09a8f98d-ec99-11ea-ae11-c063a7b7fe5f","first_name":"Elias","last_name":"Frantar"}]}