{"user_id":"2DF688A6-F248-11E8-B48F-1D18A9856A87","oa_version":"Published Version","department":[{"_id":"MoHe"}],"main_file_link":[{"url":"https://doi.org/10.48550/arXiv.2402.17327","open_access":"1"}],"day":"01","publication":"Proceedings of the 41st International Conference on Machine Learning","publisher":"ML Research Press","publication_identifier":{"eissn":["2640-3498"]},"quality_controlled":"1","oa":1,"page":"2086-2107","conference":{"end_date":"2024-07-27","start_date":"2024-07-21","name":"ICML: International Conference on Machine Learning","location":"Vienna, Austria"},"volume":235,"type":"conference","date_created":"2024-09-22T22:01:44Z","acknowledgement":"Monika Henzinger: This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (Grant agreement No. 101019564) and the Austrian Science Fund (FWF) grant DOI 10.55776/Z422, grant DOI 10.55776/I5982, and grant DOI 10.55776/P33775 with additional funding from the netidee SCIENCE Stiftung, 2020–2024. This work was partially done while David Saulpic was at the Institute for Science and Technology, Austria (ISTA). David Saulpic has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 101034413. 
Work was done while David Woodruff was visiting Google Research.","project":[{"grant_number":"101019564","name":"The design and evaluation of modern fully dynamic data structures","call_identifier":"H2020","_id":"bd9ca328-d553-11ed-ba76-dc4f890cfe62"},{"grant_number":"Z00422","name":"Wittgenstein Award - Monika Henzinger","_id":"34def286-11ca-11ed-8bc3-da5948e1613c"},{"grant_number":"I05982","name":"Static and Dynamic Hierarchical Graph Decompositions","_id":"bda196b2-d553-11ed-ba76-8e8ee6c21103"},{"_id":"bd9e3a2e-d553-11ed-ba76-8aa684ce17fe","name":"Fast Algorithms for a Reactive Network Layer","grant_number":"P33775 "},{"_id":"fc2ed2f7-9c52-11eb-aca3-c01059dda49c","call_identifier":"H2020","grant_number":"101034413","name":"IST-BRIDGE: International postdoctoral program"}],"alternative_title":["PMLR"],"intvolume":" 235","external_id":{"arxiv":["2402.17327"]},"abstract":[{"lang":"eng","text":"We study the data selection problem, whose aim is to select a small representative subset of data that can be used to efficiently train a machine learning model. We present a new data selection approach based on k-means clustering and sensitivity sampling. Assuming access to an embedding representation of the data with respect to which the model loss is Hölder continuous, our approach provably allows selecting a set of “typical” k + 1/ε^2 elements whose average loss corresponds to the average loss of the whole dataset, up to a multiplicative (1±ε) factor and an additive ελΦ_k, where Φ_k represents the k-means cost for the input embeddings and λ is the Hölder constant. We furthermore demonstrate the performance and scalability of our approach on fine-tuning foundation models and show that it outperforms state-of-the-art methods. 
We also show how it can be applied on linear regression, leading to a new sampling strategy that surprisingly matches the performance of leverage score sampling, while being conceptually simpler and more scalable."}],"date_updated":"2024-10-01T08:35:31Z","_id":"18115","article_processing_charge":"No","title":"Data-efficient learning via clustering-based sensitivity sampling: Foundation models and beyond","status":"public","date_published":"2024-09-01T00:00:00Z","publication_status":"published","citation":{"ama":"Axiotis K, Cohen-Addad V, Henzinger MH, et al. Data-efficient learning via clustering-based sensitivity sampling: Foundation models and beyond. In: Proceedings of the 41st International Conference on Machine Learning. Vol 235. ML Research Press; 2024:2086-2107.","short":"K. Axiotis, V. Cohen-Addad, M.H. Henzinger, S. Jerome, V. Mirrokni, D. Saulpic, D.P. Woodruff, M. Wunder, in:, Proceedings of the 41st International Conference on Machine Learning, ML Research Press, 2024, pp. 2086–2107.","chicago":"Axiotis, Kyriakos, Vincent Cohen-Addad, Monika H Henzinger, Sammy Jerome, Vahab Mirrokni, David Saulpic, David P. Woodruff, and Michael Wunder. “Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond.” In Proceedings of the 41st International Conference on Machine Learning, 235:2086–2107. ML Research Press, 2024.","ista":"Axiotis K, Cohen-Addad V, Henzinger MH, Jerome S, Mirrokni V, Saulpic D, Woodruff DP, Wunder M. 2024. Data-efficient learning via clustering-based sensitivity sampling: Foundation models and beyond. Proceedings of the 41st International Conference on Machine Learning. ICML: International Conference on Machine Learning, PMLR, vol. 235, 2086–2107.","ieee":"K. Axiotis et al., “Data-efficient learning via clustering-based sensitivity sampling: Foundation models and beyond,” in Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria, 2024, vol. 235, pp. 
2086–2107.","apa":"Axiotis, K., Cohen-Addad, V., Henzinger, M. H., Jerome, S., Mirrokni, V., Saulpic, D., … Wunder, M. (2024). Data-efficient learning via clustering-based sensitivity sampling: Foundation models and beyond. In Proceedings of the 41st International Conference on Machine Learning (Vol. 235, pp. 2086–2107). Vienna, Austria: ML Research Press.","mla":"Axiotis, Kyriakos, et al. “Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond.” Proceedings of the 41st International Conference on Machine Learning, vol. 235, ML Research Press, 2024, pp. 2086–107."},"month":"09","language":[{"iso":"eng"}],"author":[{"last_name":"Axiotis","first_name":"Kyriakos","full_name":"Axiotis, Kyriakos"},{"first_name":"Vincent","last_name":"Cohen-Addad","full_name":"Cohen-Addad, Vincent"},{"full_name":"Henzinger, Monika H","id":"540c9bbd-f2de-11ec-812d-d04a5be85630","first_name":"Monika H","last_name":"Henzinger","orcid":"0000-0002-5008-6530"},{"first_name":"Sammy","last_name":"Jerome","full_name":"Jerome, Sammy"},{"last_name":"Mirrokni","first_name":"Vahab","full_name":"Mirrokni, Vahab"},{"first_name":"David","last_name":"Saulpic","full_name":"Saulpic, David","id":"f8e48cf0-b0ff-11ed-b0e9-b4c35598f964"},{"last_name":"Woodruff","first_name":"David P.","full_name":"Woodruff, David P."},{"last_name":"Wunder","first_name":"Michael","full_name":"Wunder, Michael"}],"year":"2024","ec_funded":1,"scopus_import":"1"}