Data-efficient learning via clustering-based sensitivity sampling: Foundation models and beyond
Axiotis K, Cohen-Addad V, Henzinger MH, Jerome S, Mirrokni V, Saulpic D, Woodruff DP, Wunder M. 2024. Data-efficient learning via clustering-based sensitivity sampling: Foundation models and beyond. Proceedings of the 41st International Conference on Machine Learning. ICML: International Conference on Machine Learning, PMLR, vol. 235, 2086–2107.
Download (ext.)
https://doi.org/10.48550/arXiv.2402.17327
[Published Version]
Conference Paper
| Published
| English
Scopus indexed
Author
Axiotis, Kyriakos;
Cohen-Addad, Vincent;
Henzinger, Monika (ISTA);
Jerome, Sammy;
Mirrokni, Vahab;
Saulpic, David (ISTA);
Woodruff, David P.;
Wunder, Michael
Department
Grant
Series Title
PMLR
Abstract
We study the data selection problem, whose aim is to select a small representative subset of data that can be used to efficiently train a machine learning model. We present a new data selection approach based on k-means clustering and sensitivity sampling. Assuming access to an embedding representation of the data with respect to which the model loss is Hölder continuous, our approach provably allows selecting a set of "typical" k + 1/ε² elements whose average loss corresponds to the average loss of the whole dataset, up to a multiplicative (1 ± ε) factor and an additive ελΦ_k, where Φ_k represents the k-means cost for the input embeddings and λ is the Hölder constant. We furthermore demonstrate the performance and scalability of our approach on fine-tuning foundation models and show that it outperforms state-of-the-art methods. We also show how it can be applied to linear regression, leading to a new sampling strategy that surprisingly matches the performance of leverage score sampling, while being conceptually simpler and more scalable.
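As an illustration of the approach described in the abstract, the following is a minimal sketch (in Python, using NumPy and scikit-learn) of clustering-based sensitivity sampling. The function name sensitivity_sample, the scoring rule dist²/Φ_k + 1/(k·|cluster|), and all default parameters are illustrative assumptions for exposition, not the authors' exact algorithm.

# Minimal sketch of clustering-based sensitivity sampling (illustrative,
# not the authors' exact procedure). Embeddings are assumed to be a
# NumPy array of shape (n_points, dim).
import numpy as np
from sklearn.cluster import KMeans

def sensitivity_sample(embeddings, k=10, num_samples=1000, seed=0):
    """Return indices and importance weights of a representative subset."""
    rng = np.random.default_rng(seed)

    # 1. Cluster the embeddings with k-means.
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(embeddings)
    labels = km.labels_
    # Squared distance of every point to its assigned center.
    dists = np.linalg.norm(embeddings - km.cluster_centers_[labels], axis=1) ** 2
    phi_k = max(dists.sum(), 1e-12)  # k-means cost Phi_k of the embeddings

    # 2. Sensitivity-style scores: points far from their center and points
    #    in small clusters are "atypical" and get sampled more often.
    cluster_sizes = np.bincount(labels, minlength=k)
    scores = dists / phi_k + 1.0 / (k * cluster_sizes[labels])
    probs = scores / scores.sum()

    # 3. Importance-sample the subset and reweight so that the weighted
    #    average loss over the subset estimates the full-data average loss.
    idx = rng.choice(len(embeddings), size=num_samples, replace=True, p=probs)
    weights = 1.0 / (num_samples * probs[idx])
    return idx, weights

A downstream trainer would then evaluate the loss only on embeddings[idx], weighting each term by the corresponding entry of weights, so that the weighted subset loss tracks the full-dataset loss in the sense of the guarantee stated above.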
Publishing Year
2024
Date Published
2024-09-01
Proceedings Title
Proceedings of the 41st International Conference on Machine Learning
Publisher
ML Research Press
Acknowledgement
Monika Henzinger: This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (Grant agreement No. 101019564) and the Austrian Science Fund (FWF) grant DOI 10.55776/Z422, grant DOI 10.55776/I5982, and grant DOI 10.55776/P33775 with additional funding from the netidee SCIENCE Stiftung, 2020–2024. This work was partially done while David Saulpic was at the Institute of Science and Technology Austria (ISTA). David Saulpic has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 101034413. Work was done while David Woodruff was visiting Google Research.
Volume
235
Page
2086-2107
Conference
ICML: International Conference on Machine Learning
Conference Location
Vienna, Austria
Conference Date
2024-07-21 – 2024-07-27
eISSN
IST-REx-ID
Cite this
Axiotis K, Cohen-Addad V, Henzinger MH, et al. Data-efficient learning via clustering-based sensitivity sampling: Foundation models and beyond. In: Proceedings of the 41st International Conference on Machine Learning. Vol 235. ML Research Press; 2024:2086-2107.
Axiotis, K., Cohen-Addad, V., Henzinger, M. H., Jerome, S., Mirrokni, V., Saulpic, D., … Wunder, M. (2024). Data-efficient learning via clustering-based sensitivity sampling: Foundation models and beyond. In Proceedings of the 41st International Conference on Machine Learning (Vol. 235, pp. 2086–2107). Vienna, Austria: ML Research Press.
Axiotis, Kyriakos, Vincent Cohen-Addad, Monika H. Henzinger, Sammy Jerome, Vahab Mirrokni, David Saulpic, David P. Woodruff, and Michael Wunder. “Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond.” In Proceedings of the 41st International Conference on Machine Learning, 235:2086–2107. ML Research Press, 2024.
K. Axiotis et al., “Data-efficient learning via clustering-based sensitivity sampling: Foundation models and beyond,” in Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria, 2024, vol. 235, pp. 2086–2107.
Axiotis K, Cohen-Addad V, Henzinger MH, Jerome S, Mirrokni V, Saulpic D, Woodruff DP, Wunder M. 2024. Data-efficient learning via clustering-based sensitivity sampling: Foundation models and beyond. Proceedings of the 41st International Conference on Machine Learning. ICML: International Conference on Machine Learning, PMLR, vol. 235, 2086–2107.
Axiotis, Kyriakos, et al. “Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond.” Proceedings of the 41st International Conference on Machine Learning, vol. 235, ML Research Press, 2024, pp. 2086–107.
All files available under the following license(s):
Copyright Statement:
This Item is protected by copyright and/or related rights. [...]
Link(s) to Main File(s)
Access Level
Open Access
Sources
arXiv 2402.17327