Data-efficient learning via clustering-based sensitivity sampling: Foundation models and beyond

Axiotis, Kyriakos; Cohen-Addad, Vincent; Henzinger, Monika H; Jerome, Sammy; Mirrokni, Vahab; Saulpic, David; Woodruff, David P.; Wunder, Michael

Axiotis K, Cohen-Addad V, Henzinger M, Jerome S, Mirrokni V, Saulpic D, Woodruff DP, Wunder M. 2024. Data-efficient learning via clustering-based sensitivity sampling: Foundation models and beyond. Proceedings of the 41st International Conference on Machine Learning. ICML: International Conference on Machine Learning, PMLR, vol. 235, 2086–2107.

Download (ext.)

https://doi.org/10.48550/arXiv.2402.17327 [Published Version]

Conference Paper | Published | English

Scopus indexed

Author

Axiotis, Kyriakos; Cohen-Addad, Vincent; Henzinger, Monika (ISTA); Jerome, Sammy; Mirrokni, Vahab; Saulpic, David (ISTA); Woodruff, David P.; Wunder, Michael

Department

Henzinger_Monika Group

Grant

The design and evaluation of modern fully dynamic data structures
Efficient algorithms
Static and Dynamic Hierarchical Graph Decompositions
Fast Algorithms for a Reactive Network Layer
IST-BRIDGE: International postdoctoral program

Series Title

PMLR

Abstract

We study the data selection problem, whose aim is to select a small representative subset of data that can be used to efficiently train a machine learning model. We present a new data selection approach based on k-means clustering and sensitivity sampling. Assuming access to an embedding representation of the data with respect to which the model loss is Hölder continuous, our approach provably allows selecting a set of “typical” k + 1/ε² elements whose average loss corresponds to the average loss of the whole dataset, up to a multiplicative (1±ε) factor and an additive ελΦ_k, where Φ_k represents the k-means cost for the input embeddings and λ is the Hölder constant. We furthermore demonstrate the performance and scalability of our approach on fine-tuning foundation models and show that it outperforms state-of-the-art methods. We also show how it can be applied to linear regression, leading to a new sampling strategy that surprisingly matches the performance of leverage score sampling, while being conceptually simpler and more scalable.
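
The recipe in the abstract (cluster the embeddings with k-means, then sample points with probability tied to their clustering cost) can be illustrated with a minimal sketch. This is not the authors' implementation, and the paper's exact scores may differ: the scores used here are the classic k-means sensitivity upper bound (a point's share of the clustering cost plus a uniform term for its cluster). The function names `kmeans` and `sensitivity_sample`, the plain Lloyd iterations, the toy data, and the choice of sample size as k + 1/ε² with constants ignored are all assumptions made for illustration.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))


def kmeans(points, k, iters=20, rng=None):
    """Plain Lloyd iterations (illustrative; any k-means solver would do)."""
    rng = rng or random.Random(0)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        assign = [nearest(p, centers) for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old center if a cluster goes empty
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    # Recompute assignments so they match the final centers.
    return centers, [nearest(p, centers) for p in points]


def sensitivity_sample(points, k, m, seed=0):
    """Draw m indices with replacement, plus importance weights.

    Score of a point = (its cost / total k-means cost) + 1 / (cluster size).
    This is an assumed, standard sensitivity bound, not the paper's formula.
    """
    rng = random.Random(seed)
    centers, assign = kmeans(points, k, rng=rng)
    costs = [sq_dist(p, centers[a]) for p, a in zip(points, assign)]
    phi = sum(costs)  # k-means cost of the embeddings
    sizes = [assign.count(j) for j in range(k)]
    scores = [(c / phi if phi > 0 else 0.0) + 1.0 / sizes[a]
              for c, a in zip(costs, assign)]
    total = sum(scores)
    probs = [s / total for s in scores]
    idx = rng.choices(range(len(points)), weights=probs, k=m)
    # Weight 1 / (m * p_i) makes the weighted sum an unbiased estimate
    # of the sum over the full dataset.
    weights = [1.0 / (m * probs[i]) for i in idx]
    return idx, weights


# Toy usage: two Gaussian blobs standing in for embeddings.
data_rng = random.Random(42)
embeddings = [(data_rng.gauss(cx, 1.0), data_rng.gauss(0.0, 1.0))
              for cx in (0.0, 8.0) for _ in range(200)]
k, eps = 2, 0.2
m = k + math.ceil(1 / eps ** 2)  # sample size suggested by the abstract
idx, weights = sensitivity_sample(embeddings, k, m)


def loss(p):
    """Hypothetical per-example loss, used only to exercise the estimator."""
    return sq_dist(p, (0.0, 0.0))


estimate = sum(w * loss(embeddings[i]) for i, w in zip(idx, weights)) / len(embeddings)
exact = sum(loss(p) for p in embeddings) / len(embeddings)
print(f"sampled {m} of {len(embeddings)} points; "
      f"estimated mean loss {estimate:.3f}, exact {exact:.3f}")
```

The weighted estimate is unbiased for the full-dataset average for any choice of sampling probabilities. The role of the sensitivity scores is to keep its variance small when the loss varies smoothly with the embedding.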

Publishing Year

2024

Date Published

2024-09-01

Proceedings Title

Proceedings of the 41st International Conference on Machine Learning

Publisher

ML Research Press

Acknowledgement

Monika Henzinger: This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (Grant agreement No. 101019564) and the Austrian Science Fund (FWF) grant DOI 10.55776/Z422, grant DOI 10.55776/I5982, and grant DOI 10.55776/P33775 with additional funding from the netidee SCIENCE Stiftung, 2020–2024. This work was partially done while David Saulpic was at the Institute for Science and Technology, Austria (ISTA). David Saulpic has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 101034413. Work was done while David Woodruff was visiting Google Research.

Volume

235

Page

2086-2107

Conference

ICML: International Conference on Machine Learning

Conference Location

Vienna, Austria

Conference Date

2024-07-21 – 2024-07-27

eISSN

2640-3498

IST-REx-ID

18115

Cite this

Axiotis K, Cohen-Addad V, Henzinger M, et al. Data-efficient learning via clustering-based sensitivity sampling: Foundation models and beyond. In: Proceedings of the 41st International Conference on Machine Learning. Vol 235. ML Research Press; 2024:2086-2107.

Axiotis, K., Cohen-Addad, V., Henzinger, M., Jerome, S., Mirrokni, V., Saulpic, D., … Wunder, M. (2024). Data-efficient learning via clustering-based sensitivity sampling: Foundation models and beyond. In Proceedings of the 41st International Conference on Machine Learning (Vol. 235, pp. 2086–2107). Vienna, Austria: ML Research Press.

Axiotis, Kyriakos, Vincent Cohen-Addad, Monika Henzinger, Sammy Jerome, Vahab Mirrokni, David Saulpic, David P. Woodruff, and Michael Wunder. “Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond.” In Proceedings of the 41st International Conference on Machine Learning, 235:2086–2107. ML Research Press, 2024.

K. Axiotis et al., “Data-efficient learning via clustering-based sensitivity sampling: Foundation models and beyond,” in Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria, 2024, vol. 235, pp. 2086–2107.

Axiotis, Kyriakos, et al. “Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond.” Proceedings of the 41st International Conference on Machine Learning, vol. 235, ML Research Press, 2024, pp. 2086–107.

All files available under the following license(s):

Copyright Statement:

This Item is protected by copyright and/or related rights. [...]

Link(s) to Main File(s)

URL

https://doi.org/10.48550/arXiv.2402.17327

Access Level

Open Access

Sources

arXiv 2402.17327
