Data-efficient learning via clustering-based sensitivity sampling: Foundation models and beyond

Type: Conference paper
Series: PMLR
Status: published
Authors: Kyriakos Axiotis; Vincent Cohen-Addad; Monika H Henzinger (ORCID 0000-0002-5008-6530); Sammy Jerome; Vahab Mirrokni; David Saulpic; David P. Woodruff; Michael Wunder
Department: MoHe
Conference: ICML: International Conference on Machine Learning
Projects: The design and evaluation of modern fully dynamic data structures; Wittgenstein Award - Monika Henzinger; Static and Dynamic Hierarchical Graph Decompositions; Fast Algorithms for a Reactive Network Layer; IST-BRIDGE: International postdoctoral program

Abstract: We study the data selection problem, whose aim is to select a small representative subset of data that can be used to efficiently train a machine learning model. We present a new data selection approach based on k-means clustering and sensitivity sampling. Assuming access to an embedding representation of the data with respect to which the model loss is Hölder continuous, our approach provably allows selecting a set of "typical" k + 1/ε² elements whose average loss corresponds to the average loss of the whole dataset, up to a multiplicative (1 ± ε) factor and an additive ελΦ_k, where Φ_k represents the k-means cost for the input embeddings and λ is the Hölder constant. We furthermore demonstrate the performance and scalability of our approach on fine-tuning foundation models and show that it outperforms state-of-the-art methods. We also show how it can be applied to linear regression, leading to a new sampling strategy that surprisingly matches the performance of leverage score sampling, while being conceptually simpler and more scalable.
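The selection idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a squared-distance (k-means style) score, uses a plain Lloyd's k-means written in NumPy, and all names (kmeans, sensitivity_sample, center_loss, lam, m) are hypothetical. Each point is sampled with probability proportional to the loss at its cluster center plus lam times its squared distance to that center, and is reweighted by the inverse of that probability so that the weighted sample loss estimates the total loss.

```python
import numpy as np


def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm; returns (centers, labels)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise centers with k distinct data points.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.argmin(axis=1)


def sensitivity_sample(X, center_loss, k, m, lam, seed=0):
    """Draw m indices (with replacement) and importance weights.

    center_loss: callable returning a non-negative loss for one embedding;
                 it is evaluated only at the k cluster centers.
    lam:         assumed Hoelder-type constant scaling the distance term.
    Assumes the scores are not all zero.
    """
    X = np.asarray(X, dtype=float)
    centers, labels = kmeans(X, k, seed=seed)
    # Squared distance of every point to its assigned center.
    dist2 = ((X - centers[labels]) ** 2).sum(axis=1)
    # Proxy loss: every point inherits the loss of its cluster center.
    proxy = np.array([center_loss(c) for c in centers])[labels]
    scores = proxy + lam * dist2
    probs = scores / scores.sum()
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=True, p=probs)
    # Inverse-probability weights: sum(weights * loss) estimates the total loss.
    weights = 1.0 / (m * probs[idx])
    return idx, weights
```

A caller would pass the embeddings as X, a loss function evaluated on embeddings as center_loss, and then train or evaluate on X[idx] with the returned weights. Choosing m on the order of 1/ε² mirrors the sample size stated in the abstract, but the exact constants here are an assumption.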
Publisher: ML Research Press
Year: 2024
Conference location: Vienna, Austria
Language: eng
Published in: Proceedings of the 41st International Conference on Machine Learning
ISSN: 2640-3498
arXiv: 2402.17327
Volume: 235
Pages: 2086-2107

Citation: Axiotis, Kyriakos, Vincent Cohen-Addad, Monika H Henzinger, Sammy Jerome, Vahab Mirrokni, David Saulpic, David P. Woodruff, and Michael Wunder. "Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond." In Proceedings of the 41st International Conference on Machine Learning, 235:2086-2107. Vienna, Austria: ML Research Press, 2024.