<?xml version="1.0" encoding="UTF-8"?>

<modsCollection xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.loc.gov/mods/v3" xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-3.xsd">
<mods version="3.3">

<genre>conference paper</genre>

<titleInfo><title>Data-efficient learning via clustering-based sensitivity sampling: Foundation models and beyond</title></titleInfo>

  
  
<titleInfo type="alternative">
  
  <title>PMLR</title>
</titleInfo>

<note type="publicationStatus">published</note>


<note type="qualityControlled">yes</note>

<name type="personal">
  <namePart type="given">Kyriakos</namePart>
  <namePart type="family">Axiotis</namePart>
  <role><roleTerm type="text">author</roleTerm> </role></name>
<name type="personal">
  <namePart type="given">Vincent</namePart>
  <namePart type="family">Cohen-Addad</namePart>
  <role><roleTerm type="text">author</roleTerm> </role></name>
<name type="personal">
  <namePart type="given">Monika H</namePart>
  <namePart type="family">Henzinger</namePart>
  <role><roleTerm type="text">author</roleTerm> </role><identifier type="local">540c9bbd-f2de-11ec-812d-d04a5be85630</identifier><description xsi:type="identifierDefinition" type="orcid">0000-0002-5008-6530</description></name>
<name type="personal">
  <namePart type="given">Sammy</namePart>
  <namePart type="family">Jerome</namePart>
  <role><roleTerm type="text">author</roleTerm> </role></name>
<name type="personal">
  <namePart type="given">Vahab</namePart>
  <namePart type="family">Mirrokni</namePart>
  <role><roleTerm type="text">author</roleTerm> </role></name>
<name type="personal">
  <namePart type="given">David</namePart>
  <namePart type="family">Saulpic</namePart>
  <role><roleTerm type="text">author</roleTerm> </role><identifier type="local">f8e48cf0-b0ff-11ed-b0e9-b4c35598f964</identifier></name>
<name type="personal">
  <namePart type="given">David P.</namePart>
  <namePart type="family">Woodruff</namePart>
  <role><roleTerm type="text">author</roleTerm> </role></name>
<name type="personal">
  <namePart type="given">Michael</namePart>
  <namePart type="family">Wunder</namePart>
  <role><roleTerm type="text">author</roleTerm> </role></name>







<name type="corporate">
  <namePart></namePart>
  <identifier type="local">MoHe</identifier>
  <role>
    <roleTerm type="text">department</roleTerm>
  </role>
</name>



<name type="conference">
  <namePart>ICML: International Conference on Machine Learning</namePart>
</name>



<name type="corporate">
  <namePart>The design and evaluation of modern fully dynamic data structures</namePart>
  <role><roleTerm type="text">project</roleTerm></role>
</name>
<name type="corporate">
  <namePart>Efficient algorithms</namePart>
  <role><roleTerm type="text">project</roleTerm></role>
</name>
<name type="corporate">
  <namePart>Static and Dynamic Hierarchical Graph Decompositions</namePart>
  <role><roleTerm type="text">project</roleTerm></role>
</name>
<name type="corporate">
  <namePart>Fast Algorithms for a Reactive Network Layer</namePart>
  <role><roleTerm type="text">project</roleTerm></role>
</name>
<name type="corporate">
  <namePart>IST-BRIDGE: International postdoctoral program</namePart>
  <role><roleTerm type="text">project</roleTerm></role>
</name>



<abstract lang="eng">We study the data selection problem, whose aim is to select a small representative subset of data that can be used to efficiently train a machine learning model. We present a new data selection approach based on k-means clustering and sensitivity sampling. Assuming access to an embedding representation of the data with respect to which the model loss is Hölder continuous, our approach provably allows selecting a set of “typical” k + 1/ε² elements whose average loss corresponds to the average loss of the whole dataset, up to a multiplicative (1±ε) factor and an additive ελΦ_k, where Φ_k represents the k-means cost for the input embeddings and λ is the Hölder constant. We furthermore demonstrate the performance and scalability of our approach on fine-tuning foundation models and show that it outperforms state-of-the-art methods. We also show how it can be applied to linear regression, leading to a new sampling strategy that surprisingly matches the performance of leverage score sampling, while being conceptually simpler and more scalable.</abstract>

<originInfo><publisher>ML Research Press</publisher><dateIssued encoding="w3cdtf">2024</dateIssued><place><placeTerm type="text">Vienna, Austria</placeTerm></place>
</originInfo>
<language><languageTerm authority="iso639-2b" type="code">eng</languageTerm>
</language>



<relatedItem type="host"><titleInfo><title>Proceedings of the 41st International Conference on Machine Learning</title></titleInfo>
  <identifier type="eIssn">2640-3498</identifier>
  <identifier type="arXiv">2402.17327</identifier>
<part><detail type="volume"><number>235</number></detail><extent unit="pages">2086-2107</extent>
</part>
</relatedItem>


<extension>
<bibliographicCitation>
<ieee>K. Axiotis &lt;i&gt;et al.&lt;/i&gt;, “Data-efficient learning via clustering-based sensitivity sampling: Foundation models and beyond,” in &lt;i&gt;Proceedings of the 41st International Conference on Machine Learning&lt;/i&gt;, Vienna, Austria, 2024, vol. 235, pp. 2086–2107.</ieee>
<chicago>Axiotis, Kyriakos, Vincent Cohen-Addad, Monika Henzinger, Sammy Jerome, Vahab Mirrokni, David Saulpic, David P. Woodruff, and Michael Wunder. “Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond.” In &lt;i&gt;Proceedings of the 41st International Conference on Machine Learning&lt;/i&gt;, 235:2086–2107. ML Research Press, 2024.</chicago>
<mla>Axiotis, Kyriakos, et al. “Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond.” &lt;i&gt;Proceedings of the 41st International Conference on Machine Learning&lt;/i&gt;, vol. 235, ML Research Press, 2024, pp. 2086–107.</mla>
<apa>Axiotis, K., Cohen-Addad, V., Henzinger, M., Jerome, S., Mirrokni, V., Saulpic, D., … Wunder, M. (2024). Data-efficient learning via clustering-based sensitivity sampling: Foundation models and beyond. In &lt;i&gt;Proceedings of the 41st International Conference on Machine Learning&lt;/i&gt; (Vol. 235, pp. 2086–2107). Vienna, Austria: ML Research Press.</apa>
<ista>Axiotis K, Cohen-Addad V, Henzinger M, Jerome S, Mirrokni V, Saulpic D, Woodruff DP, Wunder M. 2024. Data-efficient learning via clustering-based sensitivity sampling: Foundation models and beyond. Proceedings of the 41st International Conference on Machine Learning. ICML: International Conference on Machine Learning, PMLR, vol. 235, 2086–2107.</ista>
<ama>Axiotis K, Cohen-Addad V, Henzinger M, et al. Data-efficient learning via clustering-based sensitivity sampling: Foundation models and beyond. In: &lt;i&gt;Proceedings of the 41st International Conference on Machine Learning&lt;/i&gt;. Vol 235. ML Research Press; 2024:2086-2107.</ama>
<short>K. Axiotis, V. Cohen-Addad, M. Henzinger, S. Jerome, V. Mirrokni, D. Saulpic, D.P. Woodruff, M. Wunder, in:, Proceedings of the 41st International Conference on Machine Learning, ML Research Press, 2024, pp. 2086–2107.</short>
</bibliographicCitation>
</extension>
<recordInfo><recordIdentifier>18115</recordIdentifier><recordCreationDate encoding="w3cdtf">2024-09-22T22:01:44Z</recordCreationDate><recordChangeDate encoding="w3cdtf">2025-04-14T13:50:50Z</recordChangeDate>
</recordInfo>
</mods>
</modsCollection>
