<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/terms/"
         xmlns:foaf="http://xmlns.com/foaf/0.1/"
         xmlns:bibo="http://purl.org/ontology/bibo/"
         xmlns:fabio="http://purl.org/spar/fabio/"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xmlns:event="http://purl.org/NET/c4dm/event.owl#"
         xmlns:ore="http://www.openarchives.org/ore/terms/">

    <rdf:Description rdf:about="https://research-explorer.ista.ac.at/record/18115">
        <ore:isDescribedBy rdf:resource="https://research-explorer.ista.ac.at/record/18115"/>
        <dc:title>Data-efficient learning via clustering-based sensitivity sampling: Foundation models and beyond</dc:title>
        <bibo:authorList rdf:parseType="Collection">
            <foaf:Person>
                <foaf:name></foaf:name>
                <foaf:surname></foaf:surname>
                <foaf:givenname></foaf:givenname>
            </foaf:Person>
            <foaf:Person>
                <foaf:name></foaf:name>
                <foaf:surname></foaf:surname>
                <foaf:givenname></foaf:givenname>
            </foaf:Person>
            <foaf:Person>
                <foaf:name></foaf:name>
                <foaf:surname></foaf:surname>
                <foaf:givenname></foaf:givenname>
            </foaf:Person>
            <foaf:Person>
                <foaf:name></foaf:name>
                <foaf:surname></foaf:surname>
                <foaf:givenname></foaf:givenname>
            </foaf:Person>
            <foaf:Person>
                <foaf:name></foaf:name>
                <foaf:surname></foaf:surname>
                <foaf:givenname></foaf:givenname>
            </foaf:Person>
            <foaf:Person>
                <foaf:name></foaf:name>
                <foaf:surname></foaf:surname>
                <foaf:givenname></foaf:givenname>
            </foaf:Person>
            <foaf:Person>
                <foaf:name></foaf:name>
                <foaf:surname></foaf:surname>
                <foaf:givenname></foaf:givenname>
            </foaf:Person>
            <foaf:Person>
                <foaf:name></foaf:name>
                <foaf:surname></foaf:surname>
                <foaf:givenname></foaf:givenname>
            </foaf:Person>
        </bibo:authorList>
        <bibo:abstract>We study the data selection problem, whose aim is to select a small representative subset of data that can be used to efficiently train a machine learning model. We present a new data selection approach based on k-means clustering and sensitivity sampling. Assuming access to an embedding representation of the data with respect to which the model loss is Holder continuous, our approach provably allows selecting a set of “typical” k+1/ε2 elements whose average loss corresponds to the average loss of the whole dataset, up to a multiplicative (1±ε)
 factor and an additive ελΦk, where Φk represents the k-means cost for the input embeddings and λ is the Holder constant. We furthermore demonstrate the performance and scalability of our approach on fine-tuning foundation models and show that it outperforms state-of-the-art methods. We also show how it can be applied on linear regression, leading to a new sampling strategy that surprisingly matches the performance of leverage score sampling, while being conceptually simpler and more scalable.</bibo:abstract>
        <bibo:volume>235</bibo:volume>
        <bibo:startPage>2086-2107</bibo:startPage>
        <bibo:endPage>2086-2107</bibo:endPage>
        <dc:publisher>ML Research Press</dc:publisher>
    </rdf:Description>
</rdf:RDF>
