<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<ListRecords>
<oai_dc:dc xmlns="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
   	<dc:title>ASIF: Coupled data turns unimodal models to multimodal without training</dc:title>
   	<dc:title>Advances in Neural Information Processing Systems</dc:title>
   	<dc:creator>Norelli, Antonio</dc:creator>
   	<dc:creator>Fumero, Marco</dc:creator>
   	<dc:creator>Maiorca, Valentino</dc:creator>
   	<dc:creator>Moschella, Luca</dc:creator>
   	<dc:creator>Rodolà, Emanuele</dc:creator>
    	<dc:creator>Locatello, Francesco; https://orcid.org/0000-0002-4850-0683</dc:creator>
   	<dc:subject>ddc:000</dc:subject>
    	<dc:description>CLIP proved that aligning visual and language spaces is key to solving many vision tasks without explicit training, but required training image and text encoders from scratch on a huge dataset. LiT improved this by training only the text encoder and using a pre-trained vision network. In this paper, we show that a common space can be created without any training at all, using single-domain encoders (trained with or without supervision) and a far smaller number of image-text pairs. Furthermore, our model has unique properties. Most notably, deploying a new version with updated training samples can be done in a matter of seconds. Additionally, the representations in the common space are easily interpretable, as every dimension corresponds to the similarity of the input to a unique entry in the multimodal dataset. Experiments on standard zero-shot visual benchmarks demonstrate the typical transfer ability of image-text models. Overall, our method represents a simple yet surprisingly strong baseline for foundation multimodal models, raising important questions on their data efficiency and on the role of retrieval in machine learning.</dc:description>
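    	<!--
    	A minimal Python/NumPy sketch of the shared space the abstract describes: an input is
    	represented by its similarities to the anchor embeddings of the coupled image-text pairs,
    	so every dimension is interpretable as the similarity to one dataset entry. Function names,
    	the sparsification parameter k, and the exponent p are illustrative assumptions, not the
    	authors' exact implementation or settings.

    	import numpy as np

    	def relative_rep(z, anchors, k=800, p=8):
    	    # Cosine similarity of the (normalized) embedding z to every anchor embedding.
    	    z = z / np.linalg.norm(z)
    	    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    	    sims = a @ z                          # one coordinate per coupled pair
    	    sims[np.argsort(sims)[:-k]] = 0.0     # keep only the k largest similarities
    	    sims = np.sign(sims) * np.abs(sims) ** p
    	    return sims / (np.linalg.norm(sims) + 1e-12)

    	def zero_shot_label(image_emb, caption_embs, anchor_img_embs, anchor_txt_embs):
    	    # Compare the image and each candidate caption in the shared relative space.
    	    ri = relative_rep(image_emb, anchor_img_embs)
    	    rt = np.stack([relative_rep(c, anchor_txt_embs) for c in caption_embs])
    	    return int(np.argmax(rt @ ri))
    	-->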
   	<dc:publisher>Neural Information Processing Systems Foundation</dc:publisher>
   	<dc:date>2023</dc:date>
   	<dc:type>info:eu-repo/semantics/conferenceObject</dc:type>
   	<dc:type>doc-type:conferenceObject</dc:type>
   	<dc:type>text</dc:type>
   	<dc:identifier>https://research-explorer.ista.ac.at/record/14216</dc:identifier>
   	<dc:identifier>https://research-explorer.ista.ac.at/download/14216/18994</dc:identifier>
    	<dc:source>Norelli A, Fumero M, Maiorca V, Moschella L, Rodolà E, Locatello F. ASIF: Coupled data turns unimodal models to multimodal without training. In: 37th Conference on Neural Information Processing Systems. Vol 36. Neural Information Processing Systems Foundation; 2023:15303-15319.</dc:source>
   	<dc:language>eng</dc:language>
   	<dc:relation>info:eu-repo/semantics/altIdentifier/isbn/9781713899921</dc:relation>
   	<dc:relation>info:eu-repo/semantics/altIdentifier/arxiv/2210.01738</dc:relation>
   	<dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
</oai_dc:dc>
</ListRecords>
</OAI-PMH>
