<?xml version="1.0" encoding="UTF-8"?>

<modsCollection xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.loc.gov/mods/v3" xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-3.xsd">
<mods version="3.3">

<genre>conference paper</genre>

<titleInfo><title>ASIF: Coupled data turns unimodal models to multimodal without training</title></titleInfo>

  
  
<titleInfo type="alternative">
  
  <title>Advances in Neural Information Processing Systems</title>
</titleInfo>

<note type="publicationStatus">published</note>


<note type="qualityControlled">yes</note>

<name type="personal">
  <namePart type="given">Antonio</namePart>
  <namePart type="family">Norelli</namePart>
  <role><roleTerm type="text">author</roleTerm> </role></name>
<name type="personal">
  <namePart type="given">Marco</namePart>
  <namePart type="family">Fumero</namePart>
  <role><roleTerm type="text">author</roleTerm> </role></name>
<name type="personal">
  <namePart type="given">Valentino</namePart>
  <namePart type="family">Maiorca</namePart>
  <role><roleTerm type="text">author</roleTerm> </role></name>
<name type="personal">
  <namePart type="given">Luca</namePart>
  <namePart type="family">Moschella</namePart>
  <role><roleTerm type="text">author</roleTerm> </role></name>
<name type="personal">
  <namePart type="given">Emanuele</namePart>
  <namePart type="family">Rodolà</namePart>
  <role><roleTerm type="text">author</roleTerm> </role></name>
<name type="personal">
  <namePart type="given">Francesco</namePart>
  <namePart type="family">Locatello</namePart>
  <role><roleTerm type="text">author</roleTerm> </role><identifier type="local">26cfd52f-2483-11ee-8040-88983bcc06d4</identifier><description xsi:type="identifierDefinition" type="orcid">0000-0002-4850-0683</description></name>







<name type="corporate">
  <namePart></namePart>
  <identifier type="local">FrLo</identifier>
  <role>
    <roleTerm type="text">department</roleTerm>
  </role>
</name>



<name type="conference">
  <namePart>NeurIPS: Neural Information Processing Systems</namePart>
</name>






<abstract lang="eng">CLIP proved that aligning visual and language spaces is key to solving many vision tasks without explicit training, but required training image and text encoders from scratch on a huge dataset. LiT improved this by only training the text encoder and using a pre-trained vision network. In this paper, we show that a common space can be created without any training at all, using single-domain encoders (trained with or without supervision) and a much smaller number of image-text pairs. Furthermore, our model has unique properties. Most notably, deploying a new version with updated training samples can be done in a matter of seconds. Additionally, the representations in the common space are easily interpretable as every dimension corresponds to the similarity of the input to a unique entry in the multimodal dataset. Experiments on standard zero-shot visual benchmarks demonstrate the typical transfer ability of image-text models. Overall, our method represents a simple yet surprisingly strong baseline for foundation multi-modal models, raising important questions on their data efficiency and on the role of retrieval in machine learning.</abstract>

<relatedItem type="constituent">
  <location>
    <url displayLabel="2023_NeurIPS_Fumero.pdf">https://research-explorer.ista.ac.at/download/14216/18994/2023_NeurIPS_Fumero.pdf</url>
  </location>
  <physicalDescription><internetMediaType>application/pdf</internetMediaType></physicalDescription><accessCondition type="restrictionOnAccess">no</accessCondition>
</relatedItem>
<originInfo><publisher>Neural Information Processing Systems Foundation</publisher><dateIssued encoding="w3cdtf">2023</dateIssued><place><placeTerm type="text">New Orleans, LA, United States</placeTerm></place>
</originInfo>
<language><languageTerm authority="iso639-2b" type="code">eng</languageTerm>
</language>



<relatedItem type="host"><titleInfo><title>37th Conference on Neural Information Processing Systems</title></titleInfo>
  <identifier type="isbn">9781713899921</identifier>
  <identifier type="arXiv">2210.01738</identifier>
<part><detail type="volume"><number>36</number></detail><extent unit="pages">15303-15319</extent>
</part>
</relatedItem>


<relatedItem type="Supplementary material">
  <location>
  
     <url>https://github.com/noranta4/ASIF</url>
  
  </location>
</relatedItem>

<extension>
<bibliographicCitation>
<ama>Norelli A, Fumero M, Maiorca V, Moschella L, Rodolà E, Locatello F. ASIF: Coupled data turns unimodal models to multimodal without training. In: &lt;i&gt;37th Conference on Neural Information Processing Systems&lt;/i&gt;. Vol 36. Neural Information Processing Systems Foundation; 2023:15303-15319.</ama>
<ieee>A. Norelli, M. Fumero, V. Maiorca, L. Moschella, E. Rodolà, and F. Locatello, “ASIF: Coupled data turns unimodal models to multimodal without training,” in &lt;i&gt;37th Conference on Neural Information Processing Systems&lt;/i&gt;, New Orleans, LA, United States, 2023, vol. 36, pp. 15303–15319.</ieee>
<chicago>Norelli, Antonio, Marco Fumero, Valentino Maiorca, Luca Moschella, Emanuele Rodolà, and Francesco Locatello. “ASIF: Coupled Data Turns Unimodal Models to Multimodal without Training.” In &lt;i&gt;37th Conference on Neural Information Processing Systems&lt;/i&gt;, 36:15303–19. Neural Information Processing Systems Foundation, 2023.</chicago>
<short>A. Norelli, M. Fumero, V. Maiorca, L. Moschella, E. Rodolà, F. Locatello, in:, 37th Conference on Neural Information Processing Systems, Neural Information Processing Systems Foundation, 2023, pp. 15303–15319.</short>
<ista>Norelli A, Fumero M, Maiorca V, Moschella L, Rodolà E, Locatello F. 2023. ASIF: Coupled data turns unimodal models to multimodal without training. 37th Conference on Neural Information Processing Systems. NeurIPS: Neural Information Processing Systems, Advances in Neural Information Processing Systems, vol. 36, 15303–15319.</ista>
<mla>Norelli, Antonio, et al. “ASIF: Coupled Data Turns Unimodal Models to Multimodal without Training.” &lt;i&gt;37th Conference on Neural Information Processing Systems&lt;/i&gt;, vol. 36, Neural Information Processing Systems Foundation, 2023, pp. 15303–19.</mla>
<apa>Norelli, A., Fumero, M., Maiorca, V., Moschella, L., Rodolà, E., &amp; Locatello, F. (2023). ASIF: Coupled data turns unimodal models to multimodal without training. In &lt;i&gt;37th Conference on Neural Information Processing Systems&lt;/i&gt; (Vol. 36, pp. 15303–15319). New Orleans, LA, United States: Neural Information Processing Systems Foundation.</apa>
</bibliographicCitation>
</extension>
<recordInfo><recordIdentifier>14216</recordIdentifier><recordCreationDate encoding="w3cdtf">2023-08-22T14:22:04Z</recordCreationDate><recordChangeDate encoding="w3cdtf">2025-05-14T11:28:52Z</recordChangeDate>
</recordInfo>
</mods>
</modsCollection>
