---
_id: '14216'
abstract:
- lang: eng
  text: CLIP proved that aligning visual and language spaces is key to solving many
    vision tasks without explicit training, but required to train image and text encoders
    from scratch on a huge dataset. LiT improved this by only training the text encoder
    and using a pre-trained vision network. In this paper, we show that a common space
    can be created without any training at all, using single-domain encoders (trained
    with or without supervision) and a much smaller amount of image-text pairs. Furthermore,
    our model has unique properties. Most notably, deploying a new version with updated
    training samples can be done in a matter of seconds. Additionally, the representations
    in the common space are easily interpretable as every dimension corresponds to
    the similarity of the input to a unique entry in the multimodal dataset. Experiments
    on standard zero-shot visual benchmarks demonstrate the typical transfer ability
    of image-text models. Overall, our method represents a simple yet surprisingly
    strong baseline for foundation multi-modal models, raising important questions
    on their data efficiency and on the role of retrieval in machine learning.
article_number: '2210.01738'
article_processing_charge: No
author:
- first_name: Antonio
  full_name: Norelli, Antonio
  last_name: Norelli
- first_name: Marco
  full_name: Fumero, Marco
  last_name: Fumero
- first_name: Valentino
  full_name: Maiorca, Valentino
  last_name: Maiorca
- first_name: Luca
  full_name: Moschella, Luca
  last_name: Moschella
- first_name: Emanuele
  full_name: Rodolà, Emanuele
  last_name: Rodolà
- first_name: Francesco
  full_name: Locatello, Francesco
  id: 26cfd52f-2483-11ee-8040-88983bcc06d4
  last_name: Locatello
  orcid: 0000-0002-4850-0683
citation:
  ama: 'Norelli A, Fumero M, Maiorca V, Moschella L, Rodolà E, Locatello F. ASIF:
    Coupled data turns unimodal models to multimodal without training. <i>arXiv</i>.
    doi:<a href="https://doi.org/10.48550/arXiv.2210.01738">10.48550/arXiv.2210.01738</a>'
  apa: 'Norelli, A., Fumero, M., Maiorca, V., Moschella, L., Rodolà, E., &#38; Locatello,
    F. (n.d.). ASIF: Coupled data turns unimodal models to multimodal without training.
    <i>arXiv</i>. <a href="https://doi.org/10.48550/arXiv.2210.01738">https://doi.org/10.48550/arXiv.2210.01738</a>'
  chicago: 'Norelli, Antonio, Marco Fumero, Valentino Maiorca, Luca Moschella, Emanuele
    Rodolà, and Francesco Locatello. “ASIF: Coupled Data Turns Unimodal Models to
    Multimodal without Training.” <i>ArXiv</i>, n.d. <a href="https://doi.org/10.48550/arXiv.2210.01738">https://doi.org/10.48550/arXiv.2210.01738</a>.'
  ieee: 'A. Norelli, M. Fumero, V. Maiorca, L. Moschella, E. Rodolà, and F. Locatello,
    “ASIF: Coupled data turns unimodal models to multimodal without training,” <i>arXiv</i>.
    .'
  ista: 'Norelli A, Fumero M, Maiorca V, Moschella L, Rodolà E, Locatello F. ASIF:
    Coupled data turns unimodal models to multimodal without training. arXiv, 2210.01738.'
  mla: 'Norelli, Antonio, et al. “ASIF: Coupled Data Turns Unimodal Models to Multimodal
    without Training.” <i>ArXiv</i>, 2210.01738, doi:<a href="https://doi.org/10.48550/arXiv.2210.01738">10.48550/arXiv.2210.01738</a>.'
  short: A. Norelli, M. Fumero, V. Maiorca, L. Moschella, E. Rodolà, F. Locatello,
    ArXiv (n.d.).
date_created: 2023-08-22T14:22:04Z
date_published: 2022-10-04T00:00:00Z
date_updated: 2024-02-12T09:57:14Z
day: '04'
department:
- _id: FrLo
doi: 10.48550/arXiv.2210.01738
external_id:
  arxiv:
  - '2210.01738'
language:
- iso: eng
main_file_link:
- open_access: '1'
  url: https://doi.org/10.48550/arXiv.2210.01738
month: '10'
oa: 1
oa_version: Preprint
publication: arXiv
publication_status: submitted
status: public
title: 'ASIF: Coupled data turns unimodal models to multimodal without training'
type: preprint
user_id: 2DF688A6-F248-11E8-B48F-1D18A9856A87
year: '2022'
...