---
OA_type: green
_id: '14216'
abstract:
- lang: eng
  text: CLIP proved that aligning visual and language spaces is key to solving many
    vision tasks without explicit training, but required training image and text
    encoders from scratch on a huge dataset. LiT improved on this by training only the text encoder
    and using a pre-trained vision network. In this paper, we show that a common space
    can be created without any training at all, using single-domain encoders (trained
    with or without supervision) and a much smaller amount of image-text pairs. Furthermore,
    our model has unique properties. Most notably, deploying a new version with updated
    training samples can be done in a matter of seconds. Additionally, the representations
    in the common space are easily interpretable as every dimension corresponds to
    the similarity of the input to a unique entry in the multimodal dataset. Experiments
    on standard zero-shot visual benchmarks demonstrate the typical transfer ability
    of image-text models. Overall, our method represents a simple yet surprisingly
    strong baseline for foundation multi-modal models, raising important questions
    on their data efficiency and on the role of retrieval in machine learning.
acknowledgement: "AN, MF, and FL partially worked on ASIF when they were at Amazon
  Web Services in Tübingen, Germany. This paper is financially supported by the
  PRIN 2020 project no.2020TA3K9N (LEGO.AI), PNRR MUR project PE0000013-FAIR, and
  ERC Grant no.802554 (SPECGEO)."
alternative_title:
- Advances in Neural Information Processing Systems
article_processing_charge: No
arxiv: 1
author:
- first_name: Antonio
  full_name: Norelli, Antonio
  last_name: Norelli
- first_name: Marco
  full_name: Fumero, Marco
  last_name: Fumero
- first_name: Valentino
  full_name: Maiorca, Valentino
  last_name: Maiorca
- first_name: Luca
  full_name: Moschella, Luca
  last_name: Moschella
- first_name: Emanuele
  full_name: Rodolà, Emanuele
  last_name: Rodolà
- first_name: Francesco
  full_name: Locatello, Francesco
  id: 26cfd52f-2483-11ee-8040-88983bcc06d4
  last_name: Locatello
  orcid: 0000-0002-4850-0683
citation:
  ama: 'Norelli A, Fumero M, Maiorca V, Moschella L, Rodolà E, Locatello F. ASIF:
    Coupled data turns unimodal models to multimodal without training. In: <i>37th
    Conference on Neural Information Processing Systems</i>. Vol 36. Neural Information
    Processing Systems Foundation; 2023:15303-15319.'
  apa: 'Norelli, A., Fumero, M., Maiorca, V., Moschella, L., Rodolà, E., &#38; Locatello,
    F. (2023). ASIF: Coupled data turns unimodal models to multimodal without training.
    In <i>37th Conference on Neural Information Processing Systems</i> (Vol. 36, pp.
    15303–15319). New Orleans, LA, United States: Neural Information Processing Systems
    Foundation.'
  chicago: 'Norelli, Antonio, Marco Fumero, Valentino Maiorca, Luca Moschella, Emanuele
    Rodolà, and Francesco Locatello. “ASIF: Coupled Data Turns Unimodal Models to
    Multimodal without Training.” In <i>37th Conference on Neural Information Processing
    Systems</i>, 36:15303–19. Neural Information Processing Systems Foundation, 2023.'
  ieee: 'A. Norelli, M. Fumero, V. Maiorca, L. Moschella, E. Rodolà, and F. Locatello,
    “ASIF: Coupled data turns unimodal models to multimodal without training,” in
    <i>37th Conference on Neural Information Processing Systems</i>, New Orleans,
    LA, United States, 2023, vol. 36, pp. 15303–15319.'
  ista: 'Norelli A, Fumero M, Maiorca V, Moschella L, Rodolà E, Locatello F. 2023.
    ASIF: Coupled data turns unimodal models to multimodal without training. 37th
    Conference on Neural Information Processing Systems. NeurIPS: Neural Information
    Processing Systems, Advances in Neural Information Processing Systems, vol. 36,
    15303–15319.'
  mla: 'Norelli, Antonio, et al. “ASIF: Coupled Data Turns Unimodal Models to Multimodal
    without Training.” <i>37th Conference on Neural Information Processing Systems</i>,
    vol. 36, Neural Information Processing Systems Foundation, 2023, pp. 15303–19.'
  short: A. Norelli, M. Fumero, V. Maiorca, L. Moschella, E. Rodolà, F. Locatello,
    in: 37th Conference on Neural Information Processing Systems, Neural Information
    Processing Systems Foundation, 2023, pp. 15303–15319.
conference:
  end_date: 2023-12-14
  location: New Orleans, LA, United States
  name: 'NeurIPS: Neural Information Processing Systems'
  start_date: 2023-12-12
corr_author: '1'
date_created: 2023-08-22T14:22:04Z
date_published: 2023-10-04T00:00:00Z
date_updated: 2025-05-14T11:28:52Z
day: '04'
ddc:
- '000'
department:
- _id: FrLo
external_id:
  arxiv:
  - '2210.01738'
file:
- access_level: open_access
  checksum: e51c90300b92d7135050da5c9e3a8015
  content_type: application/pdf
  creator: dernst
  date_created: 2025-02-04T12:16:13Z
  date_updated: 2025-02-04T12:16:13Z
  file_id: '18994'
  file_name: 2023_NeurIPS_Fumero.pdf
  file_size: 12648978
  relation: main_file
  success: 1
file_date_updated: 2025-02-04T12:16:13Z
has_accepted_license: '1'
intvolume: '36'
language:
- iso: eng
month: '10'
oa: 1
oa_version: Preprint
page: 15303-15319
publication: 37th Conference on Neural Information Processing Systems
publication_identifier:
  isbn:
  - '9781713899921'
publication_status: published
publisher: Neural Information Processing Systems Foundation
quality_controlled: '1'
related_material:
  link:
  - relation: software
    url: https://github.com/noranta4/ASIF
status: public
title: 'ASIF: Coupled data turns unimodal models to multimodal without training'
type: conference
user_id: 2DF688A6-F248-11E8-B48F-1D18A9856A87
volume: 36
year: '2023'
...
