{"month":"10","status":"public","doi":"10.48550/arXiv.2210.01738","day":"04","user_id":"2DF688A6-F248-11E8-B48F-1D18A9856A87","oa_version":"Preprint","external_id":{"arxiv":["2210.01738"]},"corr_author":"1","publication":"arXiv","type":"preprint","citation":{"chicago":"Norelli, Antonio, Marco Fumero, Valentino Maiorca, Luca Moschella, Emanuele Rodolà, and Francesco Locatello. “ASIF: Coupled Data Turns Unimodal Models to Multimodal without Training.” ArXiv, n.d. https://doi.org/10.48550/arXiv.2210.01738.","apa":"Norelli, A., Fumero, M., Maiorca, V., Moschella, L., Rodolà, E., & Locatello, F. (n.d.). ASIF: Coupled data turns unimodal models to multimodal without training. arXiv. https://doi.org/10.48550/arXiv.2210.01738","ista":"Norelli A, Fumero M, Maiorca V, Moschella L, Rodolà E, Locatello F. ASIF: Coupled data turns unimodal models to multimodal without training. arXiv, 2210.01738.","ieee":"A. Norelli, M. Fumero, V. Maiorca, L. Moschella, E. Rodolà, and F. Locatello, “ASIF: Coupled data turns unimodal models to multimodal without training,” arXiv. .","short":"A. Norelli, M. Fumero, V. Maiorca, L. Moschella, E. Rodolà, F. Locatello, ArXiv (n.d.).","ama":"Norelli A, Fumero M, Maiorca V, Moschella L, Rodolà E, Locatello F. ASIF: Coupled data turns unimodal models to multimodal without training. arXiv. doi:10.48550/arXiv.2210.01738","mla":"Norelli, Antonio, et al. “ASIF: Coupled Data Turns Unimodal Models to Multimodal without Training.” ArXiv, 2210.01738, doi:10.48550/arXiv.2210.01738."},"article_number":"2210.01738","article_processing_charge":"No","title":"ASIF: Coupled data turns unimodal models to multimodal without training","year":"2022","_id":"14216","date_created":"2023-08-22T14:22:04Z","abstract":[{"lang":"eng","text":"CLIP proved that aligning visual and language spaces is key to solving many vision tasks without explicit training, but required to train image and text encoders from scratch on a huge dataset. 
LiT improved this by training only the text encoder and using a pre-trained vision network. In this paper, we show that a common space can be created without any training at all, using single-domain encoders (trained with or without supervision) and a much smaller number of image-text pairs. Furthermore, our model has unique properties. Most notably, deploying a new version with updated training samples can be done in a matter of seconds. Additionally, the representations in the common space are easily interpretable, as every dimension corresponds to the similarity of the input to a unique entry in the multimodal dataset. Experiments on standard zero-shot visual benchmarks demonstrate the typical transfer ability of image-text models. Overall, our method represents a simple yet surprisingly strong baseline for foundation multimodal models, raising important questions on their data efficiency and on the role of retrieval in machine learning."}],"date_published":"2022-10-04T00:00:00Z","language":[{"iso":"eng"}],"date_updated":"2024-10-09T21:06:50Z","oa":1,"publication_status":"submitted","author":[{"full_name":"Norelli, Antonio","last_name":"Norelli","first_name":"Antonio"},{"first_name":"Marco","last_name":"Fumero","full_name":"Fumero, Marco"},{"full_name":"Maiorca, Valentino","last_name":"Maiorca","first_name":"Valentino"},{"full_name":"Moschella, Luca","last_name":"Moschella","first_name":"Luca"},{"last_name":"Rodolà","full_name":"Rodolà, Emanuele","first_name":"Emanuele"},{"id":"26cfd52f-2483-11ee-8040-88983bcc06d4","first_name":"Francesco","orcid":"0000-0002-4850-0683","last_name":"Locatello","full_name":"Locatello, Francesco"}],"department":[{"_id":"FrLo"}],"main_file_link":[{"open_access":"1","url":"https://doi.org/10.48550/arXiv.2210.01738"}]}