{"oa_version":"Preprint","title":"ASIF: Coupled data turns unimodal models to multimodal without training","date_created":"2023-08-22T14:22:04Z","publication_status":"submitted","publication":"arXiv","year":"2022","oa":1,"date_published":"2022-10-04T00:00:00Z","article_processing_charge":"No","main_file_link":[{"url":"https://doi.org/10.48550/arXiv.2210.01738","open_access":"1"}],"_id":"14216","language":[{"iso":"eng"}],"abstract":[{"lang":"eng","text":"CLIP proved that aligning visual and language spaces is key to solving many vision tasks without explicit training, but required to train image and text encoders from scratch on a huge dataset. LiT improved this by only training the text encoder and using a pre-trained vision network. In this paper, we show that a common space can be created without any training at all, using single-domain encoders (trained with or without supervision) and a much smaller amount of image-text pairs. Furthermore, our model has unique properties. Most notably, deploying a new version with updated training samples can be done in a matter of seconds. Additionally, the representations in the common space are easily interpretable as every dimension corresponds to the similarity of the input to a unique entry in the multimodal dataset. Experiments on standard zero-shot visual benchmarks demonstrate the typical transfer ability of image-text models. Overall, our method represents a simple yet surprisingly strong baseline for foundation multi-modal models, raising important questions on their data efficiency and on the role of retrieval in machine learning."}],"month":"10","article_number":"2210.01738","citation":{"ista":"Norelli A, Fumero M, Maiorca V, Moschella L, Rodolà E, Locatello F. ASIF: Coupled data turns unimodal models to multimodal without training. arXiv, 2210.01738.","ieee":"A. Norelli, M. Fumero, V. Maiorca, L. Moschella, E. Rodolà, and F. Locatello, “ASIF: Coupled data turns unimodal models to multimodal without training,” arXiv.","ama":"Norelli A, Fumero M, Maiorca V, Moschella L, Rodolà E, Locatello F. ASIF: Coupled data turns unimodal models to multimodal without training. arXiv. doi:10.48550/arXiv.2210.01738","short":"A. Norelli, M. Fumero, V. Maiorca, L. Moschella, E. Rodolà, F. Locatello, ArXiv (n.d.).","chicago":"Norelli, Antonio, Marco Fumero, Valentino Maiorca, Luca Moschella, Emanuele Rodolà, and Francesco Locatello. “ASIF: Coupled Data Turns Unimodal Models to Multimodal without Training.” ArXiv, n.d. https://doi.org/10.48550/arXiv.2210.01738.","mla":"Norelli, Antonio, et al. “ASIF: Coupled Data Turns Unimodal Models to Multimodal without Training.” ArXiv, 2210.01738, doi:10.48550/arXiv.2210.01738.","apa":"Norelli, A., Fumero, M., Maiorca, V., Moschella, L., Rodolà, E., & Locatello, F. (n.d.). ASIF: Coupled data turns unimodal models to multimodal without training. arXiv. https://doi.org/10.48550/arXiv.2210.01738"},"department":[{"_id":"FrLo"}],"status":"public","author":[{"last_name":"Norelli","full_name":"Norelli, Antonio","first_name":"Antonio"},{"full_name":"Fumero, Marco","last_name":"Fumero","first_name":"Marco"},{"last_name":"Maiorca","full_name":"Maiorca, Valentino","first_name":"Valentino"},{"full_name":"Moschella, Luca","last_name":"Moschella","first_name":"Luca"},{"full_name":"Rodolà, Emanuele","last_name":"Rodolà","first_name":"Emanuele"},{"last_name":"Locatello","first_name":"Francesco","full_name":"Locatello, Francesco","id":"26cfd52f-2483-11ee-8040-88983bcc06d4","orcid":"0000-0002-4850-0683"}],"doi":"10.48550/arXiv.2210.01738","date_updated":"2024-02-12T09:57:14Z","day":"04","user_id":"2DF688A6-F248-11E8-B48F-1D18A9856A87","external_id":{"arxiv":["2210.01738"]},"type":"preprint"}