{"month":"10","status":"public","doi":"10.48550/arXiv.2210.01738","day":"04","user_id":"2DF688A6-F248-11E8-B48F-1D18A9856A87","oa_version":"Preprint","external_id":{"arxiv":["2210.01738"]},"corr_author":"1","publication":"arXiv","type":"preprint","citation":{"chicago":"Norelli, Antonio, Marco Fumero, Valentino Maiorca, Luca Moschella, Emanuele Rodolà, and Francesco Locatello. “ASIF: Coupled Data Turns Unimodal Models to Multimodal without Training.” ArXiv, n.d. https://doi.org/10.48550/arXiv.2210.01738.","apa":"Norelli, A., Fumero, M., Maiorca, V., Moschella, L., Rodolà, E., & Locatello, F. (n.d.). ASIF: Coupled data turns unimodal models to multimodal without training. arXiv. https://doi.org/10.48550/arXiv.2210.01738","ista":"Norelli A, Fumero M, Maiorca V, Moschella L, Rodolà E, Locatello F. ASIF: Coupled data turns unimodal models to multimodal without training. arXiv, 2210.01738.","ieee":"A. Norelli, M. Fumero, V. Maiorca, L. Moschella, E. Rodolà, and F. Locatello, “ASIF: Coupled data turns unimodal models to multimodal without training,” arXiv. .","short":"A. Norelli, M. Fumero, V. Maiorca, L. Moschella, E. Rodolà, F. Locatello, ArXiv (n.d.).","ama":"Norelli A, Fumero M, Maiorca V, Moschella L, Rodolà E, Locatello F. ASIF: Coupled data turns unimodal models to multimodal without training. arXiv. doi:10.48550/arXiv.2210.01738","mla":"Norelli, Antonio, et al. “ASIF: Coupled Data Turns Unimodal Models to Multimodal without Training.” ArXiv, 2210.01738, doi:10.48550/arXiv.2210.01738."},"article_number":"2210.01738","article_processing_charge":"No","title":"ASIF: Coupled data turns unimodal models to multimodal without training","year":"2022","_id":"14216","date_created":"2023-08-22T14:22:04Z","abstract":[{"lang":"eng","text":"CLIP proved that aligning visual and language spaces is key to solving many vision tasks without explicit training, but required to train image and text encoders from scratch on a huge dataset. 
LiT improved this by training only the text encoder and using a pre-trained vision network. In this paper, we show that a common space can be created without any training at all, using single-domain encoders (trained with or without supervision) and a much smaller number of image-text pairs. Furthermore, our model has unique properties. Most notably, deploying a new version with updated training samples can be done in a matter of seconds. Additionally, the representations in the common space are easily interpretable, as every dimension corresponds to the similarity of the input to a unique entry in the multimodal dataset. Experiments on standard zero-shot visual benchmarks demonstrate the typical transfer ability of image-text models. Overall, our method represents a simple yet surprisingly strong baseline for foundation multimodal models, raising important questions on their data efficiency and on the role of retrieval in machine learning."}],"date_published":"2022-10-04T00:00:00Z","language":[{"iso":"eng"}],"date_updated":"2024-10-09T21:06:50Z","oa":1,"publication_status":"submitted","author":[{"full_name":"Norelli, Antonio","last_name":"Norelli","first_name":"Antonio"},{"first_name":"Marco","last_name":"Fumero","full_name":"Fumero, Marco"},{"full_name":"Maiorca, Valentino","last_name":"Maiorca","first_name":"Valentino"},{"full_name":"Moschella, Luca","last_name":"Moschella","first_name":"Luca"},{"last_name":"Rodolà","full_name":"Rodolà, Emanuele","first_name":"Emanuele"},{"id":"26cfd52f-2483-11ee-8040-88983bcc06d4","first_name":"Francesco","orcid":"0000-0002-4850-0683","last_name":"Locatello","full_name":"Locatello, Francesco"}],"department":[{"_id":"FrLo"}],"main_file_link":[{"open_access":"1","url":"https://doi.org/10.48550/arXiv.2210.01738"}]}