{"user_id":"2DF688A6-F248-11E8-B48F-1D18A9856A87","month":"09","_id":"14962","article_processing_charge":"No","language":[{"iso":"eng"}],"publication_status":"submitted","author":[{"last_name":"Fan","full_name":"Fan, Ke","first_name":"Ke"},{"last_name":"Bai","first_name":"Zechen","full_name":"Bai, Zechen"},{"full_name":"Xiao, Tianjun","first_name":"Tianjun","last_name":"Xiao"},{"last_name":"Zietlow","first_name":"Dominik","full_name":"Zietlow, Dominik"},{"last_name":"Horn","first_name":"Max","full_name":"Horn, Max"},{"full_name":"Zhao, Zixu","first_name":"Zixu","last_name":"Zhao"},{"last_name":"Carl-Johann Simon-Gabriel","first_name":"Carl-Johann Simon-Gabriel","full_name":"Carl-Johann Simon-Gabriel, Carl-Johann Simon-Gabriel"},{"last_name":"Shou","full_name":"Shou, Mike Zheng","first_name":"Mike Zheng"},{"last_name":"Locatello","orcid":"0000-0002-4850-0683","id":"26cfd52f-2483-11ee-8040-88983bcc06d4","full_name":"Locatello, Francesco","first_name":"Francesco"},{"first_name":"Bernt","full_name":"Schiele, Bernt","last_name":"Schiele"},{"first_name":"Thomas","full_name":"Brox, Thomas","last_name":"Brox"},{"full_name":"Zhang, Zheng","first_name":"Zheng","last_name":"Zhang"},{"last_name":"Fu","full_name":"Fu, Yanwei","first_name":"Yanwei"},{"last_name":"He","full_name":"He, Tong","first_name":"Tong"}],"main_file_link":[{"open_access":"1","url":"https://doi.org/10.48550/arXiv.2309.09858"}],"date_updated":"2024-02-12T10:12:22Z","type":"preprint","article_number":"2309.09858","status":"public","oa":1,"doi":"10.48550/arXiv.2309.09858","extern":"1","day":"18","abstract":[{"lang":"eng","text":"In this paper, we show that recent advances in video representation learning\r\nand pre-trained vision-language models allow for substantial improvements in\r\nself-supervised video object localization. We propose a method that first\r\nlocalizes objects in videos via a slot attention approach and then assigns text\r\nto the obtained slots. The latter is achieved by an unsupervised way to read\r\nlocalized semantic information from the pre-trained CLIP model. The resulting\r\nvideo object localization is entirely unsupervised apart from the implicit\r\nannotation contained in CLIP, and it is effectively the first unsupervised\r\napproach that yields good results on regular video benchmarks."}],"external_id":{"arxiv":["2309.09858"]},"year":"2023","date_published":"2023-09-18T00:00:00Z","title":"Unsupervised open-vocabulary object localization in videos","date_created":"2024-02-08T15:33:39Z","oa_version":"Preprint","department":[{"_id":"FrLo"}],"publication":"arXiv","citation":{"chicago":"Fan, Ke, Zechen Bai, Tianjun Xiao, Dominik Zietlow, Max Horn, Zixu Zhao, Carl-Johann Simon-Gabriel Carl-Johann Simon-Gabriel, et al. “Unsupervised Open-Vocabulary Object Localization in Videos.” ArXiv, n.d. https://doi.org/10.48550/arXiv.2309.09858.","short":"K. Fan, Z. Bai, T. Xiao, D. Zietlow, M. Horn, Z. Zhao, C.-J.S.-G. Carl-Johann Simon-Gabriel, M.Z. Shou, F. Locatello, B. Schiele, T. Brox, Z. Zhang, Y. Fu, T. He, ArXiv (n.d.).","ieee":"K. Fan et al., “Unsupervised open-vocabulary object localization in videos,” arXiv. .","ista":"Fan K, Bai Z, Xiao T, Zietlow D, Horn M, Zhao Z, Carl-Johann Simon-Gabriel C-JS-G, Shou MZ, Locatello F, Schiele B, Brox T, Zhang Z, Fu Y, He T. Unsupervised open-vocabulary object localization in videos. arXiv, 2309.09858.","ama":"Fan K, Bai Z, Xiao T, et al. Unsupervised open-vocabulary object localization in videos. arXiv. doi:10.48550/arXiv.2309.09858","mla":"Fan, Ke, et al. 
“Unsupervised Open-Vocabulary Object Localization in Videos.” ArXiv, 2309.09858, doi:10.48550/arXiv.2309.09858.","apa":"Fan, K., Bai, Z., Xiao, T., Zietlow, D., Horn, M., Zhao, Z., … He, T. (n.d.). Unsupervised open-vocabulary object localization in videos. arXiv. https://doi.org/10.48550/arXiv.2309.09858"}}