Unsupervised open-vocabulary object localization in videos
Fan K, Bai Z, Xiao T, Zietlow D, Horn M, Zhao Z, Simon-Gabriel C-J, Shou MZ, Locatello F, Schiele B, Brox T, Zhang Z, Fu Y, He T. Unsupervised open-vocabulary object localization in videos. arXiv, 2309.09858.
https://doi.org/10.48550/arXiv.2309.09858
Preprint | Submitted | English
Author
Fan, Ke;
Bai, Zechen;
Xiao, Tianjun;
Zietlow, Dominik;
Horn, Max;
Zhao, Zixu;
Simon-Gabriel, Carl-Johann;
Shou, Mike Zheng;
Locatello, Francesco (ISTA);
Schiele, Bernt;
Brox, Thomas;
Zhang, Zheng
Abstract
In this paper, we show that recent advances in video representation learning
and pre-trained vision-language models allow for substantial improvements in
self-supervised video object localization. We propose a method that first
localizes objects in videos via a slot attention approach and then assigns text
to the obtained slots. The latter is achieved by an unsupervised way to read
localized semantic information from the pre-trained CLIP model. The resulting
video object localization is entirely unsupervised apart from the implicit
annotation contained in CLIP, and it is effectively the first unsupervised
approach that yields good results on regular video benchmarks.
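The second step described in the abstract, assigning text to slots, can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each slot has already been reduced to a feature vector living in the same embedding space as a set of text embeddings, and it simply picks the label with the highest cosine similarity. The names `cosine`, `label_slots`, `slots`, and `texts`, as well as the toy vectors, are hypothetical stand-ins; a real pipeline would obtain these features from a slot attention model and from CLIP.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def label_slots(slot_features, text_embeddings):
    """Assign each slot the label whose embedding is most similar to it.

    slot_features: list of vectors, one per slot (hypothetical stand-ins).
    text_embeddings: dict mapping label -> vector (hypothetical stand-ins).
    """
    labels = []
    for feat in slot_features:
        best = max(
            text_embeddings,
            key=lambda name: cosine(feat, text_embeddings[name]),
        )
        labels.append(best)
    return labels


# Toy vectors, made up for illustration only.
slots = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]
texts = {"dog": [1.0, 0.0, 0.0], "car": [0.0, 0.0, 1.0]}
assigned = label_slots(slots, texts)
print(assigned)
```

Because the label set is just a dictionary of text embeddings, it can be swapped at inference time without retraining, which is the sense in which such a scheme is open-vocabulary.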
Publishing Year
2023
Date Published
2023-09-18
Journal Title
arXiv
Article Number
2309.09858
Cite this
Fan K, Bai Z, Xiao T, et al. Unsupervised open-vocabulary object localization in videos. arXiv. doi:10.48550/arXiv.2309.09858
Fan, K., Bai, Z., Xiao, T., Zietlow, D., Horn, M., Zhao, Z., … He, T. (2023). Unsupervised open-vocabulary object localization in videos. arXiv. https://doi.org/10.48550/arXiv.2309.09858
Fan, Ke, Zechen Bai, Tianjun Xiao, Dominik Zietlow, Max Horn, Zixu Zhao, Carl-Johann Simon-Gabriel, et al. “Unsupervised Open-Vocabulary Object Localization in Videos.” ArXiv, 2023. https://doi.org/10.48550/arXiv.2309.09858.
K. Fan et al., “Unsupervised open-vocabulary object localization in videos,” arXiv, 2023.
Fan K, Bai Z, Xiao T, Zietlow D, Horn M, Zhao Z, Simon-Gabriel C-J, Shou MZ, Locatello F, Schiele B, Brox T, Zhang Z, Fu Y, He T. Unsupervised open-vocabulary object localization in videos. arXiv, 2309.09858.
Fan, Ke, et al. “Unsupervised Open-Vocabulary Object Localization in Videos.” ArXiv, 2309.09858, doi:10.48550/arXiv.2309.09858.
All files available under the following license(s):
Copyright Statement:
This Item is protected by copyright and/or related rights. [...]
Link(s) to Main File(s)
Access Level
Open Access
Sources
arXiv 2309.09858