Near, far: Patch-ordering enhances vision foundation models' scene understanding

Pariza V, Salehi M, Burghouts G, Locatello F, Asano YM. 2025. Near, far: Patch-ordering enhances vision foundation models’ scene understanding. 13th International Conference on Learning Representations. ICLR: International Conference on Learning Representations, 72303–72330.

Download
OA 2025_ICLR_Pariza.pdf 37.79 MB [Published Version]
Conference Paper | Published | English

Scopus indexed
Author
Pariza, Valentinos; Salehi, Mohammadreza; Burghouts, Gertjan; Locatello, Francesco (ISTA); Asano, Yuki M.
Department
Abstract
We introduce NeCo: Patch Neighbor Consistency, a novel self-supervised training loss that enforces patch-level nearest-neighbor consistency across a student and a teacher model. Compared to contrastive approaches that only yield binary learning signals, i.e., "attract" and "repel", this approach benefits from the more fine-grained learning signal of sorting spatially dense features relative to reference patches. Our method leverages differentiable sorting applied on top of pretrained representations, such as DINOv2-registers, to bootstrap the learning signal and further improve upon them. This dense post-pretraining leads to superior performance across various models and datasets, despite requiring only 19 hours on a single GPU. The method generates high-quality dense feature encoders and establishes several new state-of-the-art results, such as +2.3% and +4.2% for non-parametric in-context semantic segmentation on ADE20k and Pascal VOC, +1.6% and +4.8% for linear segmentation evaluations on COCO-Things and COCO-Stuff, and an improvement of more than 1.5% in the 3D understanding of multi-view consistency on SPair-71k.
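The idea of comparing patch orderings rather than binary attract/repel pairs can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say which differentiable sorting operator NeCo uses, so the example substitutes a simple pairwise-sigmoid soft rank, and every name in it (`cosine`, `soft_rank`, `neighbor_consistency_loss`, `tau`) is hypothetical. It uses plain Python lists where a real setup would use tensors and backpropagate through the student.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def soft_rank(scores, tau=0.1):
    """Smooth descending rank: 1 + sum_{j != i} sigmoid((s_j - s_i) / tau).

    Assumption: a stand-in relaxation, not the paper's sorting operator.
    As tau shrinks, the values approach the hard ranks 1..N.
    """
    def sigmoid(x):
        # tanh form of the logistic function; avoids overflow for large |x|.
        return 0.5 * (1.0 + math.tanh(0.5 * x))

    return [
        1.0 + sum(sigmoid((sj - si) / tau) for j, sj in enumerate(scores) if j != i)
        for i, si in enumerate(scores)
    ]


def neighbor_consistency_loss(student, teacher, refs_student, refs_teacher, tau=0.1):
    """Mean squared gap between student and teacher soft ranks.

    For each reference patch, all patches are ranked by similarity to it,
    once in the student's feature space and once in the teacher's.
    """
    total, count = 0.0, 0
    for r_s, r_t in zip(refs_student, refs_teacher):
        s_rank = soft_rank([cosine(r_s, p) for p in student], tau)
        t_rank = soft_rank([cosine(r_t, p) for p in teacher], tau)
        total += sum((a - b) ** 2 for a, b in zip(s_rank, t_rank))
        count += len(s_rank)
    return total / count


# Toy patch features (hypothetical values): three patches, two reference patches.
teacher = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
refs = [[1.0, 0.1], [0.1, 1.0]]
print(neighbor_consistency_loss(teacher, teacher, refs, refs))        # identical orderings
print(neighbor_consistency_loss(teacher[::-1], teacher, refs, refs))  # reversed orderings
```

Identical student and teacher features give a loss of zero, while a student whose patches are ordered differently is penalized in proportion to how far each patch moved in the ranking, which is a graded signal rather than a binary one.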
Publishing Year
2025
Date Published
2025-04-01
Proceedings Title
13th International Conference on Learning Representations
Publisher
ICLR
Page
72303-72330
Conference
ICLR: International Conference on Learning Representations
Conference Location
Singapore, Singapore
Conference Date
2025-04-24 – 2025-04-28
IST-REx-ID

Cite this

Pariza V, Salehi M, Burghouts G, Locatello F, Asano YM. Near, far: Patch-ordering enhances vision foundation models’ scene understanding. In: 13th International Conference on Learning Representations. ICLR; 2025:72303-72330.
Pariza, V., Salehi, M., Burghouts, G., Locatello, F., & Asano, Y. M. (2025). Near, far: Patch-ordering enhances vision foundation models’ scene understanding. In 13th International Conference on Learning Representations (pp. 72303–72330). Singapore, Singapore: ICLR.
Pariza, Valentinos, Mohammadreza Salehi, Gertjan Burghouts, Francesco Locatello, and Yuki M. Asano. “Near, Far: Patch-Ordering Enhances Vision Foundation Models’ Scene Understanding.” In 13th International Conference on Learning Representations, 72303–30. ICLR, 2025.
V. Pariza, M. Salehi, G. Burghouts, F. Locatello, and Y. M. Asano, “Near, far: Patch-ordering enhances vision foundation models’ scene understanding,” in 13th International Conference on Learning Representations, Singapore, Singapore, 2025, pp. 72303–72330.
Pariza, Valentinos, et al. “Near, Far: Patch-Ordering Enhances Vision Foundation Models’ Scene Understanding.” 13th International Conference on Learning Representations, ICLR, 2025, pp. 72303–30.
All files available under the following license(s):
Creative Commons Attribution 4.0 International Public License (CC-BY 4.0)
Main File(s)
File Name
Access Level
OA Open Access
Date Uploaded
2025-08-04
MD5 Checksum
ddbe981f3ad3f6cb6daf12c954822eb8


Sources

arXiv 2408.11054
